# Linear regression analysis

In statistics, linear regression is a linear approach to modelling the relationship between a scalar dependent variable y and one or more independent (explanatory) variables x. The case of one explanatory variable is called simple linear regression.
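For the simple case, the model can be written as below. The symbols β0, β1 and ε are notation introduced here for illustration; they are not taken from the tool itself.

$$
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad i = 1, \dots, n
$$

Here β0 is the intercept, β1 is the slope, and ε_i is the error term for sample i.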

In GWAS, one of the most common applications is:

Given a variable y and a variable x that may be related to y (x and y can be copy number variation, mRNA expression values and so on), linear regression analysis can be applied to quantify the strength of the relationship between y and x, and to assess whether x has no relationship with y at all or carries redundant information about y.

Least squares, ridge regression, lasso and other methods can be used to fit linear regression models.

Least squares is the standard approach in linear regression. “Least squares” means that the overall solution minimizes the sum of the squares of the residuals of every single equation.

The best fit in the least-squares sense minimizes the sum of squared residuals, where a residual is the difference between an observed value and the fitted value provided by the model. When the problem has substantial uncertainties in the independent variable (the x variable), simple regression and least-squares methods have problems; in such cases, the methodology for fitting errors-in-variables models may be considered instead of least squares. For this reason, more advanced least-squares methods have been developed.
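The least-squares fit described above can be sketched in a few lines for the single-predictor case. This is a minimal illustration, not the tool's implementation: the function name `fit_simple_ols` and the sample values are hypothetical.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sums of squares and cross-products around the means.
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x must not be constant")
    # Closed-form solution that minimizes the sum of squared residuals.
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Residual = observed value minus fitted value.
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    rss = sum(r ** 2 for r in residuals)
    return intercept, slope, rss


# Hypothetical values, for illustration only.
x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]
intercept, slope, rss = fit_simple_ols(x, y)
```

Any other choice of intercept and slope would give a larger residual sum of squares (`rss`) on the same data.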

The Pearson correlation coefficient (PCC), also referred to as Pearson's r, is one of the most widely used measures of the linear correlation between two variables X and Y. It has a value between +1 and −1, where +1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. It was developed by Karl Pearson from a related idea introduced by Francis Galton in the 1880s.
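As a rough sketch, Pearson's r can be computed directly from its definition: the sum of cross-products of the centered values divided by the square root of the product of their sums of squares. The function name `pearson_r` is hypothetical and the code is illustrative only.

```python
import math


def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    syy = sum((yi - y_mean) ** 2 for yi in y)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when either variable is constant")
    return sxy / math.sqrt(sxx * syy)
```

For simple linear regression with an intercept, the square of r equals the coefficient of determination of the least-squares fit.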

## Overview

There are three ways to get to the Linear regression analysis page.

→1) From the navigation bar on the Home page, select “Linear regression analysis” under “Data Analysis”;

→2) Go to the “Data Analysis” page, then go to the “Data mining” area and select “Linear regression analysis”;

→3) Follow the link in the “Link area” on the Home page to the “Data Analysis” page, then go to the “Data mining” area and select “Linear regression analysis”.

The “Linear regression analysis” page has five areas:

→ Navigation bar: You can switch to other pages through this navigation bar.

→ Setting area: You can specify genes, cancer types, data types, cutoff values and other parameter details here.

→ Plotting area: The linear model will be plotted in this area.

→ Figure Downloading and DIY area: You can download the linear model plot in a chosen format and size. You can also customize the line color and other properties with the option buttons in this area.

→ Link area: Necessary links are available for you to switch to other pages or websites.

*Note: quick help is available by hovering your mouse over the small question marks beside certain options on this page.*

## Setting area

1. It reminds you which kind of data mining analysis you are working on.

2. You can select the cancer type of interest from the drop-down list.

3. In the TCGA/GDC dataset, non-malignant and tumor samples are not always both available for every cancer type. The available sample types vary between data types, even for the same cancer type. For example, for acute myeloid leukemia (LAML), no non-malignant samples with mRNA expression values are available, but both non-malignant and tumor samples are available for copy number variation data. Legends are shown before the cancer names to tell you which kinds of samples are available for each cancer type.

⚠: without non-malignant, which means that only tumor samples of this cancer type are available for the data type specified in (4) and (5).

❌: not available, which means that neither tumor samples nor non-malignant samples of this cancer type are available for the data type specified in (4) and (5).

4. You can specify the first sample group, and whether to apply a log2 transformation, through the drop-down list.

5. You can specify the second sample group, and whether to apply a log2 transformation, through the drop-down list.

*Note: The linear regression analysis needs samples from two different groups. If you select the same cancer type for the first and second group, please make sure their sample types are different. Otherwise, an error message will be displayed in the plotting area and no linear model plot will be created.*

6. You can enter the gene symbol of interest here, one gene at a time. Only HUGO (Human Genome Organization) symbols are accepted, for example: EGFR, KRAS, TP63…

*Note: The linear regression analysis needs samples from two different groups, so please make sure their sample types are different. Otherwise, an error message will be displayed in the plotting area and no linear model plot will be created. Only matched samples in these two groups are used for the linear regression analysis.*

7. You can specify which sample type to use.

After setting all the necessary options, click the “GO” button at the bottom of this area, and a linear model will be created in the plotting area. Because of the large data size and the time needed for the significance test, the linear regression may take seconds or minutes. The processing time varies with your Internet connection speed and the configuration of your computer.
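The optional log2 transformation and the restriction to matched samples can be sketched as follows. This is only an illustration under stated assumptions: "matched" is taken to mean that a sample ID appears in both groups, a pseudo-count of 1 is added before taking log2, and the function name, sample IDs and values are all hypothetical. The tool's actual matching rule and transformation may differ.

```python
import math


def matched_pairs(group_x, group_y, log2_x=False, log2_y=False, pseudo=1.0):
    """Pair values from two groups by shared sample ID, optionally log2-transformed.

    group_x and group_y map sample ID -> value. The pseudo-count added
    before log2 is an assumption, not the tool's documented behavior.
    """
    # Keep only sample IDs present in both groups (assumed meaning of "matched").
    shared_ids = sorted(set(group_x) & set(group_y))
    pairs = []
    for sample_id in shared_ids:
        x_value = group_x[sample_id]
        y_value = group_y[sample_id]
        if log2_x:
            x_value = math.log2(x_value + pseudo)
        if log2_y:
            y_value = math.log2(y_value + pseudo)
        pairs.append((sample_id, x_value, y_value))
    return pairs


# Hypothetical sample IDs and values, for illustration only.
mrna = {"S1": 3.0, "S2": 7.0, "S3": 15.0}
cnv = {"S2": 2.0, "S3": 2.5, "S4": 1.8}
pairs = matched_pairs(mrna, cnv, log2_x=True)
```

Samples that appear in only one group (`S1` and `S4` here) are dropped, so the x and y values passed on to the regression always come from the same samples.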

## Plotting area

Linear model plots will be shown in this area.

A toolbar will show up at the top right of this plotting area when a linear model is created.

1. Zoom in: Rectangular zoom-in tool. This tool allows you to select a region to display at full size. After clicking this button, your mouse pointer will turn into a small cross. Then click and hold the left mouse button and drag a rectangle around the portion of the plot you want to zoom in on.

2. Zoom out: Click it to return to the zoom level of the previous step.

3. Restore: Show the plot in its original proportions.

4. Save as Image: Click it to switch to an image-saving page, then right-click the image to save it. You can also specify the image format and size with the options in the Figure downloading and DIY area.

5. Data table: Click this button to get the sample data as a table. A table containing all the data will then show up in the plotting area. You can select the whole table or any part of it and copy it into a Word or Excel file with the right mouse button, as you usually do. You can scroll down to see the information for other samples. You can also click the “close” button at the bottom left of this page to close the table and go back to the default page with the plotting area.

For your convenience, the sample ID and other details of each individual sample will show up when you put your mouse on the corresponding marker.

For example, in the above figure, a tooltip shows up after putting the mouse on a marker.

From left to right, it lists the x-axis value, the y-axis value and the sample ID of this sample in the corresponding two groups. In this example, the x-axis value (mRNA expression value) is 466.19479, the y-axis value (copy number variation) is 2.0312 and the sample ID is TCGA-AB-2984-03 in LAML.

## Figure downloading and DIY area

### 1. Figure Downloading area:

You can specify the image format (png or jpg) and the size/dimensions of the image to download.

### 2. Figure DIY area:

You can modify the colors and Y-limits of this figure.