Monday, 29 April 2013

Simple linear regression in R

Simple linear regression
'Simple linear regression' is where you have one predictor variable, and one continuous response variable.

Assumptions of simple linear regression
There are 4 assumptions:
(i) that the response depends on the predictor in a linear fashion (linearity in x, as well as in the slope and intercept parameters)
(ii) that the residuals have a Normal distribution
(iii) that the residuals have the same variance for all values of the predictor
(iv) that the residuals for different data points are independent of each other
You can check assumptions (i) and (iii) by plotting the response against the predictor in a scatterplot (see below).

Fitting the least squares regression line
It is easy to do linear regression in R, for example:
> x <- c(75, 76, 77, 78, 79, 80)
> y <- c(120.5, 126.7, 129.1, 143.7, 138.7, 148.3)
> fit <- lm(y ~ x)
> summary(fit)
Call:
lm(formula = y ~ x)

Residuals:
      1       2       3       4       5       6
-0.4571  0.3257 -2.6914  6.4914 -3.9257  0.2571

Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) -285.3286    74.7994  -3.815  0.01887 *
x              5.4171     0.9649   5.614  0.00495 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.037 on 4 degrees of freedom
Multiple R-squared: 0.8874,    Adjusted R-squared: 0.8592
F-statistic: 31.52 on 1 and 4 DF,  p-value: 0.004947 


This tells us that the formula of the fitted line is: y = 5.4171x - 285.3286.
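
The coefficients can also be pulled out of the fitted model object directly, which is handy for checking a fitted value by hand; for example:
> coef(fit)                        # intercept and slope of the fitted line
> coef(fit)[1] + coef(fit)[2]*77   # fitted y at x = 77, the same as fitted(fit)[3]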

The least squares regression line through the origin
In R, a regression has an implied intercept term.  In some cases we may want to fit a straight line that goes through the origin. To do this, we fit the model 'y ~ x + 0', for example:
> x <- c(9.5, 9.8, 5.0, 19.0, 23.0, 14.6, 15.2, 8.3, 11.4, 21.6, 11.8, 26.5, 12.1, 4.8, 22.0, 21.7, 28.2, 18.0, 12.1, 28.0)
> y <- c(10.7, 11.7, 6.5, 25.6, 29.4, 16.3, 17.2, 9.5, 18.4, 28.8, 19.7, 31.2, 16.6, 6.5, 29.0, 25.7, 40.5, 26.5, 14.2, 33.1)
> fit <- lm(y ~ x + 0)
> summary(fit)
Call:
lm(formula = y ~ x + 0)

Residuals:
   Min     1Q Median     3Q    Max
-2.994 -1.728 -0.097  1.029  4.489

Coefficients:
  Estimate   Std. Error t value Pr(>|t|)   
x  1.28907    0.03012    42.8   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.376 on 19 degrees of freedom
Multiple R-squared: 0.9897,    Adjusted R-squared: 0.9892
F-statistic:  1832 on 1 and 19 DF,  p-value: < 2.2e-16  


This tells us that the formula of the fitted line is: y = 1.28907x
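
As a quick sanity check, the least-squares slope for a line through the origin is just sum(x*y)/sum(x^2), so we can reproduce the estimate by hand:
> sum(x * y) / sum(x^2)   # should match the estimated slope of 1.28907
> coef(fit)               # the fitted model contains a slope but no intercept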

Plotting the data, with the regression line and confidence intervals
Let's plot the data with the regression line:
> plot(x, y)
> abline(fit)

[Plot: scatterplot of x and y with the fitted regression line]

We can show a 95% confidence interval for the predicted line by typing:
> ci <- predict(fit, data.frame(x), interval="confidence")
> upper <- as.data.frame(ci)$upr
> lower <- as.data.frame(ci)$lwr
> oo <- order(x)
> points(x[oo], upper[oo], type="l", lty=2, col="green")
> points(x[oo], lower[oo], type="l", lty=2, col="green")
[Note: lty=2 makes a dashed line.]

[Plot: the same scatterplot with the fitted line and dashed 95% confidence bands]

Oh so pretty!
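
The same approach can be used to add a 95% prediction interval band, which will sit outside the confidence band:
> pred <- predict(fit, data.frame(x), interval="prediction")
> points(x[oo], as.data.frame(pred)$upr[oo], type="l", lty=3, col="red")
> points(x[oo], as.data.frame(pred)$lwr[oo], type="l", lty=3, col="red")
[Note: lty=3 makes a dotted line; the colour is just my choice.]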

Checking the assumption that the residuals are Normally distributed
In order to use the fitted model to make inferences, test hypotheses, produce confidence intervals, etc., it is common to assume that the residuals are Normally distributed. It's a good idea to check whether this assumption is valid, by making a Normal probability plot of the residuals (using the function NormalProbPlot() defined on the page I've just linked to):
> NormalProbPlot(fit$residuals)

[Plot: Normal probability plot of the raw residuals]

Here the points lie reasonably close to a straight line, so it seems plausible that the residuals are Normally distributed.
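
If you don't have NormalProbPlot() to hand, base R's qqnorm() and qqline() give essentially the same picture (a rough substitute, not the exact plot above):
> qqnorm(fit$residuals)   # residuals against theoretical Normal quantiles
> qqline(fit$residuals)   # reference line through the quartiles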

The residuals in fit$residuals are the raw residuals, but it is common to use the 'standardised deviance residuals', which for a simple linear regression are just the raw residuals divided by their estimated standard errors. The reason for doing this is to calibrate the residuals so that each has approximately a standard Normal distribution. In R, the standardised deviance residuals can be retrieved using rstandard():
> NormalProbPlot(rstandard(fit))
[Note this is for a different data set than the plot above]

[Plot: Normal probability plot of the standardised deviance residuals]

In the plot above, the theoretical Normal quantiles are plotted against the standardised residuals, but many people plot it the other way around (residuals against Normal quantiles):
> NormalProbPlot2(rstandard(fit))

[Plot: the same Normal probability plot with the axes swapped]

Another type of Normal plot is a 'half-Normal plot', which consists of the negative half of the Normal probability plot superimposed on the positive half; that is, it is a Normal probability plot of the absolute values of the residuals:
> HalfNormalProbPlot(rstandard(fit))

[Plot: half-Normal probability plot of the standardised residuals]

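HalfNormalProbPlot() is another function defined on the linked page; if you don't have it, a rough base-R sketch of the same idea (my own quick version, not the function used above) is:
> res <- rstandard(fit)
> n <- length(res)
> q <- qnorm((n + 1:n + 1) / (2*n + 2))   # expected quantiles of |Z| for Z ~ N(0,1)
> plot(q, sort(abs(res)), xlab="Half-Normal quantiles", ylab="|standardised residuals|")
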
Checking the assumption that the residuals have zero mean and constant variance
An assumption of linear regression is that the residuals have zero mean and constant variance. To check this assumption, it is common to make a residual plot of the observed residuals against the fitted values. Here I've added a loess-smoothed line showing the general trend of the data:
> plot(fit$fitted.values, fit$residuals)
> lines(loess.smooth(fit$fitted.values, fit$residuals), col="blue", lty=2, lwd=2)
> abline(a=0,b=0)

[Plot: residuals against fitted values, with a loess-smoothed trend line]

The residuals should be scattered about zero in a random, unpatterned fashion. The residual plot above shows no strong pattern: the points seem to be randomly scattered around zero. They are perhaps a bit higher for very low and very high fitted values than for medium fitted values, but the trend isn't strong. This suggests that it is appropriate to model these data using linear regression.

Note you can use rstandard(fit) instead of fit$residuals to make this plot with the standardised deviance residuals instead of the raw residuals.
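
In fact, for a simple linear regression you can verify that rstandard() is just the raw residuals divided by their estimated standard errors:
> rstandard(fit)
> residuals(fit) / (summary(fit)$sigma * sqrt(1 - hatvalues(fit)))   # same values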

Making a confidence interval, or prediction interval, for the response (y) for a particular (unseen) x value
Using the fitted line, we can make a confidence interval for the mean response (y) value at a particular (unseen) x value. For example, going back to the first example above (the fit with an intercept), to get a 90% confidence interval for the mean y value at an x-value of 85, we can type:
> predict.lm(fit, newdata=data.frame(x=85), interval="confidence", level=0.90)
       fit      lwr      upr
1 175.1286 159.3057 190.9515
This tells us that the 90% confidence interval for the mean is (159.3, 191.0).

To make a prediction interval instead of a confidence interval, for the response (y) at a particular x value, we can type:
> predict.lm(fit, newdata=data.frame(x=85), interval="prediction", level=0.90)
       fit      lwr      upr
1 175.1286 157.1171 193.1401
This tells us that the 90% prediction interval for y when x is 85 is (157.1, 193.1).

Note that the point estimate (175.1286) is the same for the confidence interval and the prediction interval. The prediction interval is wider, however, because it is for the response value (y) itself, rather than for the mean response, at a particular x. That is, the confidence interval accounts only for the sampling variability of the regression line, while the prediction interval accounts for both the sampling variability of the line and the variability of y (for a particular x) about its mean value. In other words, the confidence interval is an interval for alpha + 85*beta, where alpha is the intercept of the regression line and beta is its slope; the prediction interval is an interval for alpha + 85*beta + epsilon, where epsilon is the random error around the regression line.
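
If you want to see where the numbers come from, both intervals can be computed by hand from the usual formulas (here using the first example's x, y and fit, with x0 = 85):
> x0 <- 85
> n <- length(x)
> s <- summary(fit)$sigma                            # residual standard error, 4.037
> Sxx <- sum((x - mean(x))^2)
> yhat <- coef(fit)[1] + coef(fit)[2]*x0             # point estimate, 175.1286
> se.ci <- s * sqrt(1/n + (x0 - mean(x))^2/Sxx)      # standard error of the mean response
> se.pi <- s * sqrt(1 + 1/n + (x0 - mean(x))^2/Sxx)  # standard error for a single response
> yhat + c(-1, 1) * qt(0.95, df=n-2) * se.ci         # 90% confidence interval
> yhat + c(-1, 1) * qt(0.95, df=n-2) * se.pi         # 90% prediction interval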

The analysis of variance table for the regression
We can get the 'analysis of variance' table for the regression by typing:
> anova(fit)
Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value   Pr(>F)   
x          1 513.55  513.55  31.518 0.004947 **
Residuals  4  65.17   16.29                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This tells us that the residual sum of squares (sum of the squares of the residuals) is 65.17, where each residual is the difference between a y-value in the data set and the predicted y-value according to the regression line (for the corresponding x-value). This measures the amount of variability remaining in the response (y) once the regression model is fitted.

The regression sum of squares is 513.55. This is the amount of variation in the response (y) that is accounted for by the linear relationship between the mean response and the predictor variable (x).

The table also gives us the mean square of the residuals as (Sum squares for Residuals)/(Df for Residuals) = 65.17/4 = 16.29. This is the estimate of the variance of the residuals (error variance or residual variance). Note that the output of the 'summary(fit)' command also gave us:
Residual standard error: 4.037 on 4 degrees of freedom
This tells us that the standard deviation of the residuals is 4.037, which is just the square root of the estimated variance of the residuals (16.29).
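
All of these quantities can be reproduced directly from the fitted model:
> sum(fit$residuals^2)                      # residual sum of squares, 65.17
> sum((fitted(fit) - mean(y))^2)            # regression sum of squares, 513.55
> sum(fit$residuals^2) / df.residual(fit)   # mean square of the residuals, 16.29
> summary(fit)$sigma                        # residual standard error, sqrt(16.29) = 4.037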

Testing the hypothesis that the slope of the regression line is zero
The output of the command 'summary(fit)' included this:
Coefficients:
             Estimate Std. Error t value Pr(>|t|)  
(Intercept) -285.3286    74.7994  -3.815  0.01887 *
x              5.4171     0.9649   5.614  0.00495 **

This uses a t-test to tell us that the P-value for the test of the hypothesis that the slope parameter is 0 is 0.00495. So there is strong evidence against the null hypothesis that the slope is 0 (that x has no effect on y).

Alternatively, we can look at the analysis of variance table. The output of the command 'anova(fit)' included the lines:
          Df Sum Sq Mean Sq F value   Pr(>F)   
x          1 513.55  513.55  31.518 0.004947 **
Residuals  4  65.17   16.29                    
This uses an F-test to test the same hypothesis. It tells us that the test statistic used was the F-statistic: (Mean square for regression)/(Mean square for residuals) = 513.55/16.29 = 31.518. Under the null hypothesis that the slope is zero, this test statistic has an F distribution with degrees of freedom Df(Regression) = 1 and Df(Residuals) = 4. The P-value for the F-test is 0.004947, which we could have calculated in R using:
> 1 - pf(31.518, 1, 4)
[1] 0.004946915

For simple linear regression, the P-values from the t-test and the F-test always agree exactly (here we get 0.00495 for both).
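
This is because, with a single predictor, the F-statistic is just the square of the t-statistic:
> tval <- summary(fit)$coefficients["x", "t value"]
> tval^2                      # equals the F-statistic (5.614^2 is about 31.52)
> 2 * pt(-abs(tval), df=4)    # the two-sided t-test P-value, same as the F-test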

Note that the 'summary(fit)' command also actually gives the F-statistic and result of the F-test:
F-statistic: 31.52 on 1 and 4 DF,  p-value: 0.004947 

Getting confidence intervals for the intercept and slope
To get 95% confidence intervals for the intercept and slope of the regression line, you can type:
> confint(fit, level=0.95)
                  2.5 %     97.5 %
(Intercept) -493.005016 -77.652127
x              2.738097   8.096189

This tells us that a 95% confidence interval for the intercept is (-493.0, -77.7), and that a 95% confidence interval for the slope is (2.7, 8.1).
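
These intervals are just the estimates plus or minus a t quantile times the standard error; for example, for the slope:
> est <- summary(fit)$coefficients["x", "Estimate"]
> se  <- summary(fit)$coefficients["x", "Std. Error"]
> est + c(-1, 1) * qt(0.975, df=4) * se   # matches the confint() interval for x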

Strength of the linear association of the response variable with the predictor variable
Part of the output of the 'summary(fit)' command is:
Multiple R-squared: 0.8874,    Adjusted R-squared: 0.8592 
This tells us that the regression line accounts for about 88.74% of the variance of the response variable in this sample (Multiple R-squared); the adjusted value of 85.92% corrects for the number of parameters fitted and is a better estimate of the strength of the linear association, i.e. of the percentage of the variance of the response accounted for by the regression line. A perfect linear relationship would give us 100%, so 85.92% seems quite good.
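
The R-squared values can also be recovered from the analysis of variance table:
> 513.55 / (513.55 + 65.17)            # Multiple R-squared = SS(regression)/SS(total)
> 1 - (1 - 0.8874) * (6 - 1)/(6 - 2)   # Adjusted R-squared, correcting for parameters fitted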

(Note to self: for multiple linear regression, to find out the %variance explained by predictor p3, make a model with predictors p1, p2, p3, and a model with predictors p1, p2 and see the difference in R^2_adj between the two models. The effect of each predictor depends on which other predictors are in the model though! Also note that the coefficients in the regression equation don't tell you which predictors are most important as the different predictor variables are often measured in different scales/units.)
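
A minimal sketch of that comparison, assuming a data frame dat with response y and predictors p1, p2 and p3 (these names are just placeholders from the note above):
> fit12  <- lm(y ~ p1 + p2, data=dat)        # model without p3
> fit123 <- lm(y ~ p1 + p2 + p3, data=dat)   # model with p3
> summary(fit123)$adj.r.squared - summary(fit12)$adj.r.squared   # extra variance credited to p3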
