## Thursday 27 August 2015

### Analysis of variance in R

One-way analysis of variance (one-way ANOVA) is where you have a continuous response variable and a categorical predictor variable.

Assumptions of one-way ANOVA
The assumptions of one-way ANOVA are:
(i) normal distributions of the residuals within each treatment group
(ii) the same variance of the residuals for all treatment groups
One-way ANOVA in R
> myfactor <- c(rep('A',6),rep('B',6),rep('C',6),rep('D',6))
> myresponse <- c(164,172,168,177,156,195, 178,191,197,182,185,177, 175,193,178,171,163,176,155,166,149,164,170,168)
> fit <- lm(myresponse ~ myfactor)
> summary(fit)
Call:
lm(formula = myresponse ~ myfactor)

Residuals:
Min     1Q Median     3Q    Max
-16.00  -7.00   0.00   5.25  23.00

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  172.000      4.101  41.943   <2e-16 ***
myfactorB     13.000      5.799   2.242   0.0365 *
myfactorC      4.000      5.799   0.690   0.4983
myfactorD    -10.000      5.799  -1.724   0.1001
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.04 on 20 degrees of freedom
Multiple R-squared:  0.4478,    Adjusted R-squared:  0.365
F-statistic: 5.406 on 3 and 20 DF,  p-value: 0.006876

> anova(fit)

Analysis of Variance Table

Response: myresponse
Df   Sum Sq  Mean Sq   F value   Pr(>F)
myfactor   3    1636.5    545.5        5.4063    0.006876 **
Residuals 20   2018.0    100.9
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The 'Sum Sq' (sum of squares) column of the ANOVA table tells us that the 'explained sum of squares' (ESS) is 1636, which is a measure of how much of the overall variability in the data can be attributed to the differences in the treatment means. The 'residual sum of squares' (RSS) is 2018, which is a measure of how much of the overall variability in the data can be attributed to random variability. The 'total sum of squares' (TSS) is equal to the sum of these, 1636+2018=3654.

The 'Df' column of the table tells us that the degrees of freedom of RSS is equal to the total number of points in the data set (24) minus the number of treatment groups (4) = 24-4 = 20. The degrees of freedom of ESS is equal to the number of treatment groups (4) minus 1 ie. 4-1=3.

The 'Mean Sq' (mean square) column gives us the values of ESS/Df(ESS) = 1636/3 = 545.5, and RSS/Df(RSS) = 2018/20 = 100.9. Under the null hypothesis of equal mean response values in the treatment groups, ESS/Df(ESS) and RSS/Df(RSS) are both estimates of the variance of the response variable. However, if the null hypothesis is false, ESS/Df(ESS) is no longer an estimate of the variance, but RSS/Df(RSS) still is.

The 'F value' column gives us the F statistic, which is the ratio (ESS/Df(ESS))/(RSS/Df(RSS)) = 545.5/100.9 = 5.406. If the F statistic is very large, it means that the variability in the response variable that can be attributed to differences in the treatment means is very large, compared to the variability in the response variable that can be attributed to random variation.

We can get a P-value for the F statistic using an F(3, 20) distribution:
> 1 - pf(545.5/100.9, df1=3, df2=20)
[1] 0.006875948
We get that the P-value is 0.00688, as in the ANOVA table above. This is the P-value that the F value in a F(3, 20) distribution is equal to or greater than the observed F=5.406. The P-value is low here, so there is evidence against the null hypothesis of equal means in the treatment groups.

Checking the assumptions of one-way ANOVA in R: assumption of Normal residuals
The raw residuals can be retrieved using fit\$residuals. However, it is often useful to look at the standardized residuals, which are obtained by dividing raw residuals by their estimated standard error. The standardized residuals are in rstandard(fit). We can make a histogram of the raw
residuals:
> hist(fit\$residuals, col="orange")

In this case the histogram of residuals suggests the residuals are roughly Normal (they don't look skewed). We can also make Normal and half-Normal probability plots of the residuals (using functions from my previous blogpost):
> NormalProbPlot2(fit\$residuals)

> HalfNormalProbPlot(fit\$residuals)

The Normal and Half-Normal plots are roughly straight, again suggesting the residuals are Normal.

Checking the assumptions of one-way ANOVA in R: assumption of equal variance
It is a good idea to make boxplots of the responses in each treatment group, to check whether the assumption of equal variance is valid.
> myfactor <- c(rep('A',6),rep('B',6),rep('C',6),rep('D',6))
> myresponse <- c(164,172,168,177,156,195, 178,191,197,182,185,177, 175,193,178,171,163,176,155,166,149,164,170,168)
> boxplot(myresponse ~ myfactor, main="My response", xlab="Treatment", ylab="Response in units", col="orange")

The variances do look similar between the treatment groups, sugesting the assumption of equal variance is valid.

If there were marked differences in variance between the treatment groups, this would show up in a plot of residuals against fitted values (sample means):
> plot(fit\$fitted.values, fit\$residuals)
> lines(loess.smooth(fit\$fitted.values, fit\$residuals), col="blue", lty=2, lwd=2)
> abline(a=0,b=0)

There is no clear difference in the variance of residuals between the treatment groups.