## Thursday 27 August 2015

### Analysis of variance in R

One-way analysis of variance (one-way ANOVA) is where you have a continuous response variable and a categorical predictor variable.

Assumptions of one-way ANOVA
The assumptions of one-way ANOVA are:
(i) normal distributions of the residuals within each treatment group
(ii) the same variance of the residuals for all treatment groups
One-way ANOVA in R
> myfactor <- c(rep('A',6),rep('B',6),rep('C',6),rep('D',6))
> myresponse <- c(164,172,168,177,156,195, 178,191,197,182,185,177, 175,193,178,171,163,176,155,166,149,164,170,168)
> fit <- lm(myresponse ~ myfactor)
> summary(fit)
Call:
lm(formula = myresponse ~ myfactor)

Residuals:
Min     1Q Median     3Q    Max
-16.00  -7.00   0.00   5.25  23.00

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  172.000      4.101  41.943   <2e-16 ***
myfactorB     13.000      5.799   2.242   0.0365 *
myfactorC      4.000      5.799   0.690   0.4983
myfactorD    -10.000      5.799  -1.724   0.1001
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.04 on 20 degrees of freedom
Multiple R-squared:  0.4478,    Adjusted R-squared:  0.365
F-statistic: 5.406 on 3 and 20 DF,  p-value: 0.006876

> anova(fit)

Analysis of Variance Table

Response: myresponse
Df   Sum Sq  Mean Sq   F value   Pr(>F)
myfactor   3    1636.5    545.5        5.4063    0.006876 **
Residuals 20   2018.0    100.9
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

The 'Sum Sq' (sum of squares) column of the ANOVA table tells us that the 'explained sum of squares' (ESS) is 1636, which is a measure of how much of the overall variability in the data can be attributed to the differences in the treatment means. The 'residual sum of squares' (RSS) is 2018, which is a measure of how much of the overall variability in the data can be attributed to random variability. The 'total sum of squares' (TSS) is equal to the sum of these, 1636+2018=3654.

The 'Df' column of the table tells us that the degrees of freedom of RSS is equal to the total number of points in the data set (24) minus the number of treatment groups (4) = 24-4 = 20. The degrees of freedom of ESS is equal to the number of treatment groups (4) minus 1 ie. 4-1=3.

The 'Mean Sq' (mean square) column gives us the values of ESS/Df(ESS) = 1636/3 = 545.5, and RSS/Df(RSS) = 2018/20 = 100.9. Under the null hypothesis of equal mean response values in the treatment groups, ESS/Df(ESS) and RSS/Df(RSS) are both estimates of the variance of the response variable. However, if the null hypothesis is false, ESS/Df(ESS) is no longer an estimate of the variance, but RSS/Df(RSS) still is.

The 'F value' column gives us the F statistic, which is the ratio (ESS/Df(ESS))/(RSS/Df(RSS)) = 545.5/100.9 = 5.406. If the F statistic is very large, it means that the variability in the response variable that can be attributed to differences in the treatment means is very large, compared to the variability in the response variable that can be attributed to random variation.

We can get a P-value for the F statistic using an F(3, 20) distribution:
> 1 - pf(545.5/100.9, df1=3, df2=20)
[1] 0.006875948
We get that the P-value is 0.00688, as in the ANOVA table above. This is the P-value that the F value in a F(3, 20) distribution is equal to or greater than the observed F=5.406. The P-value is low here, so there is evidence against the null hypothesis of equal means in the treatment groups.

Checking the assumptions of one-way ANOVA in R: assumption of Normal residuals
The raw residuals can be retrieved using fit\$residuals. However, it is often useful to look at the standardized residuals, which are obtained by dividing raw residuals by their estimated standard error. The standardized residuals are in rstandard(fit). We can make a histogram of the raw
residuals:
> hist(fit\$residuals, col="orange")

In this case the histogram of residuals suggests the residuals are roughly Normal (they don't look skewed). We can also make Normal and half-Normal probability plots of the residuals (using functions from my previous blogpost):
> NormalProbPlot2(fit\$residuals)

> HalfNormalProbPlot(fit\$residuals)

The Normal and Half-Normal plots are roughly straight, again suggesting the residuals are Normal.

Checking the assumptions of one-way ANOVA in R: assumption of equal variance
It is a good idea to make boxplots of the responses in each treatment group, to check whether the assumption of equal variance is valid.
> myfactor <- c(rep('A',6),rep('B',6),rep('C',6),rep('D',6))
> myresponse <- c(164,172,168,177,156,195, 178,191,197,182,185,177, 175,193,178,171,163,176,155,166,149,164,170,168)
> boxplot(myresponse ~ myfactor, main="My response", xlab="Treatment", ylab="Response in units", col="orange")

The variances do look similar between the treatment groups, sugesting the assumption of equal variance is valid.

If there were marked differences in variance between the treatment groups, this would show up in a plot of residuals against fitted values (sample means):
> plot(fit\$fitted.values, fit\$residuals)
> lines(loess.smooth(fit\$fitted.values, fit\$residuals), col="blue", lty=2, lwd=2)
> abline(a=0,b=0)

There is no clear difference in the variance of residuals between the treatment groups.

## Monday 10 August 2015

### pfam_scan.pl

I'm using the pfam_scan.pl to scan a FASTA file of proteins for matches to Pfam domains. You can download pfam_scan.pl from the Pfam group's ftp site.

Getting Pfam HMMs

To run pfam_scan.pl will need to download the following files from the Pfam ftp site
(you can get pfam_scan.pl from ftp://ftp.ebi.ac.uk/pub/databases/Pfam/Tools/; you can get the files below from for example ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/ (go to the latest release)):

Pfam-A.hmm : Pfam-A library of HMMs
Pfam-A.hmm.dat  : contains info about each Pfam-A family
active_site.dat : contains active site info about each family (required for the -as option)

You will need to generate binary files for Pfam-A.hmm by running the following commands:
% hmmpress Pfam-A.hmm

Note that the current Pfam HMMs are in HMMER3 format, and that pfam_scan.pl works fine with HMMER3 format.

Also note that, according to Pfam's release notes, Pfam-B has been discontinued from release 28.0 onwards.

Running pfam_scan.pl

Usage:
pfam_scan.pl -fasta <fasta_file> -dir <directory location of Pfam files>
Useful options are:
-outfile <file>   : output file, otherwise send to STDOUT

eg.
pfam_scan.pl -fasta temp.pep -dir .

Output from pfam_scan.pl

Each output line contains the following information:

<seq id> <alignment start> <alignment end> <envelope start> <envelope end> <hmm acc> <hmm name> <type> <hmm start> <hmm end> <hmm length> <bit score> <E-value><significance> <clan> <predicted_active_site_residues>

Example output (with -as option):
Y74C9A.3      12    225     11    225 PF05891.8   Methyltransf_PK   Family     2   218   218    312.5   1.2e-93   1 CL0063

Note that Pfam groups together families with a common evolutionary ancestor into clans. If there are overlapping matches within a clan, pfam_scan.pl only shoes the most significant (with lowest E-value) match within the clan.

Memory and Run-time

I found that for a protein fasta file of 5000 sequences (some C. elegans protein sequences), pfam_scan.pl needed about 1000 Mb of memory (RAM) to run (I requested 2000 Mb when I submitted it to our compute farm). It took about 1 hour and 20 minutes to run.

## Friday 7 August 2015

### Fisher's exact test in R

I always have trouble remembering how to do the Fisher's exact test in R.

So here it is!

To test whether 393/1832 is significantly different from 2/36:

> fisher.test(matrix(c(393,2,1832-393,36-2),nrow=2))
Fisher's Exact Test for Count Data

data:  matrix(c(393, 2, 1832 - 393, 36 - 2), nrow = 2)
p-value = 0.02115
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
1.180687 40.036943
sample estimates:
odds ratio
4.640479

The P-value is 0.02115.