Monday 12 August 2013

DESeq R package for finding differential expression analysis of RNA-seq data

The DESeq 2010 paper by Anders & Huber
I've just presented the DESeq paper (Anders & Huber 2010) as a journal club paper, and have put my slides on slideshare in case they're of interest to anyone.

The main points of the paper are:

- A Poisson model underestimates the variance in RNA-seq read counts for a gene between biological samples, and this leads to false positives if you are using a Poisson model to detect differentially expressed genes.

- A Negative Binomial distribution is much better, especially for highly expressed genes, where a Poisson greatly underestimates the true variance.

- DESeq and EdgeR both use the Negative Binomial distribution to model the number of RNA-seq read counts for a gene.

- However, there are a couple of key differences between DESeq and EdgeR:
(i) DESeq estimates the sequencing depth for a library differently than EdgeR
(ii) DESeq estimates the variance in read count for a gene by assuming that it will have similar variance to genes of similar expression level (it uses a local regression of genes' variance versus expression, to estimate the variance)
[Note: my colleagues tell me that recent versions of EdgeR and DESeq seem to have changed however, and may do things more similarly nowadays.]

- According to the DESeq paper (which is now a few years old), DESeq and EdgeR have similar sensitivity for detecting differentially expressed genes, but EdgeR calls a greater number of weakly expressed genes as significant, and fewer highly expressed genes as significant, compared to DESeq. [Again, it would be nice to know if this is still the case, since these tools have been changed since publication of this paper.]

- DESeq has a clever way of estimating the sequencing depth for a library, that avoids being affected by just a few highly expressed genes. They say that the total number of reads in a library can be affected by just a few highly expressed genes, so isn't a good measure of sequencing depth. They use their measure of sequencing depth to normalise the estimated read count from a library for a particular gene, to give a more accurate measure of its expression level. My colleague Adam Reid suggested that this could be a better measure of expression level than RPKM, which is based on the total number of reads in a library (and so can be affected by just a few highly expressed genes).

Changes since the paper was published
The DESeq vignette (available here) lists some changes since the paper was published:
- the way in which the p-values are calculated has changed slightly (see the vignette for details)
- the way in which the variances (dispersions) has changed slightly (see below)
- in the original paper, a separate mean-dispersion regression was made for each condition, but in the latest version of DESeq, one dispersion value is estimated for a gene across all (replicated) conditions, and this is used to make a single mean-dispersion regression
- in the original paper, local regression was used to fit the mean-dispersion relationship. In the latest version of DESeq, a parametric regression is used instead by default.

Other nice features of DESeq
Based on reading the DESeq vignette (available here) and the paper, here is a list of other nice features of DESeq:
- It will still work if you only have biological replicates for one condition, and not for the second condition. It will even work if you don't have biological replicates for either of your two conditions (ie. just one biological replicate from each condition), although this is not recommended, as it is based on the assumption that only a small fraction of the genes are differentially expressed between conditions. 

Running DESeq
My colleague Anna Protasio suggested that a good way to learn DESeq is to work through the R vignette, available here.

Here are the basic steps, using the example in the vignette:

1) Read in the count data from a file (previously generated):
% R-3.0.0
> library(pasilla, lib="~alc/R/library")
> library(DESeq, lib="~alc/R/library")
> datafile = system.file( "extdata/pasilla_gene_counts.tsv", package="pasilla" )
> pasillaCountTable = read.table( datafile, header=TRUE, row.names=1 )
[Note to self: I ran this on pcs4, by logging in using 'ssh -Y psc4']

The pasillaCountTable data frame has the genes as rows and samples (including biological replicates, but with each set of technical replicates merged into one) as columns. The values are raw read counts.

Note: you can view the pasilla_gene_counts.tsv file here

2) Store the metadata for the data set:
> pasillaDesign = data.frame(row.names = colnames( pasillaCountTable ), condition = c( "untreated", "untreated", "untreated", "untreated", "treated", "treated", "treated" ), libType = c( "single-end", "single-end", "paired-end","paired-end", "single-end", "paired-end", "paired-end" ) )

We can extract out just the data for the paired-end samples, to keep things simple:
> pairedSamples = pasillaDesign$libType == "paired-end"
> countTable = pasillaCountTable[ , pairedSamples ]
> condition = pasillaDesign$condition[ pairedSamples ]

3) Make a DESeq 'CountDataSet' object for the data:
Now make a DESeq 'CountDataSet' object:
> cds = newCountDataSet( countTable, condition )

4) Normalise the data, by estimating the sequencing depth for each sample:  
> cds = estimateSizeFactors( cds )
> sizeFactors( cds )
 untreated3   untreated4   treated2   treated3
 0.8730966  1.0106112  1.0224517  1.1145888 

These are relative sequencing depths of the different samples. 

We can normalise the count data for genes, by dividing the raw counts by these sequencing depth factors (this gives the q_i value for gene i in a sample, described in the DESeq paper equation 6), eg.:
> head( counts( cds, normalized=TRUE ) )
                          untreated3      untreated4   treated2     treated3
FBgn0000003    0.000000        0.00000    0.00000    0.8971919
FBgn0000008   87.046493      69.26502   86.06763   62.8034302
FBgn0000014    0.000000        0.00000    0.00000    0.0000000
FBgn0000015    1.145349       1.97900    0.00000    0.0000000
FBgn0000017   4082.022370  3116.92579 3004.54278 2991.2376629
FBgn0000018   280.610404    306.74508  292.43434  276.3350930
...

This gives normalised expression level (read count) values for the genes. One normalised expression value is given for each gene in each biological replicate.

5) Estimate the variances of the expression levels values for each gene, across all samples:
The variance of a gene is estimated as the sum of two components: the uncertainty in measuring a concentration by counting reads ("shot noise") plus the variation between biological replicates for a condition (called the "dispersion" in the DESeq vignette), as given in Equation 3 in the DESeq paper.

To estimate the dispersion values for genes, we type:
> cds = estimateDispersions( cds )
The vignette explains that the 'estimateDispersions' function carries out three steps:
(i) it estimates the dispersion value for each gene: w_i across all the biological replicates for all conditions (using Equation 7 in the DESeq paper),
(ii) it fits a curve through these estimates, ie. fits a regression line between w_i for genes and the mean (across all biological replicates for all conditions) normalised expression level for the genes (q_i; see equation 6 in the DESeq paper). The vignette points out that in the paper a local regression was used, but by default the latest software version uses a parametric fit instead.
(iii) it assigns the gene a dispersion value. The vignette says that w_i is used if it is greater than the fitted value from the regression, and otherwise the fitted value from the regression is used. The vignette explains that this change has been made since the paper was published, to take into account that some genes seem to have much higher dispersion than others.
      The paper says that Equation 8 is used, which subtracts a value z_i from the fitted value.  I'm not sure if the z_i values are still used in the calculation, the vignette doesn't make this clear. (?)

Note that in the paper they described estimating a separate dispersion value for a gene in each condition (where there are several replicates for each condition), but it seems that the latest version of the software estimates just one dispersion value for a gene, across all the biological replicates from all conditions.

As a QC step, we can plot the per-gene dispersion estimates (w_i) against the mean normalised counts per gene (q_i), and overlay the fitted curve:
> plotDispEsts( cds )
6) Call differential expression between two experimental conditions ('treated' versus 'untreated' here):
> res = nbinomTest( cds, "untreated", "treated" )
This takes a minute or two to run.
> head(res)
           id               baseMean     baseMeanA    baseMeanB     foldChange   log2FoldChange      pval              padj
1 FBgn0000003    0.2242980    0.000000       0.4485959         Inf                Inf                         1.0000000      1.0000000
2 FBgn0000008   76.2956431   78.155755    74.4355310       0.9523999    -0.07036067          0.8354725      1.0000000
3 FBgn0000014    0.0000000    0.000000       0.0000000        NaN            NaN                        NA                  NA
4 FBgn0000015    0.7810873    1.562175       0.0000000        0.0000000    -Inf                        0.4160556       1.0000000
5 FBgn0000017 3298.6821506 3599.474078 2997.8902236  0.8328690    -0.26383857          0.2414208      0.8811746
6 FBgn0000018  289.0312286  293.677741   284.3847165    0.9683564    -0.04638999          0.7572819      1.0000000

where "id" is the gene name;
"baseMean" is the mean normalised expression level, averaged over all replicates from all conditions;
"baseMeanA" is the mean normalised expression level, averaged over all condition A replicates;
"baseMeanB" is the mean normalised expression level, averaged over all condition B replicates;
"foldChange" is the fold change from condition A to B;
"log2FoldChange" is the log2 of foldChange;
"pval" is the p-value;
"padj" is the p-value adjusted for multiple testing using the Benjamini-Hochberg procedure.

We can plot log2FoldChange against baseMean, with genes that are significant at a 10% false discovery rate (FDR) coloured red:
> plotMA(res)
This MA plot has what is called a 'sting-ray' shape by some of my colleagues. There are more genes of higher expression level that are called as differentially expressed, compared to genes of low expression level.

The DESeq vignette also recommends to have a look at a histogram of the p-values:
> hist(res$pval, breaks=100, col="skyblue", border="slateblue", main="")
The lower values are due to differentially expressed genes, while the p-values for genes that are not differentially expressed are uniformly distributed between 0 and 1 (except p-values for very poorly expressed genes, which are close to 1). 

To filter for significant genes, according to some threshold for false discovery rate (FDR):
> resSig = res[ res$padj < 0.1, ]
To list the most significantly differentially expressed genes:
> head( resSig[ order(resSig$pval), ] )
               id                   baseMean    baseMeanA  baseMeanB   foldChange log2FoldChange  pval               padj  
9831   FBgn0039155   463.4369     884.9640      41.90977        0.0473576      -4.400260     1.641210e-124 1.887556e-120
2366   FBgn0025111   1340.2282   311.1697      2369.28680    7.6141316       2.928680     
3.496915e-107 2.010901e-103
 612     FBgn0003360   2544.2512  4513.9457     574.55683     0.1272848      -2.973868     1.552884e-99  5.953239e-96
 3192   FBgn0029167   2551.3113  4210.9571     891.66551     0.2117489      -2.239574     4.346335e-78  1.249680e-74
10305 FBgn0039827   188.5927    357.3299      19.85557        0.0555665      -4.169641     
1.189136e-65  2.735251e-62
6948   FBgn0035085   447.2485    761.1898      133.30718      0.1751300      -2.513502      
3.145997e-56  6.030352e-53

We can tally the number of differentially expressed genes: 
> addmargins( table( res_sig = res$padj < .1) )
FALSE  TRUE   Sum
10680      821    11501

To save the output to a file, we can type:
> write.csv( res, file="My Pasilla Analysis Result Table.csv" )

7) Data quality assessment by sample clustering and visualisation
The DESeq vignette recommends that you carry out some quality assessment steps.

Firstly, you can make a heat map of variance stabilisation transformed data (see the DESeq paper for an explanation of the variance stabilising transformation), ie. a heatmap of genes:
> cdsFullBlind = estimateDispersions( cdsFull, method = "blind" )
> vsdFull = varianceStabilizingTransformation( cdsFullBlind )
> library("RColorBrewer")
> library("gplots")
> select = order(rowMeans(counts(cdsFull)), decreasing=TRUE)[1:30]
> hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(100)
> heatmap.2(exprs(vsdFull)[select,], col = hmcol, trace="none", margin=c(10, 6))
This gives a heatmap for the 30 most highly expressed genes:
You can see that, for these 30 genes, the treated samples are grouped together, and the untreated samples are grouped together.

You can also make a heatmap of the samples, using the variance stabilised data:
> dists = dist( t( exprs(vsdFull) ) )
> mat = as.matrix( dists )
> rownames(mat) = colnames(mat) = with(pData(cdsFullBlind), paste(condition, libType, sep=" : "))
> heatmap.2(mat, trace="none", col = rev(hmcol), margin=c(13, 13))
It's reassuring to see that the untreated samples group with each other, as do the treated samples.

We can also make a PCA plot of the samples, using:
> print(plotPCA(vsdFull, intgroup=c("condition", "libType")))
The first principle component separates the treated and untreated samples, while the second principle component separates the paired-end and single-end samples.

8) Other topics
The DESeq vignette also covers other topics such as:
- what to do if you have multiple factors (eg. condition such as 'treated' and 'untreated', library type such as 'paired-end' and 'single-end', etc.)
- 'independent filtering' (ie. filtering out those tests from the procedure that have no, or little chance of showing significant evidence, without even looking at their test statistic) based on the overall sum of counts (independent of biological condition)
- how to perform a variance stabilising transformation (as described also in the DESeq paper), for example, if you are intending to perform a cluster analysis of the data or to plot the data. 

Thanks to my colleagues in the Parasite Genomics Team for very interesting discussion about this.

No comments: