Thursday 2 July 2015

Using TopGO to test for GO-term enrichment

I wanted to test whether any GO terms are enriched in my gene list compared to all genes in the genome, so decided to try topGO, since my colleagues recommend it.

How TopGO works
The details of how TopGO are published in a paper by Alexa et al (2006). TopGO takes the GO hierarchy into account when calculating enrichment, and it is thought that this will lead it to report fewer false positive results than methods that do not take the GO hierarchy into account (eg. just using a Fisher's exact test). Grossman et al (2007) et al say that if you don't take the GO hierarchy into account, you get what they call the 'inheritance problem': because the highest (more general) GO terms inherit annotations from more specific descendant terms, and this can lead to false positives.

There are two methods described in the Alexa et al (2006) paper:
(i) 'elim' : this method processes the GO terms by traversing the GO hierarchy from bottom to top, ie. it first assesses the most specific (bottom-most) GO terms, and proceeds later to more general (higher) GO terms. When it assesses a higher (more general) GO term, it discards any genes that are annotated with significantly enriched descendant GO terms (considered significant using a pre-defined P-value threshold). This method does tend to miss some true positives at higher (more general) levels of the GO hierarchy.
(ii) 'weight': this is designed to miss less true positives than the 'elim' method. The significance scores of connected nodes (a parent and its child in the GO hierarchy) are compared in order to detect the locally most significant terms in the GO hierarchy.

The current TopGO actually has four more methods:
(iii) 'classic': each GO term is tested independently, not taking the GO hierarchy into account.
(iv) 'weight01': this is the default method used by TopGO, and is a mixture of the 'elim' and 'weight' methods.
(v) 'lea': this doesn't seem to be described in the TopGO vignette or any paper, so I'm not sure what it does (?)
(vi) 'parentchild': described by Grossman et al (2007) : when assessing a GO term, it takes into accoount the annotation of terms to the current term's parents, and so reduces false positives due to the inheritance problem. You can also use this algorithm in a very nice tool called Ontologizer , which has a lovely GUI (I used it in a collaboration on gene duplicates with Shane Woods, described in his paper Woods et al (2013)). 

The 'elim' and 'weight' methods are designed to be more conservative (have less false positives) than the 'classic' method, so should have larger p-values in general. Simulation results reported by Alexa et al (2006) show that the 'weight' algorithm has less false positives than 'classic' and misses few true positives, while 'elim' has even less false positives than 'weight' but misses more true positives.  Grossman et al (2007) compare the 'elim', 'weight' and 'parentchild' algorithms, and say that each has advantages in certain scenarios.

How to run TopGO
The vignette described in detail how to use the package, but here's a summary of what I did (so I won't forget!):

R packages
It's important to have the latest version of the packages, to make sure you have bug fixes, but also because the gene ontology is read into TopGO from another R package (GO.db), so you need to make sure you have the latest version of that package.
First you need to make sure that you have the latest version of R.
Then make sure you have the latest version of Bioconductor, and the latest version of the Bioconductor packages topGO:
> source("")
> biocLite()
> biocLite("topGO")
Then check you have the latest versions of everything: (for me it should be R 3.2.0, Bioconductor 3.1 and TopGO 2.2.0 and GO.db (comes with TopGO) 3.1.2:
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)

[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base    

other attached packages:
 [1] topGO_2.20.0         SparseM_1.6          GO.db_3.1.2        
 [4] RSQLite_1.0.0        DBI_0.3.1            AnnotationDbi_1.30.1
 [7] GenomeInfoDb_1.4.1   IRanges_2.2.5        S4Vectors_0.6.1    
[10] Biobase_2.28.0       graph_1.46.0         BiocGenerics_0.14.0
[13] BiocInstaller_1.18.3

loaded via a namespace (and not attached):
[1] lattice_0.20-31 grid_3.2.0      tools_3.2.0   

Starting topGO
Once you have installed topGO you can start it by typing:
> library("topGO")

Loading your data
Reading in GO annotations for the genes
The file that you provide of gene-to-GO mappings should look be a tab-delimited file looking like this, where the first column is the gene name and the second column contains a list of GO terms (separated by a comma and space):
105778    GO:0016021, GO:0016020
166881    GO:0003674, GO:0016021, GO:0016020, GO:0008150
188691    GO:0016021, GO:0016020, GO:0001584, GO:0007165, GO:0004984, GO:0007186, GO:0004872
184033    GO:0005737, GO:0005515

 For example, if your file is called 'annotations.txt', you can read it in using:
> geneID2GO <- readMappings(file = "annotations.txt") 
To have the fastest run-time, for a particular gene, it shouldn't be the case that one of the GO terms is an ancestor of one of the other GO terms for that gene. However, TopGO will run fine if this is the case, although it will be slower. In practice, I found that I didn't need to prune my lists of GO terms like this, as TopGO was pretty fast for my data set.

GO hierarchy
The GO hierarchy used by topGO is the one that comes with the GO.db package. Therefore the user does not need to provide this.

Defining your list of genes of interest, and the 'gene universe' to compare it to
You need to tell TopGO what is your list of genes of interest, and what is the list of all genes in the 'gene universe' that you want to compare your list of interest to.
The 'gene universe' can be all the genes that contain GO annotations in your input 'annotations.txt' file (see above):
> geneUniverse <- names(geneID2GO) 
We can read in a list of interesting genes from a file, just containing a list of interesting genes, one per line:
> genesOfInterest <- read.table("interestinggenes.txt",header=FALSE)
> genesOfInterest <- as.character(genesOfInterest$V1)
Then we need to tell TopGO where these interesting genes appear in the 'geneUniverse' vector:
> geneList <- factor(as.integer(geneUniverse %in% genesOfInterest))
> names(geneList) <- geneUniverse
The geneList object tells TopGO which genes in the gene universe are your genes of interest.

Putting the data together into an R object
We need to put our data in an object of type 'topGOdata'. This will contain the list of genes of interest, the GO annotations, and the GO hierarchy itself.
> myGOdata <- new("topGOdata", description="My project", ontology="BP", allGenes=geneList,  annot = annFUN.gene2GO, gene2GO = geneID2GO)
The 'ontology' argument can be 'BP' (biological process), 'MF' (molecular function), or 'CC' (cellular component).
The 'description' argument has a short description of your project.
The 'allGenes' argument specifies all the genes in the 'gene universe', and which of those are  your genes of interest.
The 'annot' argument tells topGO how to map genes to GO annotations. 'annot'=annFUN.gene2GO means that the user provides gene-to-GO annotations, and we specify here that they are in object 'geneID2GO' (see above for how this was created).
An optional argument is 'nodeSize': this is used to prune the GO hierarchy, eg. nodesize=10 prunes the GO hierarchy, to remove terms which have less than 10 annotated genes.

TopGO generally only seems to use a subset of my 'gene universe' data, which it considers 'feasible'. I think this is because you may have some genes in your input 'gene universe' that don't have any GO terms annotated to them. Also, you might have some terms in the GO annotations for your gene universe that don't exist in the version of the GO hierarchy that TopGO uses (which comes from the GO.db R package). Unfortunately this means that TopGO actually only uses some of your input data, and throws some away. Note that also, you may have chosen to use 'BP' (biological process) terms but many of your input annotations may be molecular function (MF) or cellular component (CC) terms.

If you now type 'myGOdata', you get a summary of the data in it:
> myGOdata 
   -  My project
   -  BP
 4 available genes (all genes from the array):
   - symbol:  105778 166881 188691 184033  ...
   - 2  significant genes.
 2 feasible genes (genes that can be used in the analysis):
   - symbol:  166881 188691  ...
   - 1  significant genes.
 GO graph (nodes with at least  1  genes):
   - a graph with directed edges
   - number of nodes = 15
   - number of edges = 22 

Here 2 of the 4 genes in the 'gene universe' have been mapped to the GO ontology (there are 2 'feasible' genes), so the rest of the analysis will be based on just these genes. TopGO calls the genes of interest 'significant genes'. There are 2 genes of interest in the full gene universe of 4 genes, but just one gene of interest among the 2 feasible genes. The 'myGOdata' object takes some time to build, especially if you have a large gene universe. I found it pretty fast though, less than one minute for a gene universe of ~100,000 genes. I guess the run-time might depend on how much memory (RAM) you have available(?)

The list of genes of interest can be accessed using the method sigGenes():
> sg <- sigGenes(myGOdata)
> str(sg)
> numSigGenes(myGOdata)

Performing enrichment tests
One type of test that topGO does is a Fisher's exact test based on gene counts:
> resultFisher <- runTest(myGOdata, algorithm="classic", statistic="fisher")
Selecting 'algorithm=classic' means that the GO hierarchy isn't taken into account, so each GO term is tested independently. (Usually in fact we want to take the GO hierarchy into account, so would use algorithm='weight01' for example.)
You can type 'resultFisher' to get a summary of the results:
Description: My project
Ontology: BP
'classic' algorithm with the 'fisher' test
15 GO terms scored: 0 terms with p < 0.01
Annotation data:
    Annotated genes: 2
    Significant genes: 1
    Min. no. of genes annotated to a GO: 1
    Nontrivial nodes: 1

The p-values have not been corrected for multiple testing.

We can list the top ten significant results found:
> allRes <- GenTable(myGOdata, classicFisher = resultFisher, orderBy = "resultFisher", ranksOf = "classicFisher", topNodes = 10)
        GO.ID                                        Term Annotated Significant
1  GO:0005525                                 GTP binding         3           2
2  GO:0019001                   guanyl nucleotide binding         3           2
3  GO:0032561               guanyl ribonucleotide binding         3           2
4  GO:0003746      translation elongation factor activity         1           1
5  GO:0005126                   cytokine receptor binding         1           1
6  GO:0005132      interferon-alpha/beta receptor binding         1           1
7  GO:0008135 translation factor activity, nucleic aci...         1           1
8  GO:0008332 low voltage-gated calcium channel activi...         1           1
9  GO:0016667 oxidoreductase activity, acting on a sul...         1           1
10 GO:0016670 oxidoreductase activity, acting on a sul...         1           1

   Expected classicFisher
1       0.3         0.026
2       0.3         0.026
3       0.3         0.026
4       0.1         0.101
5       0.1         0.101
6       0.1         0.101
7       0.1         0.101
8       0.1         0.101
9       0.1         0.101
10      0.1         0.101

We can visualise the position of the statistically significant GO terms in the GO hierarchy by using the following functions:
> showSigOfNodes(myGOdata, score(resultFisher), firstSigNodes = 5, useInfo ='all')
> printGraph(myGOdata, resultFisher, firstSigNodes = 5, fn.prefix = "tGO", useInfo = "all", pdfSW = TRUE)
The second command makes a pdf file ("tGO_classic_5_all.pdf") with the picture. The significant GO terms are shown as rectangles in the picture. The most significant terms are coloured red and least significant in yellow:

Finding the genes annotated with significant GO terms
If you use the weight01 (default), elim or parentchild algorithm, TopGO may report a GO term as significantly enriched which wasn't actually in your input annotations, but rather is an ancestor (in the GO hierarchy) of a GO term in your input annotations.

The number of GO terms in the TopGO subset of the GO hierarchy (the GO terms annotated to genes in the gene 'universe' input file, plus the ancestors of those GO terms in the GO hierarchy) can be found using:
> length(usedGO(myGOdata))
[1] 2398
This should agree with the 'number of nodes' given for the GO graph when you type 'myGOdata', as it is the number of nodes in TopGO's internal version of the GO hierarchy (which is filtered just to contain your terms annotated to your gene universe, plus their ancestors).

You may find that TopGO says that terms "GO:0007610", "GO:0014070", "GO:0045910", but how do we find the genes that TopGO is considering as annotated with those terms, without having a good look at the GO hierarchy ourselves?

After a bit of (slightly painful, as I'm not too good at R!) fiddling with TopGO, I found out that to list the genes annotated to a particular set of GO terms, we can type something like this:
> myterms = c("GO:0007610", "GO:0014070", "GO:0045910")
> mygenes <- genesInTerm(myGOdata, myterms)
> for (i in 1:length(myterms))
       myterm <- myterms[i]
       mygenesforterm <- mygenes[myterm][[1]]
       mygenesforterm <- paste(mygenesforterm, collapse=',')
[1] "Term GO:0007610 genes: 173785,194776,210700,213042,222387,257151,272506,315440,347686,357493,361031,367238,397820,404238,435988,436908,533191,574720,575998,604672,615430,698226,718388,721238,759288,759468,759918,858210,866451,94187"
[1] "Term GO:0014070 genes: 1050444,1051974,1058304,1119532,1137071,12391,163450,263383,364148,454323,725603,737096,737144,765031,768366,892110,935204,961708,989693"
[1] "Term GO:0045910 genes: 215658"

Running TopGO using an R script
You can run TopGO using an R script by using a command such as:
Rscript Rscript1.R annotations.txt interestinggenes.txt output_file.txt
where Rscript1.R is your R script, annotations.txt and interestinggenes.txt are the files defined as above, and output_file.txt is the file to save R output in.

Rscript1.R has this text, and executable permissions:
# Rscript1.R
# eg. run:
# Rscript Rscript1.R annotations.txt interestinggenes.txt output_file

args <- commandArgs(trailingOnly = TRUE)
universeFile = args[1]
interestingGenesFile = args[2]
output_file = args[3]

# set the output file

# load topGO

# read in the 'gene universe' file
geneID2GO <- readMappings(file = universeFile)
geneUniverse <- names(geneID2GO)

# read in the genes of interest
genesOfInterest <- read.table(interestingGenesFile,header=FALSE)
genesOfInterest <- as.character(genesOfInterest$V1)
geneList <- factor(as.integer(geneUniverse %in% genesOfInterest))
names(geneList) <- geneUniverse

# build the GOdata object in topGO
myGOdata <- new("topGOdata", description="My project", ontology="BP", allGenes=geneList,  annot = annFUN.gene2GO, gene2GO = geneID2GO)

# run the Fisher's exact tests
resultClassic <- runTest(myGOdata, algorithm="classic", statistic="fisher")
resultElim <- runTest(myGOdata, algorithm="elim", statistic="fisher")
resultTopgo <- runTest(myGOdata, algorithm="weight01", statistic="fisher")
resultParentchild <- runTest(myGOdata, algorithm="parentchild", statistic="fisher")

# see how many results we get where weight01 gives a P-value <= 0.001:
mysummary <- summary(attributes(resultTopgo)$score <= 0.001)
numsignif <- as.integer(mysummary[[3]]) # how many terms is it true that P <= 0.001

# print out the top 'numsignif' results:
allRes <- GenTable(myGOdata, classicFisher = resultClassic, elimFisher = resultElim, topgoFisher = resultTopgo, parentchildFisher = resultParentchild, orderBy = "topgoFisher", ranksOf = "classicFisher", topNodes = numsignif)

# print a graph (to a pdf file) with the top 'numsignif' results:
output_file2 = paste(output_file,"Topgo", sep="_")
printGraph(myGOdata, resultTopgo, firstSigNodes = numsignif, fn.prefix = output_file2, useInfo = "all", pdfSW = TRUE)

# print out the genes that are annotated with the significantly enriched GO terms:
myterms <- allRes$GO.ID
mygenes <- genesInTerm(myGOdata, myterms)
for (i in 1:length(myterms))
   myterm <- myterms[i]
   mygenesforterm <- mygenes[myterm][[1]]
   myfactor <- mygenesforterm %in% genesOfInterest # find the genes that are in the list of genes of interest
   mygenesforterm2 <- mygenesforterm[myfactor == TRUE]
   mygenesforterm2 <- paste(mygenesforterm2, collapse=',')

# close the output file

Note: 15-Feb-2022: I found a nice blogpost by Zhigang Lu, on how to give TopGO scores for each gene as input (e.g. from Seurat P-values or something like that), and use them in a KS test in TopGO to calculate enrichment P-values: see Zhigang's blogpost here.



Denis said...

Thank you so much! Extremely useful post. I saved a lot of my time by using your R code and explanations. Great job!

Unknown said...

Hi Avril

I would like to do GO enrichment in a few different ways.

1. I want to do enrichment analysis for a set of genes in a genome. Do I use the entire genome as the gene universe?
2. I would also like to compare between genomes. For instance which GO terms are enriched in one genome in comparison to another genome. What will my gene universe be here?

Thanks for your help.

Anonymous said...

Hi, I have a list differentially expressed genes and I wanted to do an GO enrichment analysis using topGO. I am having trouble making a geneID2GO file, I extracted GO terms corresponding to each gene ID using the following script. How to make the 'geneID2GO' object from the following script compatible with topGO?

#collect gene names from biomart
mart <- biomaRt::useMart(biomart = "plants_mart",
dataset = "athaliana_eg_gene",
host = '')
# Get ensembl gene ids and GO terms
GTOGO <- biomaRt::getBM(attributes = c( "ensembl_gene_id",
"go_id"), mart = mart)
#examine result
head (GTOGO)
#Remove blank entries
GTOGO <- GTOGO[GTOGO$go_id != '',]
# convert from table format to list format
geneID2GO <- by(GTOGO$go_id,
function(x) as.character(x))
#examine result
head (geneID2GO)

Yifangt said...

Hi Avril:
Can I ask your suggestion for the way to create the gene-to-GO mapping file that topGo requires?
I am working on an unusual genome and I have the whole genome CDS set.
Thanks a lot!


Unknown said...

What a useful post. Thank you Avril !

Yvan said...

Thank you for your post! Your succint explanations about the different methods are very informative.

Two or three years later, would you still use TopGo? Best, Yvan