Friday, 31 July 2015
Using R to test for equality of variances for two independent Normal variables
> x1 <- c(0.06, 0.08, 0.17, 0.12, -0.08, 0.11, 0.09, 0.03, 0.20, 0.08, 0.05, 0.02)
> var(x1)
[1] 0.005275
> x2 <- c(0.04, 0.26, 0.11, 0.08, 0.36, 0.04, -0.05, -0.01, -0.10, 0.00, 0.01, 0.11)
> var(x2)
[1] 0.01668106
> 0.005275/0.01668106
[1] 0.3162269
> var.test(x1, x2)
F test to compare two variances
data:  x1 and x2
F = 0.31623, num df = 11, denom df = 11, p-value = 0.06884
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.09103463 1.09847706
sample estimates:
ratio of variances
0.3162269
Here the sample variances are not too different, 0.005 and 0.017, with a ratio of 0.32.
The var.test() function performs an F test of the null hypothesis that the two variances are equal, and gives a p-value of 0.06884. That is only weak evidence against the null hypothesis of equal variances, so I wouldn't reject it.
If you like, just for fun, you can carry out the F test yourself. The test takes the ratio of the sample variances (= 0.3162269) as the F statistic, and compares it to the F distribution with n1-1 = 11 and n2-1 = 11 degrees of freedom, as the two samples each have size 12 (ie. n1 = n2 = 12):
> pf(0.3162269, df1=11, df2=11)
[1] 0.03441958
This gives us a one-tailed p-value, which we need to multiply by 2 to get the two-tailed p-value:
> pf(0.3162269, df1=11, df2=11) * 2
[1] 0.06883916
This gives us the same p-value as var.test(), but using home-made R, hurray!
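One caveat (my own note): multiplying by 2 only works here because our F statistic is below 1, so the lower tail is the smaller one. A version that works whichever tail is smaller, mirroring the 2 * min(PVAL, 1 - PVAL) line you'll see in the var.test() source below:
> p <- pf(0.3162269, df1=11, df2=11)
> 2 * min(p, 1 - p)
[1] 0.06883916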
Note that if you want to convince yourself that var.test() does the same thing as my 'pf' lines above, you can peek at the var.test() code using:
> methods(var.test)
[1] var.test.default* var.test.formula*
see '?methods' for accessing help and source code
This tells us that there is a function var.test.default, let's look at it:
> getAnywhere(var.test.default)
A single object matching ‘var.test.default’ was found
It was found in the following places
registered S3 method for var.test from namespace stats
namespace:stats
with value
function (x, y, ratio = 1, alternative = c("two.sided", "less", "greater"), conf.level = 0.95, ...)
{
    if (!((length(ratio) == 1L) && is.finite(ratio) && (ratio > 0)))
        stop("'ratio' must be a single positive number")
    alternative <- match.arg(alternative)
    if (!((length(conf.level) == 1L) && is.finite(conf.level) &&
        (conf.level > 0) && (conf.level < 1)))
        stop("'conf.level' must be a single number between 0 and 1")
    DNAME <- paste(deparse(substitute(x)), "and", deparse(substitute(y)))
    if (inherits(x, "lm") && inherits(y, "lm")) {
        DF.x <- x$df.residual
        DF.y <- y$df.residual
        V.x <- sum(x$residuals^2)/DF.x
        V.y <- sum(y$residuals^2)/DF.y
    }
    else {
        x <- x[is.finite(x)]
        DF.x <- length(x) - 1L
        if (DF.x < 1L)
            stop("not enough 'x' observations")
        y <- y[is.finite(y)]
        DF.y <- length(y) - 1L
        if (DF.y < 1L)
            stop("not enough 'y' observations")
        V.x <- var(x)
        V.y <- var(y)
    }
    ESTIMATE <- V.x/V.y
    STATISTIC <- ESTIMATE/ratio
    PARAMETER <- c(`num df` = DF.x, `denom df` = DF.y)
    PVAL <- pf(STATISTIC, DF.x, DF.y)
    if (alternative == "two.sided") {
        PVAL <- 2 * min(PVAL, 1 - PVAL)
        BETA <- (1 - conf.level)/2
        CINT <- c(ESTIMATE/qf(1 - BETA, DF.x, DF.y), ESTIMATE/qf(BETA,
            DF.x, DF.y))
    }
    else if (alternative == "greater") {
        PVAL <- 1 - PVAL
        CINT <- c(ESTIMATE/qf(conf.level, DF.x, DF.y), Inf)
    }
    else CINT <- c(0, ESTIMATE/qf(1 - conf.level, DF.x, DF.y))
    names(STATISTIC) <- "F"
    names(ESTIMATE) <- names(ratio) <- "ratio of variances"
    attr(CINT, "conf.level") <- conf.level
    RVAL <- list(statistic = STATISTIC, parameter = PARAMETER,
        p.value = PVAL, conf.int = CINT, estimate = ESTIMATE,
        null.value = ratio, alternative = alternative,
        method = "F test to compare two variances", data.name = DNAME)
    attr(RVAL, "class") <- "htest"
    return(RVAL)
}
<bytecode: 0x7f7fde050840>
<environment: namespace:stats>
You can see that in this code they are using the pf function to get the p-value:
PVAL <- pf(STATISTIC, DF.x, DF.y)
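Putting the pieces together, here's a minimal home-made version of the two-sided test, based on the source above ('my.var.test' is just a name I made up, and I've dropped the argument checking and the 'lm' case):
> my.var.test <- function(x, y, conf.level = 0.95)
{
    DF.x <- length(x) - 1
    DF.y <- length(y) - 1
    ESTIMATE <- var(x)/var(y)                    # the F statistic (for a null ratio of 1)
    PVAL <- pf(ESTIMATE, DF.x, DF.y)
    PVAL <- 2 * min(PVAL, 1 - PVAL)              # two-sided p-value
    BETA <- (1 - conf.level)/2
    CINT <- c(ESTIMATE/qf(1 - BETA, DF.x, DF.y), # confidence interval for the ratio
              ESTIMATE/qf(BETA, DF.x, DF.y))
    list(statistic = ESTIMATE, p.value = PVAL, conf.int = CINT)
}
> my.var.test(x1, x2)$p.value
[1] 0.06883916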
Calculating a confidence interval for the variance of a Normal distribution using R
> mydata <- rnorm(100)
As it's a standard Normal distribution, the population variance is 1.
> var(mydata)
[1] 1.091236
The sample variance is s^2 = 1.091236.
1. Using the R 'lavaan' package:
I found out about this approach on stackexchange.com; it uses the lavaan R package.
> install.packages("lavaan")
> library(lavaan)
> mydataframe <- as.data.frame(mydata)
> names(mydataframe) <- "x"
> mymodel <- 'x ~~ x'
> fit <- sem(mymodel, data=mydataframe, likelihood = "wishart")
> parameterEstimates(fit)
lhs op rhs est se z pvalue ci.lower ci.upper
1 x ~~ x 1.091 0.155 7.036 0 0.787 1.395
This gives us the confidence interval (0.787, 1.395) for the population variance. I assume this is a 95% confidence interval.
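I believe you can check by asking parameterEstimates() for the confidence level explicitly (the 'level' argument is in the lavaan versions I've seen; treat this as an assumption):
> parameterEstimates(fit, level = 0.95)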
Note added some days later: I asked on the lavaan Google group, and they said that lavaan is not using the chi-squared distribution to calculate the confidence interval, but rather a Z-statistic. They said that as a result it does not give a very reliable confidence interval, and that this is discussed further here.
2. Using the chi-squared distribution
We can use the chi-squared distribution to get a confidence interval for the variance of a Normal distribution. I don't think this is what the 'lavaan' package does, as my results don't match those of the lavaan package (see below). As we have 100 data points (n=100), we use a chi-squared distribution with n-1=99 degrees of freedom. As we want a 95% confidence interval (alpha=0.05), we want the 0.025 (=0.05/2) and 0.975 (=1 - 0.05/2) quantiles of the chi-squared distribution with 99 degrees of freedom:
> qchisq(0.025, df=99)
[1] 73.36108
> qchisq(.975, df=99)
[1] 128.422
Then our confidence interval is given by ( (n-1)s^2 / 128.422, (n-1)s^2 / 73.36108), where n is our sample size of 100, and s^2 is our sample variance 1.091236:
> (99*1.091236)/128.422
[1] 0.8412294
> (99*1.091236)/73.36108
[1] 1.472611
That is, using the chi-squared distribution I get a 95% confidence interval of (0.841, 1.473) for the population variance.
Let's get a 99% confidence interval:
> (99*1.091236)/qchisq(.995, df=99)
[1] 0.7772852
> (99*1.091236)/qchisq(.005, df=99)
[1] 1.6243
That is, a 99% confidence interval is (0.777, 1.624).
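To avoid retyping this, here's a small helper function wrapping the calculation above (my own sketch; 'varCI' is just a name I made up):
> varCI <- function(x, conf.level = 0.95)
{
    n <- length(x)
    s2 <- var(x)             # the sample variance
    alpha <- 1 - conf.level
    c(lower = (n - 1) * s2 / qchisq(1 - alpha/2, df = n - 1),
      upper = (n - 1) * s2 / qchisq(alpha/2, df = n - 1))
}
> varCI(mydata)              # should give roughly (0.841, 1.473) for the data above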
Monday, 27 July 2015
Python testing using doctests
An easy way to search down bugs is to test your Python code using doctests.
For example, if you have a module like this called 'module.py':
#====================================================================#

# define a function to read GO terms from a line of a file:
def read_go_terms_from_line_of_file(line):
    """read GO terms from a line of a file

    >>> read_go_terms_from_line_of_file("791228 GO:0005515")
    ('791228', ['GO:0005515'])
    >>> read_go_terms_from_line_of_file("530242 GO:0000003,GO:0004675,GO:0005024,GO:0016020")
    ('530242', ['GO:0000003', 'GO:0004675', 'GO:0005024', 'GO:0016020'])
    >>> read_go_terms_from_line_of_file("794214 none")
    ('794214', [])
    """
    # eg.
    # 791228 GO:0005515
    # 5470 GO:0002119,GO:0004652,GO:0006952,GO:0007423,GO:0009792,GO:0010171,GO:0019915,GO:0040002,GO:0040011,GO:0040018,GO:0042302,GO:0042338,GO:0043631,GO:0045138
    # 530242 GO:0000003,GO:0004675,GO:0005024,GO:0016020
    # 794214 none
    temp = line.split()
    family = temp[0]
    go_terms = [] if temp[1] == "none" else temp[1].split(',')
    return(family, go_terms)

#====================================================================#

if __name__ == "__main__":
    import doctest
    doctest.testmod()
This module has just one function, 'read_go_terms_from_line_of_file', that reads the name of a gene family, and the GO terms for the genes in that family, from an input file. The function takes as its input a line of the file, and returns the name of the family (a string) and a list of the GO terms. At the top of the function are some doctests, which define what the function should return (according to the programmer) for particular example inputs. The end of the module has some code that says that if we run this module, by typing:
% python3 module.py
then the doctests are run. If you get no output when you run 'python3 module.py', it means that the doctests were run and all passed, and no bugs were found (hurray!). If you do get some output, it helps you find those nasty bugs, and you can hunt them down and squash them (hurray!)
To get more detailed output regarding which tests were run and which failed/passed, type:
% python3 module.py -v
If you want to include doctests in a program (not a module)
An alternative is to include doctests in your program: the program could have the same code as above, except that the end of the program would have a 'main' function that calls the functions, and the 'doctest' lines would be removed:
def main():
    read_go_terms_from_line_of_file("791228 GO:0005515")

if __name__=="__main__":
    main()
Then, to run the doctests we just run the program with the "-m doctest" option:
% python -m doctest program.py
If your function returns a dictionary
Dictionaries are unordered, so the order in which their items are printed can vary, which breaks a naive doctest. But you can compare the returned dictionary to the expected one with '==', which ignores order:
>>> find_midpoint_positions({'scaff1': ['50=100','150=200','250=300'], 'scaff2': ['100=200', '202=500'], 'scaff3': ['300=500']}) == {'scaff1': [125, 225], 'scaff2': [201]}
True
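For completeness, here's a hypothetical implementation that satisfies the doctest above ('find_midpoint_positions' is just an example name, not a standard function): it takes the midpoint of each gap between consecutive 'start=end' blocks, and drops scaffolds with no gaps (like 'scaff3', which has a single block).
def find_midpoint_positions(blocks):
    midpoints = {}
    for scaff, spans in blocks.items():
        # parse the 'start=end' strings into (start, end) integer pairs:
        pairs = [tuple(int(p) for p in span.split('=')) for span in spans]
        # midpoint of the gap between each consecutive pair of blocks:
        mids = [(pairs[i][1] + pairs[i+1][0]) // 2 for i in range(len(pairs) - 1)]
        if mids:  # scaffolds with a single block have no gaps, so are left out
            midpoints[scaff] = mids
    return midpoints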
Monday, 13 July 2015
Parsing a phylogenetic tree in Python using ETE2
ETE2 is currently in Python 2, but apparently a Python 3 version is coming soon (I hope so!)
For example, here's a simple script to read in a tree and traverse all the nodes in the tree, and for each node print its children, its parent, and all the leaves under (descendants of) the node:
import os
import sys
from ete2 import Tree

# note: this script is in python2, as ete2 is in python2

#====================================================================#

def main():

    # check the command-line arguments:
    if len(sys.argv) != 2 or os.path.exists(sys.argv[1]) == False:
        print("Usage: %s newick_file" % sys.argv[0])
        sys.exit(1)
    newick_file = sys.argv[1]

    # the code below is based on examples on http://pythonhosted.org/ete2/tutorial/tutorial_trees.html
    # load a tree from a file. The root node is put into 't'
    t = Tree(newick_file, format=1) # format=1 means internal node names are loaded

    # iterate over all the nodes in the tree, and get the children of each node:
    for node in t.traverse("postorder"):
        # get the children of this node:
        children = node.get_children()
        for child in children:
            print "node %s has child %s" % (node.name, child.name)

    # iterate over all the nodes in the tree, and get the parent of each node:
    for node in t.traverse("postorder"):
        # get the parent of this node:
        if not node.is_root(): # if it is not the root node
            print "node %s has parent %s" % (node.name, node.up.name)

    # iterate over all the nodes in the tree, and get the descendant species of each node:
    for node in t.traverse("postorder"):
        if not node.is_leaf(): # if it is not a leaf
            # get the descendant species (leaves):
            leaves = node.get_leaves()
            leaf_names = [leaf.name for leaf in leaves]
            print "node %s has descendants %s" % (node.name, ','.join(leaf_names))

    # iterate over all the nodes in the tree, and get the ancestor nodes of each node,
    # back to the root:
    for node in t.traverse("postorder"):
        # get the ancestors of this node:
        if not node.is_root():
            ancestor = node.up
            ancestors = [ancestor]
            while not ancestor.is_root():
                ancestor = ancestor.up
                ancestors.append(ancestor)
            ancestor_names = [ancestor.name for ancestor in ancestors]
            print "node %s has ancestors %s" % (node.name, ','.join(ancestor_names))

    print("FINISHED")

#====================================================================#

if __name__=="__main__":
    main()

#====================================================================#
Python list comprehensions
# first get all the names of all the 'leaf' objects in the list 'leaves':
leaf_names = [leaf.name for leaf in leaves]
# now print out a comma-separated string of all the names of all the leaves:
print "The names are %s" % (','.join(leaf_names))
Other examples:
You have a dictionary 'mydict' of lists, get those lists and flatten them into one big list:
The values (not keys) of 'mydict' are lists.
biglist = [x for sublist in mydict.values() for x in sublist]
# mydict.values() gives a list of lists, this flattens it into one big list
You have a dictionary 'fasta_seq' whose keys are protein names, and values are protein sequences, and want the lengths of the proteins:
This gives a list with the lengths of the proteins:
lens = [len(fasta_seq[x]) for x in list(fasta_seq.keys())]
You have a list of words, called 'temp', and want to find the words in that list that do not contain 'status:', 'UniProt:', 'protein_id:', or 'locus:'
matching = [x for x in temp if 'status:' not in x and 'UniProt:' not in x and 'protein_id:' not in x and 'locus:' not in x]
You have a list 'temp' of words, and only want those words that also appear in a second list 'temp2':
seqmatch_words = [x for x in temp if x in temp2]
You have a list 'stack' of tuples, and want the first member of each tuple:
This puts the first members of the tuples in a list of their own:
stack_members = [x[0] for x in stack]
You want to create a list containing 9 lists, each initialised to 9 zeros:
Matrix = [[0 for x in xrange(9)] for x in xrange(9)]
You want to create a dictionary with the length of each item in a list of strings:
genes = ['gene1', 'gene2', 'gene32']
mydict = {gene: len(gene) for gene in genes}
You want to create a dictionary, taking the keys from one list and the values from a second list:
genes = ['gene1', 'gene2', 'gene32']
values = [32, 22, 66]
mydict = {genes[i]: values[i] for i in range(len(genes))}
You want to create a set, taking the items in a list of strings that start with 's':
mylist = ['string1', 'hello', 'sausage']
myset = {x for x in mylist if x.startswith('s')}
You have a list of sets, and a dictionary with values for the items in the sets, and want to create a list of sublists, where each sublist has the dictionary values for the objects in one set:
mylist = [{'seq1', 'seq2'}, {'seq3', 'seq4', 'seq5'}, {'seq6'}]
mydict = {'seq1': 5, 'seq2': 4, 'seq3': 4, 'seq4': 3, 'seq5': 2, 'seq6': 2}
mybiglist = [[mydict[x] for x in sublist] for sublist in mylist]
You have two lists of equal lengths and want to merge them into a single list of tuples:
list1 = ['A', 'B', 'C', 'D']
list2 = ['a', 'b', 'c', 'd']
[(list1[i], list2[i]) for i in range(len(list1))]
[('A', 'a'), ('B', 'b'), ('C', 'c'), ('D', 'd')]
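Note that the built-in zip() does the same job here, and also covers the dictionary-from-two-lists example above:
list(zip(list1, list2))    # [('A', 'a'), ('B', 'b'), ('C', 'c'), ('D', 'd')]
dict(zip(genes, values))   # {'gene1': 32, 'gene2': 22, 'gene32': 66}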
Making a quick plot of a phylogenetic tree
Here are some nice ones:
- My colleague Bhavana told me about using FigTree [note to self: this is installed on the Sanger farm]
- On the ETE2 webpage, there is a very nice TreeView tool for drawing a tree, where you can just paste in your tree, and it makes a tree picture
- On the Sanger farm, if you log in using 'ssh -X', you can use the ETE2 command-line tool, by typing 'ete2' with a Newick tree file name, and a GUI pops up
- You can also plot a tree using a Python script that calls ETE2, for example, this script calls ETE2 and pops up a picture of the tree in a GUI (picture below): (again, for this you need to log into the Sanger farm using 'ssh -X')
import os
import sys
from ete2 import Tree

# note: this script is in python2, as ete2 is in python2

#====================================================================#

def main():

    # check the command-line arguments:
    if len(sys.argv) != 2 or os.path.exists(sys.argv[1]) == False:
        print("Usage: %s newick_file" % sys.argv[0])
        sys.exit(1)
    newick_file = sys.argv[1]

    # the code below is based on examples on http://pythonhosted.org/ete2/tutorial/tutorial_drawing.html
    # load a tree from a file. The root node is put into 't'
    t = Tree(newick_file, format=1) # format=1 means internal node names are loaded
    t.show()

#====================================================================#

if __name__=="__main__":
    main()

#====================================================================#
Other tools:
- My colleague Diogo also mentioned a new tool called PhyloCanvas, which he said is very fast.
- My colleague Sam Dougan mentioned Taxonium
Note to self (sorry, this is just for me!):
[To get the Newick version of a 50HG tree, for family 141670:
% get_tree_for_50HG_family_v75.pl 141670 es9_compara_homology_final2_75 newick ]
Thursday, 2 July 2015
Using TopGO to test for GO-term enrichment
How TopGO works
The details of how TopGO works are published in a paper by Alexa et al (2006). TopGO takes the GO hierarchy into account when calculating enrichment, and it is thought that this leads it to report fewer false positive results than methods that do not take the GO hierarchy into account (eg. just using a Fisher's exact test). Grossmann et al (2007) call this the 'inheritance problem': the higher (more general) GO terms inherit annotations from more specific descendant terms, and this can lead to false positives.
There are two methods described in the Alexa et al (2006) paper:
(i) 'elim' : this method processes the GO terms by traversing the GO hierarchy from bottom to top, ie. it first assesses the most specific (bottom-most) GO terms, and proceeds later to more general (higher) GO terms. When it assesses a higher (more general) GO term, it discards any genes that are annotated with significantly enriched descendant GO terms (considered significant using a pre-defined P-value threshold). This method does tend to miss some true positives at higher (more general) levels of the GO hierarchy.
(ii) 'weight': this is designed to miss fewer true positives than the 'elim' method. The significance scores of connected nodes (a parent and its child in the GO hierarchy) are compared in order to detect the locally most significant terms in the GO hierarchy.
The current TopGO actually has four more methods:
(iii) 'classic': each GO term is tested independently, not taking the GO hierarchy into account.
(iv) 'weight01': this is the default method used by TopGO, and is a mixture of the 'elim' and 'weight' methods.
(v) 'lea': this doesn't seem to be described in the TopGO vignette or any paper, so I'm not sure what it does (?)
(vi) 'parentchild': described by Grossmann et al (2007): when assessing a GO term, it takes into account the annotation of terms to the current term's parents, and so reduces false positives due to the inheritance problem. You can also use this algorithm in a very nice tool called Ontologizer, which has a lovely GUI (I used it in a collaboration on gene duplicates with Shane Woods, described in his paper Woods et al (2013)).
The 'elim' and 'weight' methods are designed to be more conservative (have fewer false positives) than the 'classic' method, so should give larger p-values in general. Simulation results reported by Alexa et al (2006) show that the 'weight' algorithm has fewer false positives than 'classic' and misses few true positives, while 'elim' has even fewer false positives than 'weight' but misses more true positives. Grossmann et al (2007) compare the 'elim', 'weight' and 'parentchild' algorithms, and say that each has advantages in certain scenarios.
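In topGO you choose between these algorithms with the 'algorithm' argument of runTest() (the full worked example is further down this post), eg.:
> resultElim <- runTest(myGOdata, algorithm="elim", statistic="fisher")
> resultWeight <- runTest(myGOdata, algorithm="weight", statistic="fisher")
> resultParentchild <- runTest(myGOdata, algorithm="parentchild", statistic="fisher")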
How to run TopGO
The vignette describes in detail how to use the package, but here's a summary of what I did (so I won't forget!):
R packages
It's important to have the latest versions of the packages, to make sure you have bug fixes, but also because the gene ontology itself is read into TopGO from another R package (GO.db), so you need to make sure you have the latest version of that package too.
First you need to make sure that you have the latest version of R.
Then make sure you have the latest version of Bioconductor, and the latest version of the Bioconductor package topGO:
> source("http://bioconductor.org/biocLite.R")
> biocLite()
> biocLite("topGO")
Then check you have the latest versions of everything (for me this should be R 3.2.0, Bioconductor 3.1, topGO 2.20.0, and GO.db 3.1.2, which comes with topGO):
> sessionInfo()
R version 3.2.0 (2015-04-16)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.9.5 (Mavericks)
locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] topGO_2.20.0 SparseM_1.6 GO.db_3.1.2
[4] RSQLite_1.0.0 DBI_0.3.1 AnnotationDbi_1.30.1
[7] GenomeInfoDb_1.4.1 IRanges_2.2.5 S4Vectors_0.6.1
[10] Biobase_2.28.0 graph_1.46.0 BiocGenerics_0.14.0
[13] BiocInstaller_1.18.3
loaded via a namespace (and not attached):
[1] lattice_0.20-31 grid_3.2.0 tools_3.2.0
Starting topGO
Once you have installed topGO you can start it by typing:
> library("topGO")
Loading your data
Reading in GO annotations for the genes
The file of gene-to-GO mappings that you provide should be a tab-delimited file that looks like this, where the first column is the gene name and the second column is the list of GO terms for that gene (separated by a comma and a space):
105778 GO:0016021, GO:0016020
166881 GO:0003674, GO:0016021, GO:0016020, GO:0008150
188691 GO:0016021, GO:0016020, GO:0001584, GO:0007165, GO:0004984, GO:0007186, GO:0004872
184033 GO:0005737, GO:0005515
For example, if your file is called 'annotations.txt', you can read it in using:
> geneID2GO <- readMappings(file = "annotations.txt")
For the fastest run-time, no GO term in a gene's list should be an ancestor of another GO term in the same list. However, TopGO will run fine if there are such redundant terms, just more slowly. In practice, I found that I didn't need to prune my lists of GO terms like this, as TopGO was pretty fast for my data set.
GO hierarchy
The GO hierarchy used by topGO is the one that comes with the GO.db package. Therefore the user does not need to provide this.
Defining your list of genes of interest, and the 'gene universe' to compare it to
You need to tell TopGO what is your list of genes of interest, and what is the list of all genes in the 'gene universe' that you want to compare your list of interest to.
The 'gene universe' can be all the genes that contain GO annotations in your input 'annotations.txt' file (see above):
> geneUniverse <- names(geneID2GO)
We can read in a list of interesting genes from a file, just containing a list of interesting genes, one per line:
> genesOfInterest <- read.table("interestinggenes.txt",header=FALSE)
> genesOfInterest <- as.character(genesOfInterest$V1)
Then we need to tell TopGO where these interesting genes appear in the 'geneUniverse' vector:
> geneList <- factor(as.integer(geneUniverse %in% genesOfInterest))
> names(geneList) <- geneUniverse
The geneList object tells TopGO which genes in the gene universe are your genes of interest.
Putting the data together into an R object
We need to put our data in an object of type 'topGOdata'. This will contain the list of genes of interest, the GO annotations, and the GO hierarchy itself.
> myGOdata <- new("topGOdata", description="My project", ontology="BP", allGenes=geneList, annot = annFUN.gene2GO, gene2GO = geneID2GO)
The 'ontology' argument can be 'BP' (biological process), 'MF' (molecular function), or 'CC' (cellular component).
The 'description' argument has a short description of your project.
The 'allGenes' argument specifies all the genes in the 'gene universe', and which of those are your genes of interest.
The 'annot' argument tells topGO how to map genes to GO annotations. 'annot'=annFUN.gene2GO means that the user provides gene-to-GO annotations, and we specify here that they are in object 'geneID2GO' (see above for how this was created).
An optional argument is 'nodeSize': this is used to prune the GO hierarchy, eg. nodeSize=10 removes GO terms that have fewer than 10 annotated genes.
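For example (the same call as above, just with the extra argument):
> myGOdata <- new("topGOdata", description="My project", ontology="BP", allGenes=geneList, annot = annFUN.gene2GO, gene2GO = geneID2GO, nodeSize = 10)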
TopGO generally only seems to use a subset of my 'gene universe' data, which it considers 'feasible'. I think this is because you may have some genes in your input 'gene universe' that don't have any GO terms annotated to them. Also, you might have some terms in the GO annotations for your gene universe that don't exist in the version of the GO hierarchy that TopGO uses (which comes from the GO.db R package). Unfortunately this means that TopGO actually uses only some of your input data, and throws the rest away. Note also that you may have chosen to use 'BP' (biological process) terms, but many of your input annotations may be molecular function (MF) or cellular component (CC) terms.
If you now type 'myGOdata', you get a summary of the data in it:
> myGOdata
Description:
- My project
Ontology:
- BP
4 available genes (all genes from the array):
- symbol: 105778 166881 188691 184033 ...
- 2 significant genes.
2 feasible genes (genes that can be used in the analysis):
- symbol: 166881 188691 ...
- 1 significant genes.
GO graph (nodes with at least 1 genes):
- a graph with directed edges
- number of nodes = 15
- number of edges = 22
Here 2 of the 4 genes in the 'gene universe' have been mapped to the GO ontology (there are 2 'feasible' genes), so the rest of the analysis will be based on just these genes. TopGO calls the genes of interest 'significant genes'. There are 2 genes of interest in the full gene universe of 4 genes, but just one gene of interest among the 2 feasible genes. The 'myGOdata' object takes some time to build, especially if you have a large gene universe. I found it pretty fast though, less than one minute for a gene universe of ~100,000 genes. I guess the run-time might depend on how much memory (RAM) you have available(?)
The list of genes of interest can be accessed using the method sigGenes():
> sg <- sigGenes(myGOdata)
> str(sg)
> numSigGenes(myGOdata)
Performing enrichment tests
One type of test that topGO does is a Fisher's exact test based on gene counts:
> resultFisher <- runTest(myGOdata, algorithm="classic", statistic="fisher")
Selecting 'algorithm=classic' means that the GO hierarchy isn't taken into account, so each GO term is tested independently. (Usually in fact we want to take the GO hierarchy into account, so would use algorithm='weight01' for example.)
You can type 'resultFisher' to get a summary of the results:
Description: My project
Ontology: BP
'classic' algorithm with the 'fisher' test
15 GO terms scored: 0 terms with p < 0.01
Annotation data:
Annotated genes: 2
Significant genes: 1
Min. no. of genes annotated to a GO: 1
Nontrivial nodes: 1
The p-values have not been corrected for multiple testing.
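If you do want to correct them yourself, one option (my suggestion, not something the topGO vignette does) is to apply base R's p.adjust() to the scores, bearing in mind that p-values from the hierarchy-aware algorithms are not independent, so a standard correction is only approximate:
> p.adj <- p.adjust(score(resultFisher), method = "BH")   # Benjamini-Hochberg adjusted p-values
> sum(p.adj < 0.05)                                       # how many terms survive the correction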
We can list the top ten significant results found:
> allRes <- GenTable(myGOdata, classicFisher = resultFisher, orderBy = "resultFisher", ranksOf = "classicFisher", topNodes = 10)
> allRes
        GO.ID                                         Term Annotated Significant Expected classicFisher
1  GO:0005525                                  GTP binding         3           2      0.3         0.026
2  GO:0019001                    guanyl nucleotide binding         3           2      0.3         0.026
3  GO:0032561                guanyl ribonucleotide binding         3           2      0.3         0.026
4  GO:0003746      translation elongation factor activity         1           1      0.1         0.101
5  GO:0005126                    cytokine receptor binding         1           1      0.1         0.101
6  GO:0005132      interferon-alpha/beta receptor binding         1           1      0.1         0.101
7  GO:0008135 translation factor activity, nucleic aci...         1           1      0.1         0.101
8  GO:0008332 low voltage-gated calcium channel activi...         1           1      0.1         0.101
9  GO:0016667 oxidoreductase activity, acting on a sul...         1           1      0.1         0.101
10 GO:0016670 oxidoreductase activity, acting on a sul...         1           1      0.1         0.101
We can visualise the position of the statistically significant GO terms in the GO hierarchy by using the following functions:
> showSigOfNodes(myGOdata, score(resultFisher), firstSigNodes = 5, useInfo ='all')
> printGraph(myGOdata, resultFisher, firstSigNodes = 5, fn.prefix = "tGO", useInfo = "all", pdfSW = TRUE)
The second command makes a pdf file ("tGO_classic_5_all.pdf") with the picture. The significant GO terms are shown as rectangles in the picture. The most significant terms are coloured red, and the least significant yellow.
Finding the genes annotated with significant GO terms
If you use the weight01 (default), elim or parentchild algorithm, TopGO may report a GO term as significantly enriched which wasn't actually in your input annotations, but rather is an ancestor (in the GO hierarchy) of a GO term in your input annotations.
The number of GO terms in the TopGO subset of the GO hierarchy (the GO terms annotated to genes in the gene 'universe' input file, plus the ancestors of those GO terms in the GO hierarchy) can be found using:
> length(usedGO(myGOdata))
[1] 2398
This should agree with the 'number of nodes' given for the GO graph when you type 'myGOdata', as it is the number of nodes in TopGO's internal version of the GO hierarchy (which is filtered just to contain your terms annotated to your gene universe, plus their ancestors).
You may find that TopGO reports terms "GO:0007610", "GO:0014070" and "GO:0045910" as significantly enriched, but how do we find the genes that TopGO is considering as annotated with those terms, without having a good look at the GO hierarchy ourselves?
After a bit of (slightly painful, as I'm not too good at R!) fiddling with TopGO, I found out that to list the genes annotated to a particular set of GO terms, we can type something like this:
> myterms = c("GO:0007610", "GO:0014070", "GO:0045910")
> mygenes <- genesInTerm(myGOdata, myterms)
> for (i in 1:length(myterms))
{
    myterm <- myterms[i]
    mygenesforterm <- mygenes[myterm][[1]]
    mygenesforterm <- paste(mygenesforterm, collapse=',')
    print(paste("Term", myterm, "genes:", mygenesforterm))
}
[1] "Term GO:0007610 genes: 173785,194776,210700,213042,222387,257151,272506,315440,347686,357493,361031,367238,397820,404238,435988,436908,533191,574720,575998,604672,615430,698226,718388,721238,759288,759468,759918,858210,866451,94187"
[1] "Term GO:0014070 genes: 1050444,1051974,1058304,1119532,1137071,12391,163450,263383,364148,454323,725603,737096,737144,765031,768366,892110,935204,961708,989693"
[1] "Term GO:0045910 genes: 215658"
Running TopGO using an R script
You can run TopGO using an R script by using a command such as:
Rscript Rscript1.R annotations.txt interestinggenes.txt output_file.txt
where Rscript1.R is your R script, annotations.txt and interestinggenes.txt are the files defined as above, and output_file.txt is the file to save R output in.
Rscript1.R has this text, and executable permissions:
# Rscript1.R
# eg. run:
# Rscript Rscript1.R annotations.txt interestinggenes.txt output_file
args <- commandArgs(trailingOnly = TRUE)
universeFile = args[1]
interestingGenesFile = args[2]
output_file = args[3]
# set the output file
sink(output_file)
# load topGO
library("topGO")
# read in the 'gene universe' file
geneID2GO <- readMappings(file = universeFile)
geneUniverse <- names(geneID2GO)
# read in the genes of interest
genesOfInterest <- read.table(interestingGenesFile,header=FALSE)
genesOfInterest <- as.character(genesOfInterest$V1)
geneList <- factor(as.integer(geneUniverse %in% genesOfInterest))
names(geneList) <- geneUniverse
# build the GOdata object in topGO
myGOdata <- new("topGOdata", description="My project", ontology="BP", allGenes=geneList, annot = annFUN.gene2GO, gene2GO = geneID2GO)
myGOdata
# run the Fisher's exact tests
resultClassic <- runTest(myGOdata, algorithm="classic", statistic="fisher")
resultElim <- runTest(myGOdata, algorithm="elim", statistic="fisher")
resultTopgo <- runTest(myGOdata, algorithm="weight01", statistic="fisher")
resultParentchild <- runTest(myGOdata, algorithm="parentchild", statistic="fisher")
# see how many results we get where weight01 gives a P-value <= 0.001:
mysummary <- summary(attributes(resultTopgo)$score <= 0.001)
numsignif <- as.integer(mysummary[[3]]) # the number of terms with P <= 0.001
# print out the top 'numsignif' results:
allRes <- GenTable(myGOdata, classicFisher = resultClassic, elimFisher = resultElim, topgoFisher = resultTopgo, parentchildFisher = resultParentchild, orderBy = "topgoFisher", ranksOf = "classicFisher", topNodes = numsignif)
allRes
# print a graph (to a pdf file) with the top 'numsignif' results:
output_file2 = paste(output_file,"Topgo", sep="_")
printGraph(myGOdata, resultTopgo, firstSigNodes = numsignif, fn.prefix = output_file2, useInfo = "all", pdfSW = TRUE)
# print out the genes that are annotated with the significantly enriched GO terms:
myterms <- allRes$GO.ID
mygenes <- genesInTerm(myGOdata, myterms)
for (i in 1:length(myterms))
{
    myterm <- myterms[i]
    mygenesforterm <- mygenes[myterm][[1]]
    myfactor <- mygenesforterm %in% genesOfInterest # find the genes that are in the list of genes of interest
    mygenesforterm2 <- mygenesforterm[myfactor == TRUE]
    mygenesforterm2 <- paste(mygenesforterm2, collapse=',')
    print(paste("Term", myterm, "genes:", mygenesforterm2))
}
# close the output file
sink()
Note: 15-Feb-2022: I found a nice blogpost by Zhigang Lu, on how to give TopGO scores for each gene as input (e.g. from Seurat P-values or something like that), and use them in a KS test in TopGO to calculate enrichment P-values: see Zhigang's blogpost here.
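The basic idea is a variant of the recipe above: instead of a 0/1 factor, 'allGenes' is given a named numeric vector of per-gene scores, and runTest() is called with statistic="ks". A minimal sketch (my own, based on topGO's score-based interface; 'genePvals' is a made-up name for your named vector of per-gene p-values, and 'geneSel' just defines which genes count as "interesting" for the count-based statistics):
> genePvals <- c("105778" = 0.001, "166881" = 0.2, "188691" = 0.03)   # names are gene IDs
> myGOdata2 <- new("topGOdata", ontology = "BP", allGenes = genePvals,
                   geneSel = function(p) p < 0.01,
                   annot = annFUN.gene2GO, gene2GO = geneID2GO)
> resultKS <- runTest(myGOdata2, algorithm = "weight01", statistic = "ks")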