I previously used the Ontologizer tool for GO enrichment analysis, but that was a few years ago (see Woods et al PMID:23675306). I remember it having a nice algorithm that takes the relationship between parent and child GO terms into account. I decided to use it again, this time to test for GO enrichment among a list of Schistosoma mansoni genes.
Installing Ontologizer
I downloaded Ontologizer onto the Sanger compute farm using:
% wget http://ontologizer.de/cmdline/Ontologizer.jar
This is the command-line version of Ontologizer.
You can see instructions for installing it and running it here:
here.
Note I remember there was also a beautiful GUI for Ontologizer. There are instructions for installing it
here. However, unfortunately I wasn't able to figure out how to install it this time (I tried to install on my Mac laptop, and found it needs a file swt.jar, which I wasn't able to find on the web, the links to that seem to be broken now.) Oh well!
Preparing the inputs for Ontologizer
First I downloaded the Gene ontology hierarchy file using:
% wget http://purl.obolibrary.org/obo/go/go-basic.obo
Next I wanted to get the GO annotations for Schistosoma mansoni. There are not curated annotations on the GO downloads page, so I used BioMart in WormBase ParaSite to get a set of predicted GO annotations:
- For the Query, I chose the genome to be Schistosoma mansoni
- I selected the Output attributes to be GO: GO term accession, GO term name, GO term definition, GO term evidence code, GO domain
- I clicked on 'Results' at the top of the webpage
- I clicked on 'Export as CSV' and got the file 'mart_export.txt', which I renamed as 'smansoni_biomart.txt'
For Ontologizer, the annotations need to be in a format called
GAF format. I wrote a little perl script make_gaf.pl to convert the file 'smansoni_biomart.txt' to GAF format:
% perl -w make_gaf.pl smansoni_biomart.txt > smansoni.gaf
Here is my perl script make_gaf.pl:
#!/usr/local/bin/perl
# read in the GO data for S. mansoni from WormBase ParaSite Biomart:
$go_data = $ARGV[0]; # smansoni_biomart.txt
# Genome project,Gene stable ID,GO term accession,GO term name,GO term definition,GO term evidence code,GO domain
# schistosoma_mansoni_prjea36577,Smp_000020,,,,,
# schistosoma_mansoni_prjea36577,Smp_000030,GO:0000502,proteasome complex,"A large multisubunit complex which catalyzes protein degradation, found in eukaryotes, archaea and some bacteria. In eukaryotes, this complex consists of the
barrel shaped proteasome core complex and one or two associated proteins or complexes that act in regulating entry into or exit from the core.",IEA,cellular_component
# schistosoma_mansoni_prjea36577,Smp_000030,GO:0042176,regulation of protein catabolic process,"Any process that modulates the frequency, rate or extent of the chemical reactions and pathways resulting in the breakdown of a protein b
y the destruction of the native, active configuration, with or without the hydrolysis of peptide bonds.",IEA,biological_process
# schistosoma_mansoni_prjea36577,Smp_000030,GO:0030234,enzyme regulator activity,Binds to and modulates the activity of an enzyme.,IEA,molecular_function
# schistosoma_mansoni_prjea36577,Smp_000030,GO:0050790,regulation of catalytic activity,Any process that modulates the activity of an enzyme.,IEA,biological_process
print "!gaf-version: 2.1\n";
open(GO,"$go_data") || die "ERROR: cannot open $go_data\n";
while(<GO>)
{
$line = $_;
chomp $line;
if (substr($line,0,3) eq 'sch' && $line =~ /GO:/)
{
@temp = split(/\,/,$line);
$gene = $temp[1]; # e.g. Smp_000030
$term = $temp[2]; # e.g. GO:0000502
$evid = $temp[$#temp-1]; # e.g. IEA
$type = $temp[$#temp]; # e.g. biological_process
if ($type eq 'biological_process') { $type = 'P';}
elsif ($type eq 'molecular_function') { $type = 'F';}
elsif ($type eq 'cellular_component') { $type = 'C';}
# Note: S. mansoni has taxon_id 6183 in the NCBI Taxonomy: https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=6183
# Column 1: DB : required=1 e.g. UniProtKB or WormBase ParaSite
# Column 2: DB Object ID : required=1 e.g. P12345
# Column 3: DB Object Symbol : required=1 e.g. PHO3
# Column 4: DB Object Qualifier: optional e.g. NOT
# Column 5: GO ID : required=1 e.g. GO:0003993
# Column 6: DB:Reference : required=1 e.g. PMID:2676709
# Column 7: Evidence code : required=1 e.g. IMP
# Column 8: With/From : optional e.g. GO:0000346
# Column 9: Aspect : required=1 e.g. F
# Column10: DB Object name : optional e.g. Toll-like receptor 4
# Column11: DB Object synonym : optional e.g. hToll
# Column12: DB Object type : required=1 e.g. protein
# Column13: Taxon : required=1 e.g. taxon:9606
# Column14: Date : required=1 e.g. 20090118
# Column15: Assigned By : required=1 e.g. SGD
# Column16: Annotation extension: optional e.g. part_of(CL:0000576)
# Column17: Gene Product Form ID: optional e.g. UniProtKB:P12345-2
$gene_symbol = $gene."_symbol";
$gene_alias = $gene."_alias";
print "WB\t$gene\t$gene_symbol\tinvolved_in\t$term\tpubmed\t$evid\t\t$type\t\t$gene_alias\tgene\ttaxon:6183\t20211022\tWB\t\t\n";
}
}
close(GO);
print STDERR "FINISHED\n";
The other things you need for Ontologizer are a list of your genes of interest, and a list of all genes (a background set). I had a list of all genes in the S. mansoni gene set as the background set (called 'all_schisto_v7'). The file with my list of genes of interest was called cluster0_genesB.
Running Ontologizer
I found that I was not able to run Ontologizer on the head node of the Sanger farm, so submitted it as a farm job. Here is the command I used:
% bsub -o cluster0_genes.o -e cluster0_genes.e -R "select[mem>1000] rusage[mem=1000]" -M1000 "/usr/bin/java -Xmx1G -jar Ontologizer.jar -g go-basic.obo -a smansoni.gaf -p all_schisto_v7 -s cluster0_genesB"
That is, the actual Ontologizer command was:
% java -jar Ontologizer.jar -g go-basic.obo -a smansoni.gaf -p all_schisto_v7 -s cluster0_genesB
This made an output file: table-cluster0_genesB-Parent-Child-Union-None.txt
The output file looks a bit like this:
ID Pop.total Pop.term Study.total Study.term Pop.family Study.family nparents is.trivial p p.adjusted p.min name
GO:0031224 10129 2072 1464 228 4152 290 2 false 1.797278481560433E-25 1.797278481560433E-25 0.0 "intrinsic component of membrane"
GO:0016020 10129 2345 1464 234 4152 290 1 false 2.0730394290778032E-19 2.0730394290778032E-19 0.0 "membrane"
GO:0017171 10129 60 1464 25 886 75 1 false 1.5850273023050352E-13 1.5850273023050352E-13 9.159281107768625E-95 "serine hydrolase activity"
GO:0006508 10129 346 1464 51 1247 82 1 false 1.471129482398609E-11 1.471129482398609E-11 6.0266E-319 "proteolysis"
GO:0008233 10129 250 1464 42 1321 89 2 false 2.535618270865147E-10 2.535618270865147E-10 1.6965153175154988E-277 "peptidase activity"
GO:0008289 10129 72 1464 16 3678 154 1 false 2.2267642693677877E-8 2.2267642693677877E-8 2.3266792066569023E-153 "lipid binding"
GO:0008236 10129 60 1464 25 250 42 2 false 4.6040476340954584E-8 4.6040476340954584E-8 2.491599802996095E-59 "serine-type peptidase activity"
GO:0004252 10129 52 1464 25 162 37 2 false 4.0367746142102156E-7 4.0367746142102156E-7 1.041762945213498E-43 "serine-type endopeptidase activity"
GO:1901564 10129 1438 1464 93 2479 118 2 false 9.21145381068937E-7 9.21145381068937E-7 0.0 "organonitrogen compound metabolic process"
...
By default, Ontologizer does not use any correction for multiple testing, so I wanted to correct for the fact that I was testing so many GO terms at once. I decided to use the Bonferroni correction:
% bsub -o cluster0_genes.o -e cluster0_genes.e -R "select[mem>1000]
rusage[mem=1000]" -M1000 "/usr/bin/java -Xmx1G -jar Ontologizer.jar -g
go-basic.obo -a smansoni.gaf -p all_schisto_v7 -s cluster0_genesB -m Bonferroni"
% head table-cluster0_genesB-Parent-Child-Union-Bonferroni.txt
ID Pop.total Pop.term Study.total Study.term Pop.family Study.family nparents is.trivial p p.adjusted p.min name
GO:0031224 10129 2072 1464 228 4152 290 2 false 1.797278481560433E-25 1.7595356334476639E-22 0.0 "intrinsic component of membrane"
GO:0016020 10129 2345 1464 234 4152 290 1 false 2.0730394290778032E-19 2.0295056010671693E-16 0.0 "membrane"
GO:0017171 10129 60 1464 25 886 75 1 false 1.5850273023050352E-13 1.5517417289566294E-10 9.159281107768625E-95 "serine hydrolase activity"
GO:0006508 10129 346 1464 51 1247 82 1 false 1.471129482398609E-11 1.4402357632682383E-8 6.0266E-319 "proteolysis"
GO:0008233 10129 250 1464 42 1321 89 2 false 2.535618270865147E-10 2.482370287176979E-7 1.6965153175154988E-277 "peptidase activity"
GO:0008289 10129 72 1464 16 3678 154 1 false 2.2267642693677877E-8 2.180002219711064E-5 2.3266792066569023E-153 "lipid binding"
GO:0008236 10129 60 1464 25 250 42 2 false 4.6040476340954584E-8 4.5073626337794536E-5 2.491599802996095E-59 "serine-type peptidase activity"
GO:0004252 10129 52 1464 25 162 37 2 false 4.0367746142102156E-7 3.952002347311801E-4 1.041762945213498E-43 "serine-type endopeptidase activity"
GO:1901564 10129 1438 1464 93 2479 118 2 false 9.21145381068937E-7 9.018013280664893E-4 0.0 "organonitrogen compound metabolic process"
More information on Ontologizer
It is also possible to produce some graphical output from the command-line version of Ontologizer, you can see details of that here.