Thursday, 28 October 2021

Using Weblogo for sequence logos

A very nice tool for creating sequence logos is the Weblogo website.
You can paste in a multiple alignment like this:

AATGGAAGTGGAAAATCTGTTAGCA
TTATATTAGGAAAATCGTTATAGCA
ATTATGAGTGGAAAATCATGTAGCA
GAAATCAATTGATAGAATATGAGCA


and get back a sequence logo, lovely!

 


 

 



Monday, 25 October 2021

Using Ontologizer for GO enrichment analysis

I previously used the Ontologizer tool for GO enrichment analysis, but that was a few years ago (see Woods et al PMID:23675306). I remember it having a nice algorithm that takes the relationship between parent and child GO terms into account. I decided to use it again, this time to test for GO enrichment among a list of Schistosoma mansoni genes.  

Installing Ontologizer

I downloaded Ontologizer onto the Sanger compute farm using:
% wget http://ontologizer.de/cmdline/Ontologizer.jar 
This is the command-line version of Ontologizer.
You can see instructions for installing it and running it here: here.

Note I remember there was also a beautiful GUI for Ontologizer. There are instructions for installing it here. However, unfortunately I wasn't able to figure out how to install it this time (I tried to install on my Mac laptop, and found it needs a file swt.jar, which I wasn't able to find on the web, the links to that seem to be broken now.) Oh well!

Preparing the inputs for Ontologizer 

First I downloaded the Gene ontology hierarchy file using:
% wget http://purl.obolibrary.org/obo/go/go-basic.obo

Next I wanted to get the GO annotations for Schistosoma mansoni. There are not curated annotations on the GO downloads page, so I used BioMart in WormBase ParaSite to get a set of predicted GO annotations:
- For the Query, I chose the genome to be Schistosoma mansoni
- I selected the Output attributes to be GO: GO term accession, GO term name, GO term definition, GO term evidence code, GO domain
- I clicked on 'Results' at the top of the webpage
- I clicked on 'Export as CSV' and got the file 'mart_export.txt', which I renamed as 'smansoni_biomart.txt'
 
For Ontologizer, the annotations need to be in a format called GAF format. I wrote a little perl script make_gaf.pl  to convert the file 'smansoni_biomart.txt' to GAF format:
% perl -w make_gaf.pl smansoni_biomart.txt  > smansoni.gaf
Here is my perl script make_gaf.pl:
 
#!/usr/local/bin/perl

# read in the GO data for S. mansoni from WormBase ParaSite Biomart:
$go_data = $ARGV[0]; # smansoni_biomart.txt

# Genome project,Gene stable ID,GO term accession,GO term name,GO term definition,GO term evidence code,GO domain
# schistosoma_mansoni_prjea36577,Smp_000020,,,,,
# schistosoma_mansoni_prjea36577,Smp_000030,GO:0000502,proteasome complex,"A large multisubunit complex which catalyzes protein degradation, found in eukaryotes, archaea and some bacteria. In eukaryotes, this complex consists of the
barrel shaped proteasome core complex and one or two associated proteins or complexes that act in regulating entry into or exit from the core.",IEA,cellular_component
# schistosoma_mansoni_prjea36577,Smp_000030,GO:0042176,regulation of protein catabolic process,"Any process that modulates the frequency, rate or extent of the chemical reactions and pathways resulting in the breakdown of a protein b
y the destruction of the native, active configuration, with or without the hydrolysis of peptide bonds.",IEA,biological_process
# schistosoma_mansoni_prjea36577,Smp_000030,GO:0030234,enzyme regulator activity,Binds to and modulates the activity of an enzyme.,IEA,molecular_function
# schistosoma_mansoni_prjea36577,Smp_000030,GO:0050790,regulation of catalytic activity,Any process that modulates the activity of an enzyme.,IEA,biological_process

print "!gaf-version: 2.1\n";
open(GO,"$go_data") || die "ERROR: cannot open $go_data\n";
while(<GO>)
{
    $line = $_;
    chomp $line;
    if (substr($line,0,3) eq 'sch' && $line =~ /GO:/)
    {
        @temp = split(/\,/,$line);
        $gene = $temp[1]; # e.g. Smp_000030
        $term = $temp[2]; # e.g. GO:0000502
        $evid = $temp[$#temp-1]; # e.g. IEA
        $type = $temp[$#temp]; # e.g. biological_process
        if    ($type eq 'biological_process') { $type = 'P';}
        elsif ($type eq 'molecular_function') { $type = 'F';}
        elsif ($type eq 'cellular_component') { $type = 'C';}
        # Note: S. mansoni has taxon_id 6183 in the NCBI Taxonomy: https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=6183
        # Column 1: DB                 :  required=1   e.g. UniProtKB or WormBase ParaSite
        # Column 2: DB Object ID       :  required=1   e.g. P12345
        # Column 3: DB Object Symbol   :  required=1   e.g. PHO3
        # Column 4: DB Object Qualifier:  optional     e.g. NOT
        # Column 5: GO ID              :  required=1   e.g. GO:0003993
        # Column 6: DB:Reference       :  required=1   e.g. PMID:2676709
        # Column 7: Evidence code      :  required=1   e.g. IMP
        # Column 8: With/From          :  optional     e.g. GO:0000346
        # Column 9: Aspect             :  required=1   e.g. F
        # Column10: DB Object name     :  optional     e.g. Toll-like receptor 4
        # Column11: DB Object synonym  :  optional     e.g. hToll
        # Column12: DB Object type     :  required=1   e.g. protein
        # Column13: Taxon              :  required=1   e.g. taxon:9606
        # Column14: Date               :  required=1   e.g. 20090118
        # Column15: Assigned By        :  required=1   e.g. SGD
        # Column16: Annotation extension: optional     e.g. part_of(CL:0000576)
        # Column17: Gene Product Form ID: optional     e.g. UniProtKB:P12345-2
        $gene_symbol = $gene."_symbol";
        $gene_alias = $gene."_alias";
        print "WB\t$gene\t$gene_symbol\tinvolved_in\t$term\tpubmed\t$evid\t\t$type\t\t$gene_alias\tgene\ttaxon:6183\t20211022\tWB\t\t\n";
    }
}
close(GO);

print STDERR "FINISHED\n";

The other things you need for Ontologizer are a list of your genes of interest, and a list of all genes (a background set). I had a list of all genes in the S. mansoni gene set as the background set (called 'all_schisto_v7'). The file with my list of genes of interest was called cluster0_genesB.

Running Ontologizer

I found that I was not able to run Ontologizer on the head node of the Sanger farm, so submitted it as a farm job. Here is the command I used:
% bsub -o cluster0_genes.o -e cluster0_genes.e -R "select[mem>1000] rusage[mem=1000]" -M1000 "/usr/bin/java -Xmx1G -jar Ontologizer.jar -g go-basic.obo -a smansoni.gaf -p all_schisto_v7 -s cluster0_genesB"
That is, the actual Ontologizer command was:
% java -jar Ontologizer.jar -g go-basic.obo -a smansoni.gaf -p all_schisto_v7 -s cluster0_genesB
 
This made an output file: table-cluster0_genesB-Parent-Child-Union-None.txt
The output file looks a bit like this:
 
ID      Pop.total       Pop.term        Study.total     Study.term      Pop.family      Study.family    nparents        is.trivial      p       p.adjusted      p.min   name
GO:0031224      10129   2072    1464    228     4152    290     2       false   1.797278481560433E-25   1.797278481560433E-25   0.0     "intrinsic component of membrane"
GO:0016020      10129   2345    1464    234     4152    290     1       false   2.0730394290778032E-19  2.0730394290778032E-19  0.0     "membrane"
GO:0017171      10129   60      1464    25      886     75      1       false   1.5850273023050352E-13  1.5850273023050352E-13  9.159281107768625E-95   "serine hydrolase activity"
GO:0006508      10129   346     1464    51      1247    82      1       false   1.471129482398609E-11   1.471129482398609E-11   6.0266E-319     "proteolysis"
GO:0008233      10129   250     1464    42      1321    89      2       false   2.535618270865147E-10   2.535618270865147E-10   1.6965153175154988E-277 "peptidase activity"
GO:0008289      10129   72      1464    16      3678    154     1       false   2.2267642693677877E-8   2.2267642693677877E-8   2.3266792066569023E-153 "lipid binding"
GO:0008236      10129   60      1464    25      250     42      2       false   4.6040476340954584E-8   4.6040476340954584E-8   2.491599802996095E-59   "serine-type peptidase activity"
GO:0004252      10129   52      1464    25      162     37      2       false   4.0367746142102156E-7   4.0367746142102156E-7   1.041762945213498E-43   "serine-type endopeptidase activity"
GO:1901564      10129   1438    1464    93      2479    118     2       false   9.21145381068937E-7     9.21145381068937E-7     0.0     "organonitrogen compound metabolic process"
...
 
By default, Ontologizer does not use any correction for multiple testing, so I wanted to correct for the fact that I was testing so many GO terms at once. I decided to use the Bonferroni correction:
% bsub -o cluster0_genes.o -e cluster0_genes.e -R "select[mem>1000] rusage[mem=1000]" -M1000 "/usr/bin/java -Xmx1G -jar Ontologizer.jar -g go-basic.obo -a smansoni.gaf -p all_schisto_v7 -s cluster0_genesB -m Bonferroni"
 
% head table-cluster0_genesB-Parent-Child-Union-Bonferroni.txt
ID      Pop.total       Pop.term        Study.total     Study.term      Pop.family      Study.family    nparents        is.trivial      p       p.adjusted      p.min   name
GO:0031224      10129   2072    1464    228     4152    290     2       false   1.797278481560433E-25   1.7595356334476639E-22  0.0     "intrinsic component of membrane"
GO:0016020      10129   2345    1464    234     4152    290     1       false   2.0730394290778032E-19  2.0295056010671693E-16  0.0     "membrane"
GO:0017171      10129   60      1464    25      886     75      1       false   1.5850273023050352E-13  1.5517417289566294E-10  9.159281107768625E-95   "serine hydrolase activity"
GO:0006508      10129   346     1464    51      1247    82      1       false   1.471129482398609E-11   1.4402357632682383E-8   6.0266E-319     "proteolysis"
GO:0008233      10129   250     1464    42      1321    89      2       false   2.535618270865147E-10   2.482370287176979E-7    1.6965153175154988E-277 "peptidase activity"
GO:0008289      10129   72      1464    16      3678    154     1       false   2.2267642693677877E-8   2.180002219711064E-5    2.3266792066569023E-153 "lipid binding"
GO:0008236      10129   60      1464    25      250     42      2       false   4.6040476340954584E-8   4.5073626337794536E-5   2.491599802996095E-59   "serine-type peptidase activity"
GO:0004252      10129   52      1464    25      162     37      2       false   4.0367746142102156E-7   3.952002347311801E-4    1.041762945213498E-43   "serine-type endopeptidase activity"
GO:1901564      10129   1438    1464    93      2479    118     2       false   9.21145381068937E-7     9.018013280664893E-4    0.0     "organonitrogen compound metabolic process"
 
More information on Ontologizer
It is also possible to produce some graphical output from the command-line version of Ontologizer, you can see details of that here.

Tuesday, 19 October 2021

Read data from an Excel file into Python using pandas

The Python pandas package can be used to read data from an Excel file into Python.

For example, I had an Excel file SimpleData.xlsx with three columns (showing the first few rows below):








To read it into Python using pandas, I first installed Pandas using Anaconda (which I had already installed on my computer, a Mac laptop):
% conda install -c conda-forge pandas
I also found that I needed a package called openpyxl to be able to read Excel using pandas: 
% conda install -c conda-forge openpyxl
Then I opened Python using:
% python3
and within the Python prompt typed:
>>> import pandas as pd
Now make a dataframe in pandas:
>>> mydata = pd.read_excel("SimpleData.xlsx")
Now print out the dataframe 'mydata':
>>> mydata
   Cmpd       MW  LogP
0    C1  277.330  3.29
1    C2  374.521  3.60
2    C3  357.360  3.56
3    C4  509.040  5.48
4    C5  424.480  3.03
..  ...      ...   ...
76  C77  954.660  0.00
77  C78  348.358  2.08
78  C79  501.070  3.65
79  C80  470.461  3.63
80  C81  302.780  4.91

[81 rows x 3 columns]

Hurray!