Monday 18 July 2016

Transferring GO terms across species

I've written a pipeline to transfer GO terms to their orthologs in other species. Here are some notes to remind myself how to use it:

To run it for a new species (eg. T. regenti):
- Make a directory eg. /lustre/scratch108/parasites/alc/000_50HG_GO/tregenti

- Make a link to the file with the orthologs in the reference species:
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/trichobilharzia_regenti_orths.txt

- Make a link to the file with a protein list for this species:
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/trichobilharzia_regenti_protein_list.txt

- Make a file species_list.txt with just the name of the species of interest, eg. trichobilharzia_regenti

- Make a link to the file with locus tag prefixes:
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/species_locustagprefix

- Make links to the GO annotation and ontology files:
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/gene_association.fb
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/gene_association.goa_human
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/gene_association.WS243.wb.c_elegans
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/gene_association.zfin
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/go-basic.obo
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/go_terms_files

- Make links to files with mappings between identifiers:
% ln -s  /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/ensembl_zfish_ids
% ln -s  /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/uniprot_human_ids2
% ln -s /lustre/scratch108/parasites/bh4/Compara/50HG_Compara_75_Final/final_id_mapping/trichobilharzia_regenti_id_mapping.txt 
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/other_id_files

- Now run the script to assign GO terms to genes:
% /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/assign_go_terms_wrapper.sh species_list.txt > bsub_go_terms_jobs   
This does:
% bsub -q small -E 'test -e /nfs/users/nfs_a/alc' -R "select[mem>200] rusage[mem=200]" -M200 -o go_terms.o -e go_terms.e -J GO_terms_trichobilharzia_regenti assign_go_terms_to_genes.py trichobilharzia_regenti trichobilharzia_regenti_protein_list.txt trichobilharzia_regenti_orths.txt go_terms_files other_id_files go-basic.obo /nfs/helminths02/analysis/50HGP/01INTERPRO/TRE.protein.fa.gz.fas.ipr.gz trichobilharzia_regenti_GO_terms.txt trichobilharzia_regenti_id_mapping.txt
This makes an output file called trichobilharzia_regenti_GO_terms.txt. 

Summarising the results
Count the number of genes with GO terms:
% grep TRE trichobilharzia_regenti_GO_terms.txt | grep 'GO:' | cut -f1 | sort | uniq | wc
8719 [compared to 6215 for S. mansoni]
Count the number of genes with GO terms that came from curations in other species:
% grep TRE trichobilharzia_regenti_GO_terms.txt | grep 'GO:' | grep -v InterPro | cut -f1 | sort | uniq | wc
   3113   

Count the number of genes with GO terms that came from InterPro domains:
% grep TRE trichobilharzia_regenti_GO_terms.txt | grep 'GO:' | grep InterPro | cut -f1 | sort | uniq | wc
      7912      

Count the number of distinct GO terms in the file:
%  grep TRE trichobilharzia_regenti_GO_terms.txt | grep 'GO:' | cut -f4 | tr ',' '\n' | sort | uniq | wc 
2510 [compared to 2673 for S. mansoni] 
Count the number of transfers from each reference species:
% grep TRE trichobilharzia_regenti_GO_terms.txt | grep 'GO:' | cut -f2 | grep -v InterPro | tr ',' '\n' | cut -d"(" -f2 | sort  | uniq -c
   2171 caenorhabditis_elegans)
   1979 danio_rerio)
   2735 drosophila_melanogaster)
   4560 homo_sapiens)



Notes:
- Bhavana's files are all in /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/
- Bhavana's notes on how to run the pipeline are in /nfs/helminths02/analysis/50HGP/00ANALYSES/final_GO_terms/00README

Thank you:
Thank you Bhavana Harsha for making nice notes on how to run the pipeline.

Friday 15 July 2016

Using the linux grep, sort, and awk commands

The Linux grep, sort and awk commands (plus a few friends like tr and find) are really useful. Here are some of their nice features (I haven't added much here yet, but plan to add more as I go along):

grep examples
% grep -f file1 file2
This searches file2 for the patterns listed in file1 (one pattern per line); eg. file1 could contain a list of words, and grep would print the lines of file2 that contain any of them.
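For example (the file names and contents here are made up, just to illustrate):
% printf 'cat\ndog\n' > words.txt
% printf 'the cat sat\nthe bird flew\nthe dog ran\n' > sentences.txt
% grep -f words.txt sentences.txt
the cat sat
the dog ran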

Find lines of a file that contain at least one alphabetical character:
% grep -E '[A-Za-z]' file1

Get the last 5 characters of every line of a file (for lines that have at least 5 characters):
% grep -o '.....$' file1

sort examples
Sort numerically in descending order on column 15, using a tab as the field delimiter:
% sort -k15,15nr -t$'\t' file1
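For example (the file name and contents here are made up; without -t$'\t', sort would also treat the space in 'apple pie' as a field separator):
% printf 'apple pie\t3\nbanana\t10\ncherry\t7\n' > fruit.txt
% sort -t$'\t' -k2,2nr fruit.txt
banana	10
cherry	7
apple pie	3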

awk examples
Filter a tab-delimited file to just take lines in which column 17 has value >=50:
% awk -F"\t" '($17 >= 50)'  file1

Filter a tab-delimited file to just take lines in which column 2 has value '727':
% awk -F"\t" '$2 == 727' file1

Filter a tab-delimited file to just take lines in which column 2 has the string 'snow':
% awk -F"\t" '($2=="snow") {print}' file1

tr examples
Remove all double-quote (") characters from the lines of a file:
% cat file1 | tr -d '"'

find examples
Find files whose names start with 'pysvg':
% find . -name 'pysvg*'
% find /nfs/users/nfs_a/alc/Documents/PythonModules/ -name 'pysvg*'
Find files named exactly 'pysvg' (no wildcard, so only an exact name match):
% find . -name 'pysvg'


Finding all descendants of a GO term

I wanted to find all descendants of a GO term in the GO hierarchy. First I downloaded the hierarchy from the GO website:
% wget http://purl.obolibrary.org/obo/go/go-basic.obo

Then I wrote a little Python script to find all the descendants of a GO term:

from collections import defaultdict
import sys
import os

#====================================================================#

# define a function to record the children of each GO term in the GO hierarchy:

def read_go_children(input_go_obo_file):
    """record the children of each GO term in the GO hierarchy"""

    # first read in the input GO file, and make a list of all the GO terms, and the
    # terms below them in the GO hierarchy:
    # eg.
    # [Term]
    # id: GO:0004835
    children = defaultdict(list) # children of a particular GO term, in the hierarchy
    take = 0

    fileObj = open(input_go_obo_file, "r")
    for line in fileObj:
        line = line.rstrip()
        temp = line.split()
        if len(temp) == 1:
            if temp[0] == '[Term]':
                take = 1
        elif len(temp) >= 2 and take == 1:
            if temp[0] == 'id:':
                go = temp[1]
            elif temp[0] == 'is_a:': # eg. is_a: GO:0048308 ! organelle inheritance
                parent = temp[1]
                children[parent].append(go) # record that a child of 'parent' is term 'go'
        elif len(temp) == 0:
            take = 0
    fileObj.close()

    return children

#====================================================================#

# define a function to find all descendants of a GO term in the GO hierarchy:

def find_all_descendants(input_go_term, children):

    """find all the descendants of a GO term in the GO hierarchy
    >> find_all_descendants('GO1', {'GO1': ['GO2', 'GO3'], 'GO2': ['GO4']})
    ['GO1', 'GO2', 'GO3', 'GO4']
    """

    descendants = set()
    queue = []
    queue.append(input_go_term)
    while queue:
        node = queue.pop(0)
        if node in children and node not in descendants: # don't visit a second time
            node_children = children[node]
            queue.extend(node_children)
        descendants.add(node)

    return descendants 

#====================================================================#

def main():
   
    # check the command-line arguments:
    if len(sys.argv) != 3 or os.path.exists(sys.argv[1]) == False:
        print("Usage: %s input_go_obo_file input_go_term" % sys.argv[0])
        sys.exit(1)
    input_go_obo_file = sys.argv[1] # the input gene ontology file eg. gene_ontology.WS238.obo
    input_go_term = sys.argv[2] # the input GO term of interest

    # read in the children of each GO term in the GO hierarchy:
    children = read_go_children(input_go_obo_file)

    # find all the descendants of the term 'input_go_term':
    descendants = find_all_descendants(input_go_term, children)
    descendantlist = ",".join(descendants)
    print("DESCENDANTS:",descendantlist)

#====================================================================#

if __name__=="__main__":
    main()

#====================================================================#


To run it, for example to find the descendants of the term GO:0008291 (acetylcholine metabolic process), I type:
% python3 find_descendants_of_GO_term.py go-basic.obo 'GO:0008291'
This gives me:
DESCENDANTS: GO:0001507,GO:0006581,GO:0008291,GO:0008292

The descendants are:
GO:0006581 acetylcholine catabolic process 
GO:0001507 acetylcholine catabolic process in synaptic cleft (descendant of acetylcholine catabolic process)
GO:0008292 acetylcholine biosynthetic process
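To check the name of each descendant term, one quick way is to grep the obo file (this assumes the 'name:' line comes straight after the 'id:' line, as it does in go-basic.obo):
% grep -A1 '^id: GO:0006581' go-basic.obo
id: GO:0006581
name: acetylcholine catabolic process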

Hurray for Python!

Notes
- I see a discussion on biostars about doing something similar in R. I didn't check if that gives the same result.
- It turns out that you can also do this using WormMine (I don't know if this is for just C. elegans-related GO terms though)

Thursday 14 July 2016

Viewing a multiple alignment

I was looking for a quick way to view a multiple alignment, and found MView at the EBI. Nice and easy!

Friday 8 July 2016

Calculating a conservation score for a multiple alignment

I wanted to calculate a conservation score for a multiple alignment in clustalw format, so investigated how to do this.

I found a nice biostars discussion on different ways to do this, which mentioned a program by Capra & Singh (Bioinformatics 2007). It can calculate a score using 'Jensen-Shannon divergence'. This is the one I used in the end (see below).

I also heard from Daniel Caffrey about the Pfaat software for viewing and editing multiple alignments, which calculates conservation scores based on von Neumann entropy, Shannon entropy etc.

Calculating a conservation score using Capra & Singh's program

1. First I downloaded their program from their website. It's a Python 2 script, so easy to install!
(note to self: I put it in /nfs/users/nfs_a/alc/Documents/bin/conservationscore/conservation_code/score_conservation.py)

2. I had an input clustalw file called 191613.cw
(note to self: to get an alignment for a family from our 50HG database:
% perl -w get_aln_for_50HG_family_v75.pl 191613 es9_compara_homology_final2_75 > 191613.cw)
Note: I found that the score_conservation.py script by Capra & Singh can't handle '*' characters in the protein alignment, so I changed these to 'X' characters first (using tr '*' 'X'). It also can't handle ':' characters in the sequence names.
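For example, to make a cleaned copy of the alignment before scoring (the output file name here is made up; any ':' characters in the sequence names would still need to be renamed separately):
% cat 191613.cw | tr '*' 'X' > 191613_clean.cw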

3. To calculate a score using their program:
% python score_conservation.py 191613.cw
It seems to need Python 2, and it also needs the input alignment file to be in the same directory as the script. By default, this scores the alignment using Jensen-Shannon divergence, a window size of 3 (on either side of the residue), and the blosum62 background distribution.

4. The output file looks like this:
# /nfs/users/nfs_a/alc/Documents/git/helminth_scripts/scripts/compara_50HG/191613.cw -- js_divergence - window_size: 3 - background: blosum62 - seq. weighting: True - gap penalty: 1 - normalized: False
# align_column_number   score   column

0       -1000.00000     --------------------------------------------------------------------------------------------------------------------M----------------------------------------
1       -1000.00000     --------------------------------------------------------------------------------------------------------------------S----------------------------------------
...
5181    0.72827 MM-M-MLMMM-MMMMMMMMMMMMLMMMMM-MMM-MMMMMM-MMMMMM-M--MMMMMMMM---MM-M----MM--MM-MMMMLIMM-MMMMMM-MM--MMMMM-M-MMMM--MM-M-LMMMMMMM-M-MM-MM-MMMMMMM-MM--MMMMMMMMMMMM
5182    0.71347 GG-G-GGGGG-GGGGSAGGGGGGGGGGGG-GSG-GGGGGG-GGGGGG-A--GEGGGGGG---GG-G----GG--GG-GGGGGGGG-AGGGDG-GG--GKGGG-G-GGGG--GG-G-GGGGEGGG-G-GG-GG-GGGGGGG-GA--GGGGGGGGGGGG

...
 

Each row is an alignment position, and in the third field each character corresponds to one sequence. It seems some alignment positions have a score of -1000 (in the example above these are columns that are mostly gaps); these positions should probably be ignored. To get an overall conservation score for the alignment, you could take the median of the remaining positions' scores.
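A rough sketch of how that median could be calculated with awk (the file name 191613_scores.txt is made up; this assumes the scores have been saved to a file and that the score is the second whitespace-separated column, as in the output above):
% grep -v '^#' 191613_scores.txt | awk 'NF >= 2 && $2 != -1000 {print $2}' | sort -n > kept_scores.txt
% awk '{a[NR]=$1} END {if (NR % 2) {print a[(NR+1)/2]} else {print (a[NR/2] + a[NR/2+1])/2}}' kept_scores.txt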