Monday, 18 July 2016

Transferring GO terms across species

I've written a pipeline to transfer GO terms to their orthologs in other species. Here are some notes to remind myself how to use it:

To run it for a new species (eg. T. regenti):
- Make a directory eg. /lustre/scratch108/parasites/alc/000_50HG_GO/tregenti

- Make a link to the file with the orthologs in the reference species:
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/trichobilharzia_regenti_orths.txt

- Make a link to the file with a protein list for this species:
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/trichobilharzia_regenti_protein_list.txt

- Make a file species_list.txt with just the name of the species of interest, eg. trichobilharzia_regenti

- Make a link to the file with locus tag prefixes:
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/species_locustagprefix

 - Make links to the GO annotation and ontology files:
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/gene_association.fb
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/gene_association.goa_human
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/gene_association.WS243.wb.c_elegans
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/gene_association.zfin
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/go-basic.obo
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/go_terms_files

- Make links to files with mappings between identifiers:
% ln -s  /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/ensembl_zfish_ids
% ln -s  /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/uniprot_human_ids2
% ln -s /lustre/scratch108/parasites/bh4/Compara/50HG_Compara_75_Final/final_id_mapping/trichobilharzia_regenti_id_mapping.txt 
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/other_id_files

- Now run the script to assign GO terms to genes:
% /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/assign_go_terms_wrapper.sh species_list.txt > bsub_go_terms_jobs   
This does:
% bsub -q small -E 'test -e /nfs/users/nfs_a/alc' -R "select[mem>200] rusage[mem=200]" -M200 -o go_terms.o -e go_terms.e -J GO_terms_trichobilharzia_regenti assign_go_terms_to_genes.py trichobilharzia_regenti trichobilharzia_regenti_protein_list.txt trichobilharzia_regenti_orths.txt go_terms_files other_id_files go-basic.obo /nfs/helminths02/analysis/50HGP/01INTERPRO/TRE.protein.fa.gz.fas.ipr.gz trichobilharzia_regenti_GO_terms.txt trichobilharzia_regenti_id_mapping.txt
This makes an output file called trichobilharzia_regenti_GO_terms.txt. 

Summarising the results
Count the number of genes with GO terms:
% grep TRE trichobilharzia_regenti_GO_terms.txt | grep 'GO:' | cut -f1 | sort | uniq | wc
8719 [compared to 6215 for S. mansoni]
Count the number of genes with GO terms that came from curations in other species:
% grep TRE trichobilharzia_regenti_GO_terms.txt | grep 'GO:' | grep -v InterPro | cut -f1 | sort | uniq | wc
   3113   

Count the number of genes with GO terms that came from InterPro domains:
% grep TRE trichobilharzia_regenti_GO_terms.txt | grep 'GO:' | grep InterPro | cut -f1 | sort | uniq | wc
      7912      

Count the number of distinct GO terms in the file:
%  grep TRE trichobilharzia_regenti_GO_terms.txt | grep 'GO:' | cut -f4 | tr ',' '\n' | sort | uniq | wc 
2510 [compared to 2673 for S. mansoni] 
Count the number of transfers from each reference species:
% grep TRE trichobilharzia_regenti_GO_terms.txt | grep 'GO:' | cut -f2 | grep -v InterPro | tr ',' '\n' | cut -d"(" -f2 | sort  | uniq -c
   2171 caenorhabditis_elegans)
   1979 danio_rerio)
   2735 drosophila_melanogaster)
   4560 homo_sapiens)



Notes:
- Bhavana's files are all in /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/
- Bhavana's notes on how to run the pipeline are in /nfs/helminths02/analysis/50HGP/00ANALYSES/final_GO_terms/00README

Thank you:
Thank you Bhavana Harsha for making nice notes on how to run the pipeline.

No comments: