I've written a pipeline to transfer GO terms to their orthologs in other species. Here are some notes to remind myself how to use it:
To run it for a new species (eg. T. regenti):
- Make a directory eg. /lustre/scratch108/parasites/alc/000_50HG_GO/tregenti
- Make a link to the file with the orthologs in the reference species:
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/trichobilharzia_regenti_orths.txt
- Make a link to the file with a protein list for this species:
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/trichobilharzia_regenti_protein_list.txt
- Make a file species_list.txt with just the name of the species of interest, eg. trichobilharzia_regenti
- Make a link to the file with locus tag prefixes:
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/species_locustagprefix
- Make links to the GO annotation and ontology files:
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/gene_association.fb
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/gene_association.goa_human
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/gene_association.WS243.wb.c_elegans
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/gene_association.zfin
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/go-basic.obo
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/go_terms_files
- Make links to files with mappings between identifiers:
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/ensembl_zfish_ids
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/uniprot_human_ids2
% ln -s /lustre/scratch108/parasites/bh4/Compara/50HG_Compara_75_Final/final_id_mapping/trichobilharzia_regenti_id_mapping.txt
% ln -s /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/other_id_files
- Now run the script to assign GO terms to genes:
% /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/assign_go_terms_wrapper.sh species_list.txt > bsub_go_terms_jobs
This does:
% bsub -q small -E 'test -e /nfs/users/nfs_a/alc' -R "select[mem>200] rusage[mem=200]" -M200 -o go_terms.o -e go_terms.e -J GO_terms_trichobilharzia_regenti assign_go_terms_to_genes.py trichobilharzia_regenti trichobilharzia_regenti_protein_list.txt trichobilharzia_regenti_orths.txt go_terms_files other_id_files go-basic.obo /nfs/helminths02/analysis/50HGP/01INTERPRO/TRE.protein.fa.gz.fas.ipr.gz trichobilharzia_regenti_GO_terms.txt trichobilharzia_regenti_id_mapping.txt
This makes an output file called trichobilharzia_regenti_GO_terms.txt.
Summarising the results
Count the number of genes with GO terms:
% grep TRE trichobilharzia_regenti_GO_terms.txt | grep 'GO:' | cut -f1 | sort | uniq | wc
8719 [compared to 6215 for S. mansoni]
Count the number of genes with GO terms that came from curations in other species:
% grep TRE trichobilharzia_regenti_GO_terms.txt | grep 'GO:' | grep -v InterPro | cut -f1 | sort | uniq | wc
3113
Count the number of genes with GO terms that came from InterPro domains:
% grep TRE trichobilharzia_regenti_GO_terms.txt | grep 'GO:' | grep InterPro | cut -f1 | sort | uniq | wc
7912
Count the number of distinct GO terms in the file:
% grep TRE trichobilharzia_regenti_GO_terms.txt | grep 'GO:' | cut -f4 | tr ',' '\n' | sort | uniq | wc
2510 [compared to 2673 for S. mansoni]
Count the number of transfers from each reference species:
% grep TRE trichobilharzia_regenti_GO_terms.txt | grep 'GO:' | cut -f2 | grep -v InterPro | tr ',' '\n' | cut -d"(" -f2 | sort | uniq -c
2171 caenorhabditis_elegans)
1979 danio_rerio)
2735 drosophila_melanogaster)
4560 homo_sapiens)
Notes:
- Bhavana's files are all in /lustre/scratch108/parasites/bh4/50HGI_FuncAnnotation_GO_terms/
- Bhavana's notes on how to run the pipeline are in /nfs/helminths02/analysis/50HGP/00ANALYSES/final_GO_terms/00README
Thank you:
Thank you Bhavana Harsha for making nice notes on how to run the pipeline.
No comments:
Post a Comment