avrilomics: Magdalena's functional annotation pipeline

My colleague Magdalena Zarowiecki has written a pipeline for functional annotation of the proteome of a newly sequenced species [note: this is only available to Sanger users at present]. The steps are:

1) Download Uniprot from http://www.uniprot.org/downloads (UniProt/SwissProt fasta file), and save as file uniprot.fa.

2) Run Magdalena's script uniprot_name.pl to clean up the UniProt names:
% /nfs/users/nfs_m/mz3/bin/perl/uniprot_name.pl uniprot.fa
This will make an output file uniprot.fa.renamed
Some of the proteins have been renamed, for example, O_ is added to the start of the names of human proteins, C_ to the start of the names of C. elegans proteins, U_ to the start of the names of mouse proteins, etc.

3) Run blastp against the UniProt database, using Martin Hunt's blast_splitter.py script:
% blast_splitter.py --protein_ref --splitmem=7 test.fa uniprot.fa.renamed ./blast_splitter 250000 -e 0.05 -p blastp -m8
where test.fa is your query fasta file of proteins that you want to annotate, and uniprot.fa.renamed is the renamed UniProt file from step 2. Magdalena suggested using the --splitmem=5 or --splitmem=7 option. The 250000 means that the test.fa file is split into smaller files of 250,000 residues (amino acids here) each, for running blast. The output from blast_splitter.py will be in a subdirectory (called 'blast_splitter' here), in a file called 'all.blast'.
Note: you don't need to 'bsub' the blast_splitter.py command.
[Note: Martin Hunt has now replaced blast_splitter.py by farm_blast]
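To make the '250000' argument concrete, here is a minimal Python sketch of splitting a protein fasta file into batches of at most a given number of residues. This is only my illustration of the idea, not the code in blast_splitter.py; the function names read_fasta and split_by_residues are made up for this example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_by_residues(records, max_residues):
    """Group (header, sequence) records into batches of at most max_residues.

    Assumption: a single sequence longer than max_residues is not cut up,
    it simply gets a batch of its own.
    """
    batches, current, total = [], [], 0
    for header, seq in records:
        if current and total + len(seq) > max_residues:
            batches.append(current)
            current, total = [], 0
        current.append((header, seq))
        total += len(seq)
    if current:
        batches.append(current)
    return batches
```

For example, split_by_residues(read_fasta(open("test.fa")), 250000) would give a list of batches, each of which could be written to its own small fasta file and blasted separately.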

4) For each query, take the top 10 blast hits of evalue <= 1e-5, and write their functional descriptions to a file:
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/top10blast.pl blast_splitter/all.blast uniprot.fa.renamed > blast.tab
The blast.tab file has functional descriptions for the blast query proteins (in test.fa), based on the blast hits:
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog A"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog B"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51"
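The filtering part of this step can be sketched in a few lines of Python. This is not top10blast.pl itself, just an illustration under the assumption that all.blast is in the standard 12-column tabular format produced by -m8 (query, subject, % identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score). The function name top_hits, and the choice to rank hits by bit score, are my own.

```python
from collections import defaultdict


def top_hits(blast_lines, max_evalue=1e-5, max_hits=10):
    """Return {query: [(subject, evalue, bitscore), ...]} for the best hits.

    Hits with an e-value above max_evalue are dropped; the remaining hits
    for each query are sorted by bit score (highest first) and truncated
    to max_hits.
    """
    hits = defaultdict(list)
    for line in blast_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue <= max_evalue:
            hits[query].append((subject, evalue, bitscore))
    return {
        query: sorted(query_hits, key=lambda hit: -hit[2])[:max_hits]
        for query, query_hits in hits.items()
    }
```

The subject names returned here would then be looked up in uniprot.fa.renamed to get the functional descriptions.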

5) Run Magdalena's script uniprot_clean.pl to tidy up the functional descriptions in the blast.tab file:
% /nfs/users/nfs_m/mz3/bin/perl/uniprot_clean.pl blast.tab blast.tab2
[Note: last time I tried this script, it had some problems, so I skipped it]
Sometimes (but not always) some functional descriptions will be different in blast.tab2 (eg. poor descriptions such as 'HC10323' are replaced by 'mz3').

6) Run Magdalena's script product_mangler.pl to combine the functional descriptions of different blast hits for the same query protein:
% /nfs/users/nfs_m/mz3/bin/perl/product_mangler.pl blast.tab2 blast.tab3
The output looks like this:
WORSE1: SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51"    45.8
BEST1: SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1" 383.6
WORSE1: SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog A" 54.8
WORSE1: SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog B" 55
############## ROUND 1 ################
############## ROUND 2 ################
############## ROUND 3 ################

Here is another example:
BEST1: SRAE_1000001300.t2:mRNA /product=" Cyclin-dependent kinase-like 1"      40
BEST1: SRAE_1000001300.t2:mRNA /product=" Cyclin-dependent kinase-like 2"      40
WORSE1: SRAE_1000001300.t2:mRNA /product=" Cyclin-dependent kinase-like 4"      20
############## ROUND 1 ################
BEST2: SRAE_1000001300.t2:mRNA /product=" cyclin dependent kinase-like 2"      1
BEST2: SRAE_1000001300.t2:mRNA /product=" cyclin dependent kinase-like 1"      1
############## ROUND 2 ################
BEST3: SRAE_1000001300.t2:mRNA /product=" cyclin dependent kinase-like"        2
############## ROUND 3 ################

Here the final description comes in ROUND 3, and is labelled as 'BEST3'. Sometimes a protein doesn't improve past ROUND 1, so its best description is labelled as 'BEST1'.
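As a rough illustration of how to read this output, here is a small Python sketch that keeps, for each protein, the 'BEST' description from the latest round. The line format is taken from the examples above; the function name final_descriptions and the tie-breaking rule (higher score wins, then the first line seen) are my assumptions, not necessarily what Magdalena's scripts do.

```python
import re

# Matches lines such as:
# BEST1: SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1" 383.6
LINE_RE = re.compile(
    r'^(BEST|WORSE)(\d+):\s+(\S+)\s+/product="\s*(.*?)"\s+([\d.]+)\s*$'
)


def final_descriptions(lines):
    """Return {protein: description} using the highest-round BEST line."""
    best = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or match.group(1) != "BEST":
            continue  # skip WORSE lines and the '### ROUND n ###' separators
        round_number = int(match.group(2))
        protein, product = match.group(3), match.group(4)
        score = float(match.group(5))
        key = (round_number, score)
        if protein not in best or key > best[protein][0]:
            best[protein] = (key, product)
    return {protein: value[1] for protein, value in best.items()}
```

On the second example above, this picks the BEST3 description, and on the first example it picks the only BEST1 line.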

7) Optional: run blast against the GenBank (nr) database: Magdalena said that, as an alternative (or in addition) to running blast against UniProt, you could run blast against the GenBank (nr) database.

To get the functional annotation from the GenBank file you need to use Magdalena's script:
/nfs/users/nfs_m/mz3/bin/perl/genbank_get_products.pl [takes the entire GenBank file, and parses out the product names]
Then clean up the product descriptions using her script:
/nfs/users/nfs_m/mz3/bin/perl/genbank_clean.pl

Then add the names to your products (after you have run blast), using her script:
/nfs/users/nfs_m/mz3/bin/perl/genbank2similarity.pl
Now choose amongst the best product names, based on the top blast hits:
/nfs/users/nfs_m/mz3/bin/perl/product_mangler.pl

Magdalena said you can run blast against UniProt and GenBank and merge together the results if you wish.

8) Run pfamscan to predict functions of proteins for which we don't have any functional prediction, or only the prediction "hypothetical":
First make a fasta file of the proteins that don't have any functional prediction:
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/getprotswithoutannotn.pl test.fa > test2.fa
[Note: at the moment this script doesn't take proteins marked 'hypothetical'].

Now run pfamscan using the protein fasta file as query, using Magdalena's script pfamscan_splitter.pl:
% /nfs/users/nfs_m/mz3/bin/perl/pfamscan_splitter.pl test2.fa testpfam 500
where test2.fa is your protein fasta file, and testpfam is the prefix you want to give to the output files.
[Note: pfamscan_splitter.pl is not yet available on farm3, so you must run it on farm2, using a copy in ~alc/Documents/PerlScripts/]

The query file test2.fa is broken up into several smaller files for running pfamscan, and in this case 500 is the number of bytes to put in each smaller file (see here for how to work out the number of bytes to put here).

The output files will be called testpfam_1.pfam, testpfam_2.pfam, etc. They will look like this:
# <seq id> <alignment start> <alignment end> <envelope start> <envelope end> <hmm acc> <hmm name> <type> <hmm start> <hmm end> <hmm length> <bit score> <E-value> <significance> <clan>

SRAE_2000357600.t1:mRNA     13     82      8     83 PB003712    Pfam-B_3712       Pfam-B    13    82    93     39.2     9e-10 NA NA
SRAE_2000311000.t1:mRNA     78    330     76    331 PF08423.6   Rad51             Domain     3   255   256    368.2 1.3e-110   1 CL0023
. . .
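If you want to work with this output yourself, the columns are whitespace-separated, so it is easy to parse. Here is a minimal Python sketch, using the 15 column names from the header line above; the function name parse_pfamscan and the short field names are my own choices.

```python
# Field names follow the '# <seq id> <alignment start> ...' header line.
COLUMNS = [
    "seq_id", "aln_start", "aln_end", "env_start", "env_end",
    "hmm_acc", "hmm_name", "type", "hmm_start", "hmm_end",
    "hmm_length", "bit_score", "evalue", "significance", "clan",
]


def parse_pfamscan(lines):
    """Return a list of dicts, one per pfamscan hit.

    Comment lines (starting with '#'), blank lines, and any line that does
    not have exactly 15 fields are skipped. All values are kept as strings.
    """
    rows = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue
        rows.append(dict(zip(COLUMNS, fields)))
    return rows
```

For example, [row for row in parse_pfamscan(open("testpfam_1.pfam")) if row["hmm_acc"].startswith("PF")] would keep just the Pfam-A hits.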

Now make a file with the existing best annotation for the proteins:
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/makefunctiontable.pl test2.fa > test2.fa.txt
Put the pfam results in a file:
% grep -v "#" testpfam_1.pfam | grep 'PF' > pfam_results
% grep -v "#" testpfam_1.pfam | grep 'PB' >> pfam_results
% cut -d":" -f2-100 pfam_results > pfam_results2
Now get the product names from the pfamscan output using Magdalena's script product_from_Pfamscan.pl:
% /nfs/users/nfs_m/mz3/bin/perl/product_from_Pfamscan.pl pfam_results2 test2.fa.txt mypfam
This makes files 'mypfam.domains', 'mypfam.errors', and 'mypfam.products'. 'mypfam.products' is like this:
SRAE_2000357600.t1:mRNA /product="Pfam-B_3712 domain containing protein"        /note="Pfam"

Magdalena said the protein is given a name according to the domain it contains, eg. 'WAP-domain-containing protein'. If there are several domains, it is 'WAP and AR domain containing'.
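The naming rule can be sketched like this in Python. This is only my guess at the general idea, based on the examples above and the 'mypfam.products' line; the function name product_from_domains, the handling of three or more domains, and the fallback name for a protein with no domains are all assumptions.

```python
def product_from_domains(domains):
    """Build a product name from a list of Pfam domain names.

    Repeated domains are only mentioned once, in the order first seen.
    """
    unique = list(dict.fromkeys(domains))
    if not unique:
        return "hypothetical protein"  # assumption: fallback for no domains
    if len(unique) == 1:
        joined = unique[0]
    else:
        joined = ", ".join(unique[:-1]) + " and " + unique[-1]
    return f"{joined} domain containing protein"
```

With this sketch, a protein whose only domain is Pfam-B_3712 gets the name 'Pfam-B_3712 domain containing protein', as in the example line above.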

9) Optional: get GO annotation from the pfamscan output:
First download the table of GO terms to Pfam domains from http://www.geneontology.org/external2go/pfam2go, and convert it to a tab-delimited table:
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/makepfamtogotable.pl pfam2go > pfam2go.tab
Then run Magdalena's script pfam2GO_genes.pl:
% /nfs/users/nfs_m/mz3/bin/perl/pfam2GO_genes.pl pfam2go.tab testpfam_1.pfam
This makes a file testpfam_1.pfam.out.
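To show what the Pfam-to-GO lookup involves, here is a small Python sketch that reads the pfam2go file directly. It assumes the usual external2go layout, where comment lines start with '!' and each mapping line has the form 'Pfam:<accession> <name> > GO:<term name> ; GO:<id>'. It is not a description of what makepfamtogotable.pl or pfam2GO_genes.pl do, and the function names are my own.

```python
def parse_pfam2go(lines):
    """Return {pfam_accession: [(go_id, go_term_name), ...]}."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):
            continue  # header/comment lines start with '!'
        left, _, right = line.partition(" > ")
        go_name, _, go_id = right.rpartition(" ; ")
        pfam_acc = left.split()[0].replace("Pfam:", "", 1)
        go_name = go_name.replace("GO:", "", 1).strip()
        mapping.setdefault(pfam_acc, []).append((go_id.strip(), go_name))
    return mapping


def go_terms_for(hmm_acc, mapping):
    """Look up GO terms for a pfamscan accession such as 'PF08423.6'.

    The version suffix after the '.' is dropped, since pfam2go uses
    unversioned accessions.
    """
    return mapping.get(hmm_acc.split(".")[0], [])
```

A domain with no entry in pfam2go simply gives an empty list.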

Now make a gff containing all the pfam domains as features, using Magdalena's pfam2gff_n_fasta.pl script:
% /nfs/users/nfs_m/mz3/bin/perl/pfam2gff_n_fasta.pl testpfam_1.pfam test2.fa
This makes a file testpfam_1.pfam.gff which looks like this:
SRAE_2000357600.t1:mRNA domain gene    8       83      .       +       .       ID=SRAE_2000357600.t1:mRNA-Pfam-B_3712:1
SRAE_2000357600.t1:mRNA domain CDS     8       83      .       +       .       ID=SRAE_2000357600.t1:mRNA-Pfam-B_3712:1:exon:1;Parent=SRAE_2000357600.t1:mRNA-Pfam-B_3712:1

10) Optional: run interproscan to predict functions of proteins for which we don't have any functional prediction, or just prediction "hypothetical":
Magdalena said that, as an alternative or additional step to running pfamscan, you could run interproscan (see here).
To run interproscan, Magdalena suggested using the script
~/bin/perl/interpro_scan_splitter.pl
Then to parse the results you can use:
/nfs/users/nfs_m/mz3/bin/perl/product_from_interpro.pl
/nfs/users/nfs_m/mz3/bin/perl/parse_interpro.pl

[Alternatively, if you have a gff file of interproscan results for all proteins in test.fa:
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/interpro_gff_to_tab.pl /lustre/scratch108/parasites/jc17/Onchocerca/OVOC_v3.protein.interproscan.gff > interproscan
Then:
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/getprotswithoutannotn.pl test.fa > test2.fa
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/makefunctiontable.pl test2.fa > test2.fa.txt
% /nfs/users/nfs_m/mz3/bin/perl/product_from_Pfamscan.pl interproscan test2.fa.txt mypfam
see above]

11) Finally, combine the functional annotations from blast and pfamscan:
First pull out the best annotation for each protein from the blast file (blast.tab3):
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/getbestblastannotn.pl blast.tab3 > blast.tab4
Concatenate the functional predictions from pfam and blast:
% cat mypfam.products blast.tab4 > functions1

Now use Magdalena's script product_chooser.pl:
% /nfs/users/nfs_m/mz3/bin/perl/product_chooser.pl functions1 functions2
The output file 'functions2' looks like this:
SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"
SRAE_2000357600.t1:mRNA /product="Pfam-B_3712 domain containing protein"

Magdalena said that the product chooser takes in several different functional annotations for a protein, and assigns a score to each alternative functional annotation. It tries to make the highest-scoring ones more similar to each other (eg. by changing lowercase to uppercase, changing word order, removing the last word, etc.).

Magdalena said that if 3 of the functional annotations for a protein are 'hypothetical', and 7 say something different (and agree with each other), it will give the latter annotation. However, if 7 of the annotations are 'hypothetical' and the other 3 all disagree with each other, the final annotation is 'hypothetical'.
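The behaviour in these two examples is what you would get from a simple vote among the alternative annotations. Here is a minimal Python sketch of that idea. It is a much cruder stand-in for product_chooser.pl (which scores the alternatives and tries to make them more similar first), and the function name choose_product is made up.

```python
from collections import Counter


def choose_product(annotations):
    """Return the most common annotation, or None if the list is empty.

    Annotations are compared ignoring case and extra whitespace; the first
    spelling seen is the one returned. Ties go to the annotation seen first.
    """
    counts = Counter()
    first_spelling = {}
    for annotation in annotations:
        key = " ".join(annotation.lower().split())
        counts[key] += 1
        first_spelling.setdefault(key, annotation.strip())
    if not counts:
        return None
    winner, _ = counts.most_common(1)[0]
    return first_spelling[winner]
```

With 3 'hypothetical' annotations and 7 identical informative ones, the informative one wins; with 7 'hypothetical' and 3 that all differ, 'hypothetical' wins.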

Magdalena said that if you have additional annotation files (eg. with expression information, or saying which proteins are conserved based on all-versus-all blastp or ortho-mcl), then you could merge this information too with product_chooser.pl. So even if a protein doesn't have any blast or Pfam match, it could be called 'conserved expressed transcript'.

12) Add the functional annotations to the fasta file of proteins:
% perl -w ~alc/Documents/PerlScripts/mz3_annotation/addfunctionstofasta.pl functions2 test.fa > test.fa_v2
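For anyone without access to this script, the idea can be sketched in Python as below. It assumes the 'functions2' format shown in step 11 (protein name, then /product="..."), and it simply appends the product to the matching fasta header line. The exact header format written by addfunctionstofasta.pl may well differ, and the function names here are my own.

```python
import re

# Matches lines such as:
# SRAE_2000311000.t1:mRNA /product=" DNA repair protein RAD51 homolog 1"
PRODUCT_RE = re.compile(r'^(\S+)\s+/product="\s*(.*?)\s*"')


def read_products(lines):
    """Return {protein_name: product} from lines in the functions2 format."""
    products = {}
    for line in lines:
        match = PRODUCT_RE.match(line)
        if match:
            products[match.group(1)] = match.group(2)
    return products


def annotate_fasta(fasta_lines, products):
    """Return fasta lines with the product appended to matching headers.

    The protein name is taken to be the first word after '>'. Headers with
    no product, and all sequence lines, are passed through unchanged.
    """
    out = []
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            if name in products:
                line = f"{line} {products[name]}"
        out.append(line)
    return out
```

For example, annotate_fasta(open("test.fa"), read_products(open("functions2"))) gives the annotated lines, which can then be written to a new fasta file.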

Thanks to Magdalena Zarowiecki for help using her scripts.


Friday, 2 August 2013
