Wednesday, 3 July 2013

Running the path-dev functional annotation pipeline

This is only of interest to Sanger users, as the pipeline is only available on the Sanger farm. The path-dev group (Jacqueline Keane's team) have written a script called annotate_bacteria for annotating bacterial genomes. It is based on Prokka and is tailored for bacteria, archaea and viruses. It takes an assembly as input, identifies ORFs with Prodigal, and predicts RNA genes with RNAmmer and Aragorn.

It then predicts the functions of the ORFs by running BLAST against a database of proteins from RefSeq and UniProt (by default, bacterial, archaeal and viral proteins) and by comparing them to domain databases (Pfam-A, CDD). It also runs SignalP to predict signal peptides. It adds evidence codes to the description lines to show the source of each functional annotation.

To run it on farm3, type:
% annotate_bacteria -a assembly.fa --dbdir /lustre/scratch108/pathogen/pathpipe/prokka --sample_name MyExample
where assembly.fa is your input assembly, /lustre/scratch108/pathogen/pathpipe/prokka is the directory containing the sequence databases to run BLAST against, and MyExample is the label to give to the job.
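
If you have several assemblies to annotate, it can help to build the command line in a script rather than typing it each time. Below is a minimal Python sketch that assembles the argument list using only the options mentioned in this post (-a, --dbdir, --sample_name, and the optional -e described in the notes). The function name build_annotate_command is my own invention for illustration, and I am assuming the script takes no other mandatory options.

```python
import shlex


def build_annotate_command(assembly, dbdir, sample_name, interproscan=None):
    """Return the annotate_bacteria argument list for one assembly.

    Only options mentioned in the post are used; the helper itself is
    hypothetical and not part of the path-dev pipeline.
    """
    cmd = [
        "annotate_bacteria",
        "-a", assembly,
        "--dbdir", dbdir,
        "--sample_name", sample_name,
    ]
    if interproscan is not None:
        # Optional: path to a specific interproscan.sh (the -e option).
        cmd += ["-e", interproscan]
    return cmd


cmd = build_annotate_command(
    "assembly.fa",
    "/lustre/scratch108/pathogen/pathpipe/prokka",
    "MyExample",
)
# Print the command exactly as it would be typed at the prompt.
print(shlex.join(cmd))

# On farm3 the command could then be launched with, for example:
#   import subprocess
#   subprocess.run(cmd, check=True)
```

Keeping the command as a list (rather than one long string) avoids quoting problems if a path contains spaces.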

The output appears in a subdirectory called 'annotation'. There is a file called MyExample.tbl that contains a summary of the annotation, e.g.:
>Feature HelminthExample|SC|contig000001
1       1254    CDS
                        EC_number       3.4.24.76
                        inference       ab initio prediction:Prodigal:2.60
                        inference       similar to AA sequence:UniProtKB:Q47899
                        inference       protein motif:Pfam:PF01400.18
                        locus_tag       HelminthExample_00001
                        product Flavastacin precursor
                        product Astacin (Peptidase family M12A)
                        protein_id      gnl|SC|HelminthExample_00001
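
The .tbl file is easy to read into a script, for example to count which evidence sources support the annotations. Here is a rough Python sketch, assuming the layout shown above: a '>Feature' line per contig, an unindented 'start stop type' line per feature, and indented 'key value' qualifier lines. The function name parse_tbl and the dictionary layout are my own choices, not something the pipeline provides, and multi-interval features are not handled.

```python
from collections import Counter

EXAMPLE_TBL = """\
>Feature HelminthExample|SC|contig000001
1\t1254\tCDS
\t\t\tEC_number\t3.4.24.76
\t\t\tinference\tab initio prediction:Prodigal:2.60
\t\t\tinference\tsimilar to AA sequence:UniProtKB:Q47899
\t\t\tinference\tprotein motif:Pfam:PF01400.18
\t\t\tlocus_tag\tHelminthExample_00001
\t\t\tproduct\tFlavastacin precursor
\t\t\tproduct\tAstacin (Peptidase family M12A)
\t\t\tprotein_id\tgnl|SC|HelminthExample_00001
"""


def parse_tbl(lines):
    """Parse feature-table lines into a list of feature dictionaries."""
    features = []
    contig = None
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith(">Feature"):
            # Start of a new contig; the rest of the line is its name.
            contig = line.split(None, 1)[1].strip()
            current = None
        elif not line[0].isspace():
            fields = line.split()
            if len(fields) >= 3:
                # Unindented line: start, stop and feature type.
                # '<' or '>' mark partial coordinates and are stripped.
                current = {
                    "contig": contig,
                    "start": int(fields[0].lstrip("<>")),
                    "stop": int(fields[1].lstrip("<>")),
                    "type": fields[2],
                    "qualifiers": [],
                }
                features.append(current)
        elif current is not None:
            # Indented line: qualifier key followed by its value.
            parts = line.strip().split(None, 1)
            value = parts[1] if len(parts) > 1 else ""
            current["qualifiers"].append((parts[0], value))
    return features


features = parse_tbl(EXAMPLE_TBL.splitlines())

# Tally the evidence sources named in the 'inference' qualifiers.
sources = Counter()
for feature in features:
    for key, value in feature["qualifiers"]:
        if key == "inference":
            pieces = value.split(":")
            if len(pieces) > 1:
                sources[pieces[1]] += 1

for name, count in sorted(sources.items()):
    print(name, count)
```

Qualifiers are kept as a list of pairs rather than a dictionary because a key such as 'product' or 'inference' can appear more than once for the same feature, as it does in the example above.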


Notes:
- This runs fine on farm3, but not on farm2. 
- Prodigal does not seem to predict partial genes (lacking a start and/or stop codon).
- Contigs in your input assembly.fa that are <200 bp are discarded.
- The RNA gene prediction step takes a long time.
- If you want to run a particular version of InterProScan, you can do this with the -e option, e.g. -e /software/pathogen/external/apps/usr/local/iprscan-5.0.7/interproscan.sh
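
Since contigs under 200 bp are dropped, it can be useful to check beforehand which contigs in your assembly will go missing from the annotation. The following is a small Python sketch that reads FASTA lines and reports the sequences shorter than a threshold. The function name short_contigs is hypothetical, and I am assuming the cut-off is a simple 'length < 200' test on the sequence as given in the note; the pipeline's exact rule may differ.

```python
def short_contigs(fasta_lines, min_length=200):
    """Return {contig_name: length} for sequences shorter than min_length."""
    lengths = {}
    name = None
    for raw in fasta_lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the contig name.
            name = (line[1:].split() or [""])[0]
            lengths[name] = 0
        elif name is not None:
            # Sequence may be wrapped over several lines; add them up.
            lengths[name] += len(line)
    return {n: length for n, length in lengths.items() if length < min_length}


# Toy input: contig1 is 240 bp, contig2 is 60 bp split over two lines.
toy_fasta = [
    ">contig1 first example",
    "ACGT" * 60,
    ">contig2",
    "ACGT" * 10,
    "ACGT" * 5,
]
dropped = short_contigs(toy_fasta)
print(dropped)

# For a real assembly, pass an open file handle instead:
#   with open("assembly.fa") as handle:
#       print(short_contigs(handle))
```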
