Monday, 10 August 2015

pfam_scan.pl

I'm using pfam_scan.pl to scan a FASTA file of proteins for matches to Pfam domains. You can download pfam_scan.pl from the Pfam group's ftp site.

Getting Pfam HMMs

To run pfam_scan.pl you will need to download the following files from the Pfam ftp site
(you can get pfam_scan.pl from ftp://ftp.ebi.ac.uk/pub/databases/Pfam/Tools/; the files below are available from, for example, ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/ (go to the latest release)):


Pfam-A.hmm : Pfam-A library of HMMs
Pfam-A.hmm.dat  : contains info about each Pfam-A family
active_site.dat : contains active site info about each family (required for the -as option)


You will need to generate binary files for Pfam-A.hmm by running HMMER3's hmmpress command:
% hmmpress Pfam-A.hmm
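
Before running pfam_scan.pl it can be handy to check that the directory really contains everything listed above. Here is a minimal Python sketch of such a check. It is not part of pfam_scan.pl, and the function name missing_pfam_files is made up for illustration. It assumes that hmmpress has written its four binary index files (Pfam-A.hmm.h3f, .h3i, .h3m and .h3p) next to Pfam-A.hmm.

```python
import os

# Files named in this post, plus the four binary index files that
# hmmpress is assumed to write next to Pfam-A.hmm.
REQUIRED = ["Pfam-A.hmm", "Pfam-A.hmm.dat"]
HMMPRESS_SUFFIXES = [".h3f", ".h3i", ".h3m", ".h3p"]


def missing_pfam_files(pfam_dir, need_active_site=False):
    """Return a sorted list of expected files that are absent from pfam_dir."""
    expected = list(REQUIRED)
    expected += ["Pfam-A.hmm" + suffix for suffix in HMMPRESS_SUFFIXES]
    if need_active_site:
        # active_site.dat is only required for the -as option.
        expected.append("active_site.dat")
    return sorted(
        name for name in expected
        if not os.path.isfile(os.path.join(pfam_dir, name))
    )


if __name__ == "__main__":
    missing = missing_pfam_files(".", need_active_site=True)
    print("missing files:", ", ".join(missing) if missing else "none")
```

If the list is empty, the directory can be passed to pfam_scan.pl with the -dir option.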


Note that the current Pfam HMMs are in HMMER3 format, and that pfam_scan.pl works fine with HMMER3 format.

Also note that, according to Pfam's release notes, Pfam-B has been discontinued from release 28.0 onwards.

Running pfam_scan.pl

Usage:
pfam_scan.pl -fasta <fasta_file> -dir <directory location of Pfam files>
Useful options are:
-outfile <file>   : output file, otherwise send to STDOUT

e.g.
pfam_scan.pl -fasta temp.pep -dir .
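
If you want to launch the scan from a script rather than typing the command, the call can be wrapped in a few lines of Python. This is only a sketch: the function names are hypothetical, it uses just the options mentioned in this post (-fasta, -dir, -outfile and -as), and it assumes pfam_scan.pl is on your PATH.

```python
import subprocess


def build_pfam_scan_command(fasta, pfam_dir, outfile=None, active_sites=False):
    """Build the argument list for pfam_scan.pl from the options named in this post."""
    cmd = ["pfam_scan.pl", "-fasta", fasta, "-dir", pfam_dir]
    if outfile is not None:
        # Without -outfile, pfam_scan.pl sends its results to STDOUT.
        cmd += ["-outfile", outfile]
    if active_sites:
        # -as needs active_site.dat to be present in pfam_dir.
        cmd.append("-as")
    return cmd


def run_pfam_scan(fasta, pfam_dir, outfile=None, active_sites=False):
    """Run pfam_scan.pl (assumed to be on the PATH); raises CalledProcessError on failure."""
    cmd = build_pfam_scan_command(fasta, pfam_dir, outfile, active_sites)
    subprocess.run(cmd, check=True)


print(build_pfam_scan_command("temp.pep", "."))
# prints ['pfam_scan.pl', '-fasta', 'temp.pep', '-dir', '.']
```

Passing the arguments as a list, rather than as one shell string, avoids quoting problems if a file name contains spaces.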

Output from pfam_scan.pl  

Each output line contains the following information:

<seq id> <alignment start> <alignment end> <envelope start> <envelope end> <hmm acc> <hmm name> <type> <hmm start> <hmm end> <hmm length> <bit score> <E-value> <significance> <clan> <predicted_active_site_residues>
 

Example output (with -as option):
Y74C9A.3      12    225     11    225 PF05891.8   Methyltransf_PK   Family     2   218   218    312.5   1.2e-93   1 CL0063   
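
To work with the output in a script, each line can be split on whitespace and mapped onto the field names listed above. The Python below is a minimal sketch, not an official parser. The function and key names are my own, and it assumes that the first 15 columns never contain spaces, that the active-site column may be missing (as in the example line), and that blank lines or lines starting with '#' are headers to be skipped.

```python
FIELDS = [
    "seq_id", "alignment_start", "alignment_end", "envelope_start",
    "envelope_end", "hmm_acc", "hmm_name", "type", "hmm_start", "hmm_end",
    "hmm_length", "bit_score", "e_value", "significance", "clan",
]
INT_FIELDS = {
    "alignment_start", "alignment_end", "envelope_start", "envelope_end",
    "hmm_start", "hmm_end", "hmm_length", "significance",
}
FLOAT_FIELDS = {"bit_score", "e_value"}


def parse_pfam_scan_line(line):
    """Turn one pfam_scan.pl output line into a dict keyed by field name."""
    parts = line.split()
    if len(parts) < len(FIELDS):
        raise ValueError("expected at least %d columns, got %d" % (len(FIELDS), len(parts)))
    record = dict(zip(FIELDS, parts))
    for name in INT_FIELDS:
        record[name] = int(record[name])
    for name in FLOAT_FIELDS:
        record[name] = float(record[name])
    # Anything after the clan column is treated as the active-site prediction.
    extra = parts[len(FIELDS):]
    record["predicted_active_site_residues"] = " ".join(extra) if extra else None
    return record


def parse_pfam_scan_output(lines):
    """Yield one record per data line, skipping blank lines and '#' lines (an assumption)."""
    for line in lines:
        if line.strip() and not line.startswith("#"):
            yield parse_pfam_scan_line(line)


example = "Y74C9A.3      12    225     11    225 PF05891.8   Methyltransf_PK   Family     2   218   218    312.5   1.2e-93   1 CL0063   "
rec = parse_pfam_scan_line(example)
print(rec["seq_id"], rec["hmm_acc"], rec["clan"], rec["e_value"])
```

For a whole results file, something like `records = list(parse_pfam_scan_output(open("results.txt")))` would collect all the matches, where results.txt is a placeholder for whatever you passed to -outfile.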


Note that Pfam groups together families with a common evolutionary ancestor into clans. If there are overlapping matches within a clan, pfam_scan.pl only shows the most significant match (the one with the lowest E-value) within the clan.
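
To make the clan rule concrete, here is a small Python sketch of that kind of filtering. It is an illustration of the idea only, not pfam_scan.pl's actual code. It assumes that overlap is judged on the alignment coordinates, that each match is a dict with the keys shown, and that a match without a clan has clan set to None. The matches and family names in the example are made up.

```python
def overlaps(a, b):
    """True if two matches are on the same sequence and share at least one alignment position."""
    return (
        a["seq_id"] == b["seq_id"]
        and a["alignment_start"] <= b["alignment_end"]
        and b["alignment_start"] <= a["alignment_end"]
    )


def keep_best_per_clan(matches):
    """Among overlapping matches from the same clan, keep only the lowest E-value one."""
    kept = []
    # Visit matches from most to least significant, so that anything
    # already kept beats any later match it overlaps with.
    for match in sorted(matches, key=lambda m: m["e_value"]):
        clash = any(
            match["clan"] is not None
            and match["clan"] == other["clan"]
            and overlaps(match, other)
            for other in kept
        )
        if not clash:
            kept.append(match)
    return kept


# Made-up matches: the first two overlap and share a clan, the third does not overlap.
matches = [
    {"seq_id": "seqA", "alignment_start": 10, "alignment_end": 120,
     "hmm_name": "fam1", "clan": "CL0001", "e_value": 1e-20},
    {"seq_id": "seqA", "alignment_start": 15, "alignment_end": 110,
     "hmm_name": "fam2", "clan": "CL0001", "e_value": 1e-5},
    {"seq_id": "seqA", "alignment_start": 200, "alignment_end": 300,
     "hmm_name": "fam2", "clan": "CL0001", "e_value": 1e-8},
]
for m in keep_best_per_clan(matches):
    print(m["hmm_name"], m["alignment_start"], m["alignment_end"], m["e_value"])
```

In this example the weaker fam2 match at positions 15 to 110 is dropped because it overlaps the stronger fam1 match from the same clan, while the fam2 match at 200 to 300 is kept because it does not overlap anything.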


Memory and Run-time

I found that for a protein FASTA file of 5000 sequences (some C. elegans protein sequences), pfam_scan.pl needed about 1000 MB of memory (RAM) to run (I requested 2000 MB when I submitted it to our compute farm). It took about 1 hour and 20 minutes to run.
