Monday 10 August 2015

I'm using a script from the Pfam group to scan a FASTA file of proteins for matches to Pfam domains. You can download the script from the Pfam group's FTP site.

Getting Pfam HMMs

To run the script, you will need to download the following files from the Pfam FTP site (go to the latest release):

Pfam-A.hmm : Pfam-A library of HMMs
Pfam-A.hmm.dat  : contains info about each Pfam-A family
active_site.dat : contains active site info about each family (required for the -as option)

You will need to generate binary files for Pfam-A.hmm by running the following command:
% hmmpress Pfam-A.hmm
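
Before starting a long run, it can help to confirm that the Pfam directory holds everything listed above. The following is a minimal sketch, not part of the Pfam tools: the function name `missing_pfam_files` is made up for illustration, and it assumes hmmpress has written its four index files (.h3f, .h3i, .h3m, .h3p) next to Pfam-A.hmm.

```python
from pathlib import Path

# Files named in this post. active_site.dat is only needed for the -as option.
REQUIRED = ["Pfam-A.hmm", "Pfam-A.hmm.dat", "active_site.dat"]
# Suffixes of the binary index files that hmmpress writes.
PRESSED_SUFFIXES = [".h3f", ".h3i", ".h3m", ".h3p"]


def missing_pfam_files(pfam_dir):
    """Return the names of expected files that are absent from pfam_dir."""
    pfam_dir = Path(pfam_dir)
    expected = REQUIRED + ["Pfam-A.hmm" + suffix for suffix in PRESSED_SUFFIXES]
    return [name for name in expected if not (pfam_dir / name).is_file()]


if __name__ == "__main__":
    # "." is a placeholder for wherever the Pfam files were downloaded.
    for name in missing_pfam_files("."):
        print("missing:", name)
```

An empty result means all seven files are present. Any .h3* names in the list suggest that hmmpress still needs to be run.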

Note that the current Pfam HMMs are in HMMER3 format, and that the script works fine with HMMER3 format.

Also note that, according to Pfam's release notes, Pfam-B has been discontinued from release 28.0 onwards.


Usage: -fasta <fasta_file> -dir <directory location of Pfam files>
Useful options are:
-outfile <file>   : output file, otherwise send to STDOUT

eg. -fasta temp.pep -dir .
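
If you launch the scan from Python rather than typing the command, the arguments can be assembled as a list. This is only a sketch: `build_scan_command` is a hypothetical helper, and the script location is passed in as a parameter because it depends on where you saved the download. The flags are the ones mentioned in this post (-fasta, -dir, -outfile, -as).

```python
def build_scan_command(script_path, fasta_file, pfam_dir, outfile=None, active_sites=False):
    """Build the argument list for one scan, using only flags described in the post."""
    cmd = [str(script_path), "-fasta", str(fasta_file), "-dir", str(pfam_dir)]
    if outfile is not None:
        # Without -outfile the results go to STDOUT.
        cmd += ["-outfile", str(outfile)]
    if active_sites:
        # -as needs active_site.dat in the Pfam directory.
        cmd.append("-as")
    return cmd


if __name__ == "__main__":
    # "path/to/scan_script" is a placeholder, not a real path.
    print(build_scan_command("path/to/scan_script", "temp.pep", ".", outfile="temp.out"))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)`, which avoids shell quoting problems with unusual file names.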

Output

Each output line contains the following information:

<seq id> <alignment start> <alignment end> <envelope start> <envelope end> <hmm acc> <hmm name> <type> <hmm start> <hmm end> <hmm length> <bit score> <E-value> <significance> <clan> <predicted_active_site_residues>

Example output (with -as option):
Y74C9A.3      12    225     11    225 PF05891.8   Methyltransf_PK   Family     2   218   218    312.5   1.2e-93   1 CL0063   
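
To work with the results in a program, each line can be split on whitespace and mapped onto the columns listed above. The following is a rough sketch rather than an official parser: the names `PfamHit` and `parse_hit` are made up here, and it assumes the active-site column is simply absent when no residues are predicted, as in the example line.

```python
from typing import NamedTuple, Optional


class PfamHit(NamedTuple):
    seq_id: str
    ali_start: int
    ali_end: int
    env_start: int
    env_end: int
    hmm_acc: str
    hmm_name: str
    type: str
    hmm_start: int
    hmm_end: int
    hmm_length: int
    bit_score: float
    evalue: float
    significance: int
    clan: str
    active_sites: Optional[str]  # None when the last column is missing


def parse_hit(line):
    """Parse one whitespace-separated output line into a PfamHit."""
    f = line.split()
    if len(f) < 15:
        raise ValueError("expected at least 15 columns, got %d" % len(f))
    return PfamHit(
        f[0], int(f[1]), int(f[2]), int(f[3]), int(f[4]),
        f[5], f[6], f[7], int(f[8]), int(f[9]), int(f[10]),
        float(f[11]), float(f[12]), int(f[13]), f[14],
        f[15] if len(f) > 15 else None,
    )


if __name__ == "__main__":
    example = ("Y74C9A.3      12    225     11    225 PF05891.8   Methyltransf_PK   "
               "Family     2   218   218    312.5   1.2e-93   1 CL0063   ")
    hit = parse_hit(example)
    print(hit.seq_id, hit.hmm_acc, hit.evalue, hit.clan)
```

Converting the coordinates and scores to numbers up front makes later filtering, for example by E-value, straightforward.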

Note that Pfam groups together families with a common evolutionary ancestor into clans. If there are overlapping matches within a clan, the script only shows the most significant match (the one with the lowest E-value) within that clan.
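
The clan rule can be illustrated with a small sketch. This is not the script's own code, just one straightforward way to get the same effect under stated assumptions: each hit is a plain dict with made-up keys, two hits overlap when they share at least one residue on the same sequence, and a family with no clan has `clan` set to None. The hits in the example are invented for illustration.

```python
def overlaps(a, b):
    """True if two hits lie on the same sequence and share at least one residue."""
    return a["seq_id"] == b["seq_id"] and a["start"] <= b["end"] and b["start"] <= a["end"]


def best_per_clan(hits):
    """Keep a hit unless a better-scoring hit from the same clan overlaps it."""
    kept = []
    # Visit hits from most to least significant (lowest E-value first).
    for hit in sorted(hits, key=lambda h: h["evalue"]):
        clash = any(
            hit["clan"] is not None and hit["clan"] == k["clan"] and overlaps(hit, k)
            for k in kept
        )
        if not clash:
            kept.append(hit)
    return kept


if __name__ == "__main__":
    # Invented hits: A and B overlap within one clan, C is in another clan.
    hits = [
        {"name": "A", "seq_id": "s1", "start": 10, "end": 100, "clan": "CL0001", "evalue": 1e-5},
        {"name": "B", "seq_id": "s1", "start": 50, "end": 150, "clan": "CL0001", "evalue": 1e-20},
        {"name": "C", "seq_id": "s1", "start": 60, "end": 120, "clan": "CL0002", "evalue": 1e-3},
    ]
    print([h["name"] for h in best_per_clan(hits)])
```

Here A is dropped because B overlaps it, belongs to the same clan, and has a lower E-value. C is kept because it belongs to a different clan.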

Memory and Run-time

I found that for a protein FASTA file of 5000 sequences (some C. elegans protein sequences), the script needed about 1000 MB of memory (RAM) to run (I requested 2000 MB when I submitted it to our compute farm). It took about 1 hour and 20 minutes to run.
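
As a back-of-the-envelope check, 5000 sequences in 80 minutes is just under one second per sequence. The sketch below turns that into a rough estimate for other file sizes. It assumes run-time grows linearly with the number of sequences, which I have not verified. Sequence length and hardware will also matter, so treat the numbers as a guide for choosing farm limits, not a prediction.

```python
def seconds_per_sequence(n_sequences, minutes):
    """Average wall-clock seconds spent per sequence in a finished run."""
    return minutes * 60 / n_sequences


def estimated_minutes(n_sequences, rate_seconds):
    """Projected run-time in minutes, assuming linear scaling (an assumption)."""
    return n_sequences * rate_seconds / 60


if __name__ == "__main__":
    rate = seconds_per_sequence(5000, 80)  # the run described in this post
    print("seconds per sequence:", rate)
    print("estimated minutes for 20000 sequences:", round(estimated_minutes(20000, rate)))
```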
