I use pfam_scan.pl to scan a FASTA file of proteins for matches to Pfam domains. You can download pfam_scan.pl from the Pfam group's FTP site.
Getting Pfam HMMs
To run pfam_scan.pl, you will need to download the following files from the Pfam FTP site
(pfam_scan.pl itself is at ftp://ftp.ebi.ac.uk/pub/databases/Pfam/Tools/; the files below are under, for example, ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/, in the directory for the latest release):
Pfam-A.hmm : Pfam-A library of HMMs
Pfam-A.hmm.dat : contains info about each Pfam-A family
active_site.dat : contains active site info about each family (required for the -as option)
You will need to generate binary files for Pfam-A.hmm by running the following command:
% hmmpress Pfam-A.hmm
Note that the current Pfam HMMs are in HMMER3 format, and that pfam_scan.pl works fine with HMMER3 format.
Also note that, according to Pfam's release notes, Pfam-B has been discontinued from release 28.0 onwards.
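Before running the scan, it can help to confirm that the Pfam directory holds everything listed above. The following is a minimal Python sketch, not part of pfam_scan.pl: the function name missing_pfam_files is made up for illustration, and it assumes hmmpress has written its usual four index files (.h3f, .h3i, .h3m, .h3p) next to Pfam-A.hmm.

```python
from pathlib import Path
from typing import List

# Files named in the download list above.
REQUIRED = ["Pfam-A.hmm", "Pfam-A.hmm.dat"]
# Binary index files that hmmpress is assumed to write next to Pfam-A.hmm.
PRESSED = ["Pfam-A.hmm" + suffix for suffix in (".h3f", ".h3i", ".h3m", ".h3p")]
# Only needed when the -as option is used.
ACTIVE_SITE = "active_site.dat"


def missing_pfam_files(pfam_dir: str, need_active_site: bool = False) -> List[str]:
    """Return the names of expected files that are absent from pfam_dir."""
    names = REQUIRED + PRESSED
    if need_active_site:
        names = names + [ACTIVE_SITE]
    return [name for name in names if not (Path(pfam_dir) / name).is_file()]


if __name__ == "__main__":
    # An empty list means the directory looks ready for pfam_scan.pl.
    print(missing_pfam_files(".", need_active_site=True))
```

An empty list means nothing is missing; any names that come back tell you whether to re-download a file or re-run hmmpress.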
To run pfam_scan.pl, type:
pfam_scan.pl -fasta <fasta_file> -dir <directory location of Pfam files>
Useful options are:
-outfile <file> : output file, otherwise send to STDOUT
For example:
pfam_scan.pl -fasta temp.pep -dir .
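If you want to launch the scan from a script rather than by hand, the command line above can be assembled programmatically. This is a rough Python sketch that uses only the options mentioned in this post (-fasta, -dir, -outfile, -as); the function names build_command and run_pfam_scan are my own, and it assumes pfam_scan.pl is executable and on your PATH.

```python
import subprocess
from typing import List, Optional


def build_command(fasta: str, pfam_dir: str,
                  outfile: Optional[str] = None,
                  active_sites: bool = False) -> List[str]:
    """Assemble the pfam_scan.pl argument list from the options described above."""
    cmd = ["pfam_scan.pl", "-fasta", fasta, "-dir", pfam_dir]
    if outfile:
        cmd += ["-outfile", outfile]   # otherwise results go to STDOUT
    if active_sites:
        cmd.append("-as")              # needs active_site.dat in pfam_dir
    return cmd


def run_pfam_scan(fasta: str, pfam_dir: str,
                  outfile: Optional[str] = None,
                  active_sites: bool = False) -> str:
    """Run pfam_scan.pl and return whatever it printed to STDOUT.

    Assumes pfam_scan.pl is on PATH; raises CalledProcessError on failure.
    """
    result = subprocess.run(build_command(fasta, pfam_dir, outfile, active_sites),
                            check=True, capture_output=True, text=True)
    return result.stdout


# Same command as the example above, shown as an argument list.
print(build_command("temp.pep", "."))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.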
Output from pfam_scan.pl
Each output line contains the following information:
<seq id> <alignment start> <alignment end> <envelope start> <envelope end> <hmm acc> <hmm name> <type> <hmm start> <hmm end> <hmm length> <bit score> <E-value> <significance> <clan> <predicted_active_site_residues>
Example output (with -as option):
Y74C9A.3 12 225 11 225 PF05891.8 Methyltransf_PK Family 2 218 218 312.5 1.2e-93 1 CL0063
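To work with the results downstream, each output line can be split into the columns listed above. Here is a small Python sketch; the PfamHit and parse_hit names are mine, and I am assuming that fields are whitespace-separated, that lines starting with "#" are comments to be skipped, and that the active-site column is simply absent when nothing was predicted.

```python
from typing import NamedTuple, Optional


class PfamHit(NamedTuple):
    seq_id: str
    ali_start: int
    ali_end: int
    env_start: int
    env_end: int
    hmm_acc: str
    hmm_name: str
    hmm_type: str
    hmm_start: int
    hmm_end: int
    hmm_length: int
    bit_score: float
    e_value: float
    significance: int
    clan: str
    active_sites: Optional[str]  # raw text of the last column, if present


def parse_hit(line: str) -> Optional[PfamHit]:
    """Parse one output line; return None for blank or comment lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    f = line.split()
    if len(f) < 15:
        raise ValueError("expected at least 15 fields, got %d" % len(f))
    return PfamHit(
        seq_id=f[0],
        ali_start=int(f[1]), ali_end=int(f[2]),
        env_start=int(f[3]), env_end=int(f[4]),
        hmm_acc=f[5], hmm_name=f[6], hmm_type=f[7],
        hmm_start=int(f[8]), hmm_end=int(f[9]), hmm_length=int(f[10]),
        bit_score=float(f[11]), e_value=float(f[12]),
        significance=int(f[13]), clan=f[14],
        active_sites=" ".join(f[15:]) or None,
    )


hit = parse_hit("Y74C9A.3 12 225 11 225 PF05891.8 Methyltransf_PK Family "
                "2 218 218 312.5 1.2e-93 1 CL0063")
print(hit.hmm_name, hit.e_value, hit.clan)
```

The active-site column is kept as raw text because its exact format is not described here.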
Note that Pfam groups together families with a common evolutionary ancestor into clans. If there are overlapping matches within a clan, pfam_scan.pl only shows the most significant match (the one with the lowest E-value) within the clan.
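To make the clan rule concrete, here is a toy Python illustration of "keep only the lowest E-value match among overlapping matches in the same clan". It is not pfam_scan.pl's actual code. The Hit class, the family names, and the coordinates are all invented, and I am assuming that overlap is judged on a single start/end pair per match and that families outside any clan are labelled "No_clan".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    seq_id: str
    start: int
    end: int
    hmm_name: str
    e_value: float
    clan: str


def overlaps(a: Hit, b: Hit) -> bool:
    """True if two hits on the same sequence share at least one residue."""
    return a.seq_id == b.seq_id and a.start <= b.end and b.start <= a.end


def best_per_clan(hits: List[Hit]) -> List[Hit]:
    """Greedy filter: visit hits from lowest E-value up, and drop any hit that
    overlaps an already-kept hit from the same clan."""
    kept: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.e_value):
        clash = any(
            hit.clan != "No_clan" and hit.clan == k.clan and overlaps(hit, k)
            for k in kept
        )
        if not clash:
            kept.append(hit)
    return kept


# Hypothetical matches on one protein.
hits = [
    Hit("seq1", 10, 100, "famA", 1e-50, "CL0063"),
    Hit("seq1", 50, 120, "famB", 1e-10, "CL0063"),   # overlaps famA, same clan
    Hit("seq1", 200, 300, "famC", 1e-5, "CL0063"),   # same clan, no overlap
    Hit("seq1", 20, 90, "famD", 1e-3, "CL0001"),     # overlaps famA, other clan
]
print([h.hmm_name for h in best_per_clan(hits)])
```

In this made-up example only famB is dropped: it overlaps famA, belongs to the same clan, and has the higher E-value.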
Memory and Run-time
I found that for a protein FASTA file of 5000 sequences (some C. elegans protein sequences), pfam_scan.pl needed about 1000 MB of memory (RAM) to run (I requested 2000 MB when I submitted it to our compute farm). It took about 1 hour and 20 minutes to run.
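As a very rough planning aid, the timing above works out to a little under one second per sequence. The Python sketch below extrapolates from that single data point. It assumes run time grows linearly with the number of sequences, which I have not checked: sequence length, hardware, and Pfam release will all shift it, and it says nothing about memory.

```python
# Single observation from the run described above.
BENCH_SEQUENCES = 5000
BENCH_MINUTES = 80  # 1 hour 20 minutes


def estimated_minutes(n_sequences: int) -> float:
    """Linear extrapolation from one benchmark; treat as an order-of-magnitude guess."""
    return n_sequences * BENCH_MINUTES / BENCH_SEQUENCES


# Seconds per sequence implied by the benchmark, and a guess for a larger file.
print(BENCH_MINUTES * 60 / BENCH_SEQUENCES)
print(estimated_minutes(20000))
```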