Friday 5 April 2013

Running Interproscan on a gene set

 A useful way to annotate a genome is to run InterProScan on your gene set.

[Note: this is really a note-to-self, as it's only useful to Sanger users.]

Note: some of the below has recently become obsolete, so I've greyed it out.

Using Andrew Page's annotate_eukaryotes script:
At Sanger, you can run interproscan by using the annotate_eukaryotes script by Andrew Page, for example:
% annotate_eukaryotes -a Sratti_v5.2.genes_prot.fa -o Sratti_v5.interpro.gff
where Sratti_v5.2.genes_prot.fa is your input fasta file of proteins, Sratti_v5.interpro.gff is your interproscan output file.

Note: the Interproscan results give GO terms associated with the proteins. My colleague Bernardo Foth has a script for making a file of the GO terms for proteins (/nfs/users/nfs_b/bf3/bin/interproscanResults_parse_GOterms_01.pl).

Using iprscan_chado:
At Sanger, you can do this using the iprscan_chado:
1) Log into a screen window, eg. on farm2-head4:
    % screen -RD
2) Edit the .iprscan file in your home directory (eg. /nfs/users/nfs_a/alc/.iprscan) so that it has your team for submitting farm jobs (eg. team133) eg.
    LSB_DEFAULTPROJECT=team133
    export LSB_DEFAULTPROJECT

3) Start the InterProScan job using:
    % iprscan_chado -f <fasta>
    where <fasta> is your fasta file
    eg.
    % iprscan_chado -f Sratti_v4.genes_prot.fa
    This will submit your jobs to the farm, there is no need to use 'bsub'.
    The results will appear in file <fasta>.interpro, eg. Sratti_v4.genes_prot.fa.interpro
    Note: I found that the iprscan_chado program crashes if you have '|' symbols in your gene names, so you will need to rename your genes, and try to run it again if this is the case.

Using Magdalena's script to break up the input file before running InterProScan:
1) Log into pcs4
2) Figure out how many bytes are in your input fasta file (using 'ls -al'). If it is a soft-link, make sure you take the size of the original file.
3) Then work out how many bytes you need per file in order to get ~1000 sequences per file:
    (bytes_full_file * 1000)/seqs_full_file
    where bytes_full_file is the number of bytes in the input fasta file (from step 2), seqs_full file is the number of sequences in the input fasta file.
    This will give us a number of bytes_per_file. 
4) Use Magdalena's script to break up the input fasta file into files of about bytes_per_file bytes each:
    % perl ~mz3/bin/perl/interpro_scan_splitter.pl input_fasta prefix bytes_per_file
    where input_fasta is your input fasta file, prefix is the prefix to give to each of the smaller fasta files, bytes_per_file is the number of bytes each of the smaller fasta files should be.
    eg.
    % perl ~mz3/bin/perl/interpro_scan_splitter.pl Sratti_v5.genes_prot.fa SrattiV5 490000
    This will make many smaller files (usually ~10-20) with some of the sequences from your input fasta file in each smaller file. The smaller files will be called prefix_1, prefix_2, etc., eg. SrattiV5_1, SrattiV5_2, etc.
5) Open a screen session. If the smaller files made in step (4) are called SrattiV5_1, SrattiV5_2, etc. then in the first screen window type:
    % iprscan_chado -f SrattiV5_1
    In the second screen window type:
    % iprscan_chado -f SrattiV5_2
    In the third screen window type:
    % iprscan_chado -f SrattiV5_3
    and so on, until you have started iprscan_chado running for each of the smaller files made in step (4).

Using Magdalena's script to run pfamscan:
% perl -w /nfs/users/nfs_m/mz3/bin/perl/pfamscan_splitter.pl Sratti_v5.genes_prot.fa srattipfam 490000
This will make many smaller files (usually ~10-20) with some of the sequences from your input fasta file in each smaller file. The smaller files will be called prefix_1, prefix_2, etc., eg. srattipfam_1, srattipfam_2, etc.
The pfam jobs are submitted to the farm automatically.

Note added 29-Apr-2016: This doesn't seem to work anymore. See my blog on pfam_scan.pl instead.

Difference between interproscan and pfamscan
Interproscan is more comprehensive as it includes scanning pfam with other things, and it will tell you things such as how many domains of a certain type there are in a protein.

Thanks to Magdalena Zarowiecki for this.

No comments: