Tuesday 9 July 2013

bigger_blast.pl

This is of interest to Sanger users only.

Update (October 2013): farm_blast
Martin Hunt has replaced bigger_blast.pl and blast_splitter.py with farm_blast.
For example:
% farm_blast -p blastp seq_update.fasta Bm.protein.fa --outdir inv_output 
which runs blastp using Bm.protein.fa as the query file and seq_update.fasta as the database fasta file, and puts the results in the output directory inv_output (which shouldn't exist beforehand). You don't need to submit this with bsub; farm_blast will bsub the jobs itself.

By default, farm_blast requests 500 Mbyte of memory. To request 1000 Mbyte (1 Gbyte), you can type:
% farm_blast -p blastp seq_update.fasta Bm.protein.fa --outdir inv_output --blast_mem 1.0

The output from farm_blast is in m8 (tabular) format. The columns of m8 format are:
query id, subject id, % identity, alignment length, mismatches, gap opens, q. start, q. end, s. start, s. end, evalue, bit score
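For example, you can keep only the stronger hits by filtering on the e-value column (column 11) with awk. A minimal sketch, assuming your farm_blast output file is called out.blast (the file names here are just placeholders):
% awk '$11 < 1e-05' out.blast > out.filtered.blast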

Another thing to know about farm_blast is that by default it splits up large sequences into chunks of 500000 bp. You can change the chunk size using the --split_bases option.
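For example, to split into chunks of 200,000 bp instead, something like this should work (the 200000 here is just an illustrative value):
% farm_blast -p blastp seq_update.fasta Bm.protein.fa --outdir inv_output --split_bases 200000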

Running blast yourself 
You can run blast yourself by typing:
% makeblastdb -in mydb.fa -input_type fasta -dbtype prot
This formats a protein fasta file mydb.fa as the blast database.
Then run blast:
% blastp -query myquery.fa -db mydb.fa -task blastp -outfmt 6 -seg yes
This runs blastp between myquery.fa and database mydb.fa.
If you want SEG to be run, include '-seg yes' (the default is no SEG filtering).

Note: for blastn, use '-dust yes' instead of '-seg yes'.
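So a nucleotide search would look something like this (a sketch, assuming a nucleotide database mydb_nt.fa and query myquery_nt.fa; these file names are just placeholders):
% makeblastdb -in mydb_nt.fa -input_type fasta -dbtype nucl
% blastn -query myquery_nt.fa -db mydb_nt.fa -task blastn -outfmt 6 -dust yes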


The rest of this blog post (below) is now obsolete, so I've changed it to pale grey.

--------------------------------------------------------------------------------------------------------------------------

Bigger_blast.pl
The path-dev group at Sanger have a script called bigger_blast.pl that can be used to run blast jobs in parallel. You can get all the command-line options by typing:
% bigger_blast.pl
It can be run for example by typing:
% bigger_blast.pl -x -o myout2 /data/blastdb/Supported/nr my.fa
where -x selects blastx, -o specifies a prefix for the output file, /data/blastdb/Supported/nr is the database to run blast against, and my.fa is the query file.

The output files will be called myout2.crunch.gz and myout2.tab.gz.

The myout2.crunch.gz file is in the format produced by 'MSPcrunch -d' (ie. a crunch format file, although bigger_blast.pl doesn't actually use MSPcrunch to make it), and will look like this (you can look at it by typing 'gunzip -c myout2.crunch.gz'):
72 49 11 81 SRAE_2000357600.t1:mRNA 2 72 E2AMU5.1 E2AMU5_CAMFO Histone-lysine N-methyltransferase SETMAR (Fragment)
70 48 13 83 SRAE_2000357600.t1:mRNA 22 92 E2BZV4.1 E2BZV4_HARSA Histone-lysine N-methyltransferase SETMAR (Fragment)
70 43 8 83 SRAE_2000357600.t1:mRNA 260 335 D2WLV0.1 D2WLV0_AGRPL Transposase
67 43 10 83 SRAE_2000357600.t1:mRNA 132 205 F4WLV9.1 F4WLV9_ACREC Histone-lysine N-methyltransferase SETMAR
67 45 10 83 SRAE_2000357600.t1:mRNA 20 93 E2C0B8.1 E2C0B8_HARSA Histone-lysine N-methyltransferase SETMAR (Fragment)

. . .
The myout2.tab.gz file is an EMBL feature table with one feature per hit.
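As with the crunch file, you can take a quick look at it by typing, for example:
% gunzip -c myout2.tab.gz | head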

[Note: bigger_blast.pl doesn't run on farm3 yet.]

Running blastp
To run blastp, use '-p' instead of '-x'.

Setting the queue
To set the farm queue, use the '-q' option, eg.
% bigger_blast.pl -q yesterday -x -o myout2 /data/blastdb/Supported/nr my.fa

Setting the number of jobs to run
By default, bigger_blast.pl submits 16 jobs. To submit more, use the '-j' option, eg.
% bigger_blast.pl -j 100 -x -o myout2 /data/blastdb/Supported/nr my.fa


Making an MSPcrunch format file from normal blast output
You can also run blast normally and then make an MSPcrunch format file afterwards by typing:
% bigger_blast.pl blast_output.txt
This should take the output of your blast search and convert it to crunch format.


Alternative to bigger_blast.pl
An alternative to bigger_blast.pl is Martin Hunt's blast_splitter.py script. For example:
% blast_splitter.py --protein_ref --splitmem=7 test.fa uniprot.renamed.fa ./blast_splitter 250000 -e 0.05 -p blastp -m8
where test.fa is your query fasta file of proteins and uniprot.renamed.fa is the database (reference) fasta file. Magdalena suggested using the --splitmem=5 or --splitmem=7 option. The 250000 means that test.fa is split into smaller files of 250,000 residues (amino acids here) each for running blast. The output will be in a subdirectory (called 'blast_splitter' here), in a file called 'all.blast' that contains all the blast output.
