I have recently been discussing with colleagues how to speed up some large BLASTX jobs. Here are several ideas that we came up with:
1). Reduce the database you are searching:
Just take a selection of representative species for the taxa of interest (eg. bacteria), rather than all species. This will probably not miss many proteins. However, bear in mind that some species have species-specific or even strain-specific genes, especially in bacteria, and these could be missed.
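One way to build such a reduced database is to filter a protein FASTA file by organism name. The sketch below assumes NCBI-style headers, where the organism appears in square brackets at the end of the description (eg. '[Escherichia coli]'). The function name and the file names are made up for illustration.

```shell
# Keep only FASTA records whose header mentions one of the organisms listed
# (one name per line) in a species file. A name such as "Escherichia coli"
# also matches strain-level headers like "[Escherichia coli K-12]".
# Usage: filter_fasta_by_species species.txt all_proteins.fasta > subset.fasta
filter_fasta_by_species() {
    awk '
        # First file: remember each organism name, prefixed with "[" as in the headers.
        NR == FNR { if ($0 != "") wanted["[" $0] = 1; next }
        # Header line: decide whether this record is kept.
        /^>/ {
            keep = 0
            for (name in wanted)
                if (index($0, name) > 0) { keep = 1; break }
        }
        # Print the header and sequence lines of kept records only.
        keep
    ' "$1" "$2"
}
```

The resulting subset.fasta would then need to be formatted as a BLAST database in the usual way (eg. with makeblastdb if you are using BLAST+).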
2). Search a non-redundant database:
You can download the whole NCBI nr database from the NCBI ftp site. However, it is pretty big (14 Gbyte unzipped). Also, it has all species in it, so it can be slow to extract a subset of sequences (ie. idea (1) above). Instead, you can download RefSeq sequences for a particular species or set of species by going to the NCBI Protein website, setting 'Limits' to Field=Organism and Source Database=RefSeq, and then searching for a particular species (eg. 'Escherichia coli') in the search box.
3). Use a larger BLAST word size:
The word size can be set in BLAST using the -W option (in the newer BLAST+ programs the equivalent option is -word_size). By default it is set to 3 for proteins. If you set it to 5, it will speed up your BLAST search, although with some loss of sensitivity.
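As a rough sketch, this is how the word size could be passed to the BLAST+ blastx program. The wrapper function name and the file names are placeholders, and the other options (-outfmt 6, -evalue 0.001) are just example settings. Whether a particular word size works may depend on your BLAST version, so it is worth trying it on a small query first.

```shell
# Run blastx with a chosen protein word size (larger = faster, less sensitive).
# Usage: blastx_word_size query.fasta protein_db hits.tsv 5
blastx_word_size() {
    blastx -query "$1" -db "$2" -out "$3" \
           -outfmt 6 -evalue 0.001 -word_size "$4"
}
```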
4). Stripe your files in a /lustre filesystem:
If you are using a /lustre filesystem, it will speed up your BLAST search if you stripe the files in the directory. You can do this by typing:
% lfs setstripe <dir> -c -1
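where <dir> is the directory containing your BLAST database. Note that the stripe settings of a directory only apply to files created in it afterwards, so set them before copying the database files in.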
This is a good idea as you will probably be running many BLAST jobs against this single database at the same time. (By striping a large database file across all OSTs, it maximises the IO bandwidth. As the file is large, the overhead in opening the striped file is negligible compared to the time it takes to read the data.)
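Putting this together, a minimal sketch of setting up a striped copy of a database might look like the following. The function name and the example paths are hypothetical; the lfs setstripe and lfs getstripe commands are the standard Lustre ones.

```shell
# Stripe a new directory across all OSTs, then copy the database files in
# so that the copies are created with the striped layout.
# Usage: stripe_and_copy_db /lustre/scratch/blastdb /data/blastdb_original
stripe_and_copy_db() {
    target_dir=$1
    source_dir=$2
    mkdir -p "$target_dir"
    lfs setstripe -c -1 "$target_dir"   # -c -1 = stripe over all available OSTs
    cp "$source_dir"/* "$target_dir"/   # new files inherit the directory's layout
    lfs getstripe "$target_dir"         # print the layout so you can check it
}
```

Files that were already in the directory before the setstripe call keep their old layout, which is why the sketch copies the database in afterwards.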
5). Use megablast instead of BLASTN:
If you are trying to speed up BLAST searches between very similar sequences, Megablast is faster than normal BLASTN.
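With the BLAST+ programs, megablast is selected through the -task option of blastn (as far as I know it is in fact the default task there, with '-task blastn' giving the traditional, more sensitive search). A minimal sketch, with a made-up function name and placeholder file names:

```shell
# Nucleotide search tuned for very similar sequences.
# Usage: megablast_search query.fasta nucleotide_db hits.tsv
megablast_search() {
    blastn -task megablast -query "$1" -db "$2" -out "$3" -outfmt 6
}
```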
Thank you for posting this article. It was quite helpful.
I had to try -word_size 4 instead of 5. If I set it to 5, blastx seg-faults.
> blastx -query test.fasta -db nr.01 -out output -outfmt 6 -evalue 0.001 -max_target_seqs 50 -word_size 5 -dbsize 12354844215
> 3637 Segmentation fault
> blast returned code: 139
I wonder why?
Also, I've noticed that setting -window_size to some large number is quite effective in reducing the amount of memory required, and the processing time. It finds fewer hits, but for some large queries it was necessary in order to get my search completed in time.