I'm interested in finding conserved non-coding sequences between two related species of worms.
First I took the introns, UTRs, and intergenic regions from the first species, and tried comparing them to the genome of the second species using exonerate, but that was very slow. I then tried BLAT, which was a little faster. Then I tried BLASTN, which was nice and speedy!
It's been a while since I ran BLAST on the Sanger farm, so I needed to remind myself how to run it, even though I wrote posts about it ages ago (e.g. on farm_blast and on speeding up BLAST jobs).
This is what I did now:
Find the BLAST module on the farm and load it (only applicable to Sanger users):
% module avail -t | grep -i blast
blast/2.7.1=h96bfa4b_5
% module load blast/2.7.1=h96bfa4b_5
Make a blast database:
% makeblastdb -in genome2.fa -dbtype nucl
Run blast:
% blastn -db genome2.fa -query genome1_intronsandutrsandintergenic.fa -out myoutput.blast -outfmt 6
One thing I always forget is what the columns in the BLAST m8 format (the tabular format, i.e. -outfmt 6) are, so I have to look at this nice webpage.
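As a note to self, here is a minimal Python sketch of the 12 default columns of -outfmt 6 output, in order. The column names are the standard BLAST+ defaults; the function name parse_hit and the example line are made up for illustration and are not from a real run.

```python
# The 12 default columns of BLAST tabular output (-outfmt 6), in order.
BLAST_TABULAR_COLUMNS = [
    "qseqid",    # query sequence id
    "sseqid",    # subject (database) sequence id
    "pident",    # percent identity
    "length",    # alignment length
    "mismatch",  # number of mismatches
    "gapopen",   # number of gap openings
    "qstart",    # start of alignment in the query
    "qend",      # end of alignment in the query
    "sstart",    # start of alignment in the subject
    "send",      # end of alignment in the subject
    "evalue",    # expect value
    "bitscore",  # bit score
]

INT_COLUMNS = {"length", "mismatch", "gapopen", "qstart", "qend", "sstart", "send"}
FLOAT_COLUMNS = {"pident", "evalue", "bitscore"}


def parse_hit(line):
    """Turn one tab-separated BLAST line into a dict keyed by column name.

    parse_hit is a hypothetical helper, not part of BLAST.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(BLAST_TABULAR_COLUMNS):
        raise ValueError(
            "expected %d columns, got %d" % (len(BLAST_TABULAR_COLUMNS), len(fields))
        )
    hit = {}
    for name, value in zip(BLAST_TABULAR_COLUMNS, fields):
        if name in INT_COLUMNS:
            hit[name] = int(value)
        elif name in FLOAT_COLUMNS:
            hit[name] = float(value)
        else:
            hit[name] = value
    return hit


# Made-up example line, just to show the layout.
example = "intron_1\tscaffold_12\t97.50\t200\t5\t0\t1\t200\t10450\t10649\t1e-90\t340"
hit = parse_hit(example)
print(hit["sseqid"], hit["pident"], hit["length"])
```

Note that on minus-strand hits sstart is larger than send, which is worth remembering when turning hits into coordinates.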
Note that by default the blastn command runs Megablast (the default value of -task), which is a fast algorithm that looks for matches of high percent identity. I'm interested in high percent identity matches, so I stuck with the default.
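Since it's the high percent identity matches I'm after, the tabular output can also be filtered afterwards. This is a rough Python sketch, assuming the default 12-column -outfmt 6 layout (percent identity in column 3, alignment length in column 4). The thresholds of 90% identity and 50 bp, the function name filter_hits and the example lines are arbitrary choices for illustration, not values from my run.

```python
def filter_hits(lines, min_pident=90.0, min_length=50):
    """Yield tabular BLAST lines that pass both thresholds.

    The thresholds are illustrative assumptions; pick ones that suit the data.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        pident = float(fields[2])  # column 3: percent identity
        length = int(fields[3])    # column 4: alignment length
        if pident >= min_pident and length >= min_length:
            yield line


# Made-up example lines, not real results.
example_lines = [
    "utr_7\tscaffold_3\t98.00\t120\t2\t0\t1\t120\t500\t619\t2e-50\t200",
    "utr_7\tscaffold_9\t85.00\t300\t45\t0\t1\t300\t100\t399\t1e-60\t250",
    "intron_2\tscaffold_3\t100.00\t30\t0\t0\t5\t34\t900\t929\t1e-08\t56.5",
]
kept = list(filter_hits(example_lines))
print(len(kept))

# On a real file, the same function works on an open file handle:
# with open("myoutput.blast") as handle:
#     for hit_line in filter_hits(handle):
#         print(hit_line)
```

Only the first example line passes here: the second is below 90% identity and the third is shorter than 50 bp.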
Alternatives to BLAST:
An alternative to BLAST is nucmer, part of the mummer package, which I wrote a post on ages ago (see here). Note to self: nucmer is part of the mummer module on the Sanger farm.
I asked my colleagues what they are using nowadays for whole-genome alignments, and they mentioned a few other tools:
- my colleague Eerik Aunin mentioned SibeliaZ, which is tailored for aligning highly similar genomes, e.g. strains of the same species;
- my colleague Faye Rodgers mentioned Cactus, which can be used to make alignments of 1000s of vertebrate genomes;
- my colleague Ana Protasio mentioned Satsuma.
Regarding finding conserved non-coding regions, my colleague James Cotton mentioned PhastCons.