Thursday 30 May 2013

Using BLAT to align ESTs/cDNAs to a genome

BLAT, written by Jim Kent, can be used to align ESTs or cDNAs to a genome. It is extremely fast.
An alternative is exonerate, which is slower but more accurate than BLAT.
Note that you can also use BLAT to align proteins (query) to proteins (in a database), or to align proteins (query) to a genome (DNA).

Aligning ESTs/cDNAs to a genome using BLAT
% blat assembly.fa ests.fa out.blat -out=blast8 -t=dna -q=dna
where assembly.fa is your assembly fasta file,
ests.fa is your fasta file of ESTs,
out.blat is the output file name,
-out=blast8 means the output format will be BLAST m8 format (by default the format is psl format),
-t=dna tells BLAT the database is DNA,
-q=dna tells BLAT the query is DNA.
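
The BLAST m8 (blast8) output is tab-delimited with twelve columns per hit, so it is easy to post-process. Here is a minimal Python sketch for reading it. The function name parse_blast8 and the field names are illustrative choices, not part of BLAT, and the sketch assumes the standard m8 column order (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, e-value, bit score). The example line is made up.

```python
# Column names and types for BLAST m8 tabular output.
# The names are illustrative; the order is the standard m8 column order.
BLAST8_FIELDS = [
    ("query", str), ("subject", str), ("pct_identity", float),
    ("aln_length", int), ("mismatches", int), ("gap_opens", int),
    ("q_start", int), ("q_end", int), ("s_start", int), ("s_end", int),
    ("evalue", float), ("bit_score", float),
]


def parse_blast8(lines):
    """Yield one dict per hit from an iterable of blast8 lines."""
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and any comment lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != len(BLAST8_FIELDS):
            raise ValueError("expected 12 tab-separated columns: %r" % line)
        yield {name: cast(value)
               for (name, cast), value in zip(BLAST8_FIELDS, cols)}


# A made-up hit line, just to show the shape of the result.
example = "est1\tscaffold_1\t98.50\t200\t3\t0\t1\t200\t5001\t5200\t1e-100\t380.0"
hit = next(parse_blast8([example]))
print(hit["query"], hit["subject"], hit["s_start"], hit["s_end"])
```

To read real output, pass an open file instead of a list, for example list(parse_blast8(open("out.blat"))).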

Aligning proteins to a genome using BLAT
% blat assembly.fa proteins.fa out.blat -out=blast8 -t=dnax -q=prot
where assembly.fa is your assembly fasta file,
proteins.fa is your fasta file of proteins,
out.blat is the output file name,
-out=blast8 means the output format will be BLAST m8 format (by default the format is psl format),
-t=dnax tells BLAT the database is DNA that should be translated in six frames to protein,
-q=prot tells BLAT the query is protein.

Aligning a short 44-bp sequence to Illumina reads using BLAT
I wanted to use BLAT to search for a short 44-bp sequence in some Illumina reads. I found that I needed to use -tileSize=8, as otherwise BLAT missed the 44-bp sequence in many reads that actually contain it, and also got the coordinates slightly wrong. With -tileSize=8 it worked much better: it found the cases I expected to find, and got the coordinates right.
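
One way to get a feel for why the tile size matters for short queries: BLAT seeds alignments from exact matches to tiles of tileSize bases, spaced stepSize apart (stepSize defaults to tileSize). The Python sketch below is only an illustration of that idea, not BLAT's actual code, and the function name worst_case_tiles is hypothetical. It assumes tiles start at multiples of stepSize and counts how many tiles fall entirely inside a perfect match of a given length, for the least favourable position of the match.

```python
def worst_case_tiles(match_len, tile_size, step_size=None):
    """Fewest tiles lying fully inside a perfect match of match_len bases.

    Assumes tiles start at every multiple of step_size (an illustrative
    simplification); step_size defaults to tile_size.
    """
    if step_size is None:
        step_size = tile_size
    # Number of possible start positions for a tile inside the match.
    window = match_len - tile_size + 1
    if window <= 0:
        return 0
    # Try every offset of the match relative to the tile grid and keep
    # the smallest count of tile starts that land inside the window.
    return min(
        sum(1 for start in range(offset, offset + window)
            if start % step_size == 0)
        for offset in range(step_size)
    )


for tile in (11, 8):
    print(tile, worst_case_tiles(44, tile))
# prints:
# 11 3
# 8 4
```

With so few tiles to seed on, a sequencing error inside one of them leaves even less to work with, which may be part of why the smaller tile size helped here.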

Later, I tried searching for an even shorter 21-bp sequence, and found that I also had to use -minScore=0 -stepSize=1 -repMatch=50000000 to ensure that BLAT reported hits that I knew were there. This is described in the BLAT FAQ under the heading 'How do I configure BLAT for short sequences with maximum sensitivity?'. See also here for more about searching for short matches.
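
When checking whether BLAT is missing hits for a very short query, it helps to have an independent list of the reads that contain the sequence exactly. The Python sketch below does a plain string search on both strands. It is not part of BLAT, the function names are hypothetical, the reads and query are made up, and it only finds perfect matches, so it gives a lower bound on what BLAT should report.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_exact(reads, query):
    """Find exact occurrences of query in reads, on either strand.

    reads maps read id -> sequence. Returns a list of
    (read_id, start, end, strand) with 1-based inclusive coordinates
    on the read as given.
    """
    query = query.upper()
    patterns = [("+", query)]
    rc = reverse_complement(query)
    if rc != query:  # do not report a palindromic query twice
        patterns.append(("-", rc))
    hits = []
    for strand, pattern in patterns:
        for read_id, seq in reads.items():
            seq = seq.upper()
            pos = seq.find(pattern)
            while pos != -1:
                hits.append((read_id, pos + 1, pos + len(pattern), strand))
                # Continue one base on so overlapping matches are found.
                pos = seq.find(pattern, pos + 1)
    return hits


# Made-up toy reads and a made-up 7-bp query.
reads = {
    "read1": "TTACGTTGACC",   # contains the query on the + strand
    "read2": "GGTCAACGTAA",   # contains its reverse complement
    "read3": "AAAAAAAAAA",    # no match
}
for hit in find_exact(reads, "ACGTTGA"):
    print(hit)
```

Any read listed here but absent from the BLAT output is a hit that BLAT missed with the current settings.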

Thanks
Thanks to John Liechty for advice on using BLAT to align proteins to a genome.
