Aligning ESTs/cDNAs to a genome using exonerate
To align ESTs (or EST clusters) or cDNAs to a genome sequence, the program exonerate can be used. This can be run using the command:
% exonerate --model est2genome --bestn 10 est.fa genome.fa
where 'est.fa' is your fasta file of ESTs, and 'genome.fa' is your fasta file of scaffolds/chromosomes for your genome. The '--bestn 10' option finds the top 10 matches of each EST in the genome.
Other useful options are:
-t genome.fa : the fasta file 'genome.fa' of scaffolds/chromosomes for your genome,
-q est.fa : the fasta file 'est.fa' of ESTs,
--showtargetgff : give the output in gff format,
--showalignment N : do not show the alignment between the EST and genome in the output,
--refine full : this performs a more accurate alignment between the EST and genome, but will be slower (I found this very slow),
--model coding2genome : the exonerate man page says that this is similar to '--model est2genome', but that the query sequence is translated during comparison, allowing a more sensitive comparison.
In practice, I found that it is fastest to run exonerate by putting each EST cluster into a separate fasta file, and then running exonerate for each EST cluster against the genome assembly. For example, to align Parastrongyloides trichosuri EST clusters to the genome assembly, I put the EST sequences in files seq1, seq2, seq3, etc. and ran the following shell script:
foreach file (seq*)
exonerate -t genome.fa -q $file --model coding2genome --bestn 1 --showtargetgff > $file.out
I found that this required quite a lot of memory (RAM), so I requested 500 Mbyte of RAM when I submitted it to the Sanger compute farm.
This produced files seq1.out, seq2.out, seq3.out, etc. with the exonerate results for EST clusters seq1, seq2, seq3, etc.
Each exonerate gene prediction is given a score. I find that gene predictions with scores of at least 1000 look pretty good.
Getting the predicted proteins in ESTs from exonerate
If you use the --model coding2genome option, exonerate will give you a protein alignment for the predicted gene and the EST that you align it to. If you wanted to get the predicted protein from the EST that is in that alignment, you could use this extra option:
exonerate --ryo ">%qi (%qab - %qae)\n%qcs\n"
(see here and here).
Aligning proteins to a genome using exonerate
To align proteins to a genome, you need to use '--model protein2genome' instead of '--model coding2genome', eg.:
% exonerate -t genome.fa -q protein.fa --model protein2genome --bestn 1 --showtargetgff
Running BLAST first, and then running exonerate for the genome regions with BLAST hits:
Exonerate can be a bit slow to run, so if you want to speed things up (with some loss of sensitivity), you can run BLAST first to find the regions of a genome assembly with hits to your query EST or protein, and then run exonerate to align the query EST/protein to the region of the BLAST hit.
You can use my script run_exonerate_after_blast.pl for this, eg.
% run_exonerate_after_blast.pl assembly.fa proteins.fa exonerate_output . 0.05 25000 /software/pubseq/bin/ncbi_blast+/
where assembly.fa is your input assembly file,
proteins.fa is the file of proteins you want to align using exonerate,
exonerate_output is the exonerate output file, processed to just have some of the GFF lines,
0.05 is the BLAST e-value cutoff to use,
25000 means that we run exonerate on the BLAST hit region plus 25,000 bp on either side,
/software/pubseq/bin/ncbi_blast+/ is the path to the directory with BLAST executables.
[Note to self: you will probably need to submit this to the 'long' queue (48 hour time-limit) on the Sanger farm, with 1000 Mbyte of RAM.]
If the job runs out of run-time on the Sanger farm before finishing, the output so farm will be in a large file called tmp0.xxx... where xxx... are digits. You can convert this to the format the the output should be in by using my script convert_exonerate_gff_to_std_gff.pl.
Aligning intron regions of one genome to another genome
In this example, I had a file of intron sequences from one species, and wanted them to align them to the full genome of a second species. I decided to use the 'ryo' (roll-your-own) output format, which allows you to specify your own output format:
% exonerate --model affine:local --revcomp --ryo "%qi %qab %qae %ti %tab %tae %s %pi" --showalignment no --bestn 1 introns.fa genome2.fa
where -model affine:local will give local alignments,
--revcomp looks on both strands for matches,
--ryo "%qi %qab %qae %ti %tab %tae %s %pi" prints out the query id., start and end of the alignment in the query, target id., start and end of the alignment in the target, alignment score, and percent identity of the alignment.
--showalignment no: means the alignment isn't shown
--bestn 1 : just give the best hit for each query.
Other useful exonerate options
--model protein2genome:bestfit : this includes the entire protein in the alignment.
--exhaustive yes : makes more accurate alignments (but is slower).