Monday, 18 February 2013

Using exonerate to align ESTs/cDNAs/proteins to a genome

Aligning ESTs/cDNAs to a genome using exonerate
To align ESTs (or EST clusters) or cDNAs to a genome sequence, the program exonerate can be used. This can be run using the command:
% exonerate --model est2genome --bestn 10 est.fa genome.fa
where 'est.fa' is your fasta file of ESTs, and 'genome.fa' is your fasta file of scaffolds/chromosomes for your genome. The '--bestn 10' option finds the top 10 matches of each EST in the genome.

Other useful options are:
-t genome.fa : the fasta file 'genome.fa' of scaffolds/chromosomes for your genome,
-q est.fa : the fasta file 'est.fa' of ESTs,
--showtargetgff : give the output in gff format,
--showalignment N : do not show the alignment between the EST and genome in the output,
--refine full : this performs a more accurate alignment between the EST and genome, but will be slower (I found this very slow),
--model coding2genome : the exonerate man page says that this is similar to '--model est2genome', but that the query sequence is translated during comparison, allowing a more sensitive comparison.

In practice, I found that it is fastest to run exonerate by putting each EST cluster into a separate fasta file, and then running exonerate for each EST cluster against the genome assembly. For example, to align Parastrongyloides trichosuri EST clusters to the genome assembly, I put the EST sequences in files seq1, seq2, seq3, etc. and ran the following shell script:

#!/usr/bin/env tcsh
foreach file (seq*)
      echo "$file"
      exonerate -t genome.fa -q $file --model coding2genome --bestn 1 --showtargetgff > $file.out
end

I found that this required quite a lot of memory (RAM), so I requested 500 Mbyte of RAM when I submitted it to the Sanger compute farm.
This produced files seq1.out, seq2.out, seq3.out, etc. with the exonerate results for EST clusters seq1, seq2, seq3, etc.

Each exonerate gene prediction is given a score. I find that gene predictions with scores of at least 1000 look pretty good.

Getting the predicted proteins in ESTs from exonerate
If you use the --model coding2genome option, exonerate will give you a protein alignment for the predicted gene and the EST that you align it to. If you wanted to get the predicted protein from the EST that is in that alignment, you could use this extra option:
exonerate --ryo ">%qi (%qab - %qae)\n%qcs\n"
(see here and here).

Aligning proteins to a genome using exonerate
To align proteins to a genome, you need to use '--model protein2genome' instead of '--model coding2genome', eg.:
% exonerate -t genome.fa -q protein.fa --model protein2genome --bestn 1 --showtargetgff

Running BLAST first, and then running exonerate for the genome regions with BLAST hits:
Exonerate can be a bit slow to run, so if you want to speed things up (with some loss of sensitivity), you can run BLAST first to find the regions of a genome assembly with hits to your query EST or protein, and then run exonerate to align the query EST/protein to the region of the BLAST hit.

You can use my script run_exonerate_after_blast.pl for this, eg.
% run_exonerate_after_blast.pl assembly.fa proteins.fa exonerate_output . 0.05 25000 /software/pubseq/bin/ncbi_blast+/
where assembly.fa is your input assembly file,
proteins.fa is the file of proteins you want to align using exonerate,
exonerate_output is the exonerate output file, processed to just have some of the GFF lines,
0.05 is the BLAST e-value cutoff to use, 
25000 means that we run exonerate on the BLAST hit region plus 25,000 bp on either side,
/software/pubseq/bin/ncbi_blast+/ is the path to the directory with BLAST executables.
[Note to self: you will probably need to submit this to the 'long' queue (48 hour time-limit) on the Sanger farm, with 1000 Mbyte of RAM.]

If the job runs out of run-time on the Sanger farm before finishing, the output so farm will be in a large file called tmp0.xxx... where xxx... are digits. You can convert this to the format the the output should be in by using my script convert_exonerate_gff_to_std_gff.pl.

Aligning intron regions of one genome to another genome

In this example, I had a file of intron sequences from one species, and wanted them to align them to the full genome of a second species. I decided to use the 'ryo' (roll-your-own) output format, which allows you to specify your own output format:

% exonerate --model affine:local --revcomp --ryo "%qi %qab %qae %ti %tab %tae %s %pi" --showalignment no --bestn 1 introns.fa genome2.fa

where -model affine:local  will give local alignments,

--revcomp looks on both strands for matches,

--ryo "%qi %qab %qae %ti %tab %tae %s %pi" prints out the query id., start and end of the alignment in the query, target id., start and end of the alignment in the target, alignment score, and percent identity of the alignment.

--showalignment no: means the alignment isn't shown

--bestn 1 : just give the best hit for each query.

Other useful exonerate options 
--model protein2genome:bestfit : this includes the entire protein in the alignment.
--exhaustive yes :  makes more accurate alignments (but is slower).

10 comments:

Unknown said...

Great post. I have also used a similar approach to yours of using BLAST prior to Exonerate. In mine, I have used Scipio, which is rather fast, to identify matching protein/genome regions. I then run Exonerate. I was wondering if you know how other aligners such as Scipio perform relative to Exonerate (not with speed, but with respect to alignment quality)... my approach has certainly worked to help me speed up Exonerate runs, but I wonder if a tool like Scipio could act as a replacement. Appreciate your thoughts!
Best,
Marcus

Unknown said...

Thank you very much. Very useful!

ramadatta said...

You blog posts helped me. Thank you.

Unknown said...

Do you have a good method for jumping from exonerate protein-to-genome alignments to complete CDS predictions?

I'm aligning proteins to genomes from related species, often I end up with an alignment lacking start/stop positions.

There has to be a sensible was to crawl out from the first and last exons to find in-frame start/stop codons.

Unknown said...
This comment has been removed by the author.
Unknown said...

Is there any way to get the output exactly the same as given by running exonerate as like this with your code:

exonerate -m p2g -q uniref90_modified.fasta -t PcapLG_18.fasta --softmasktarget yes --showvulgar no --showalignment no --showtargetgff yes --showquerygff no --geneseed 200 --percent 80 > LG18_exonerate_pcap.gff

Thanks

Unknown said...

This post is excellent. It nailed exactly what I was looking for in the first 2 paragraphs including even the GFF output !!
I feel silly now wasting time on the nightmare of the spaln alignment tool

Unknown said...

This post is excellent. It nailed exactly what I was looking for in the first 2 paragraphs including even the GFF output !!
I feel silly now wasting time on the nightmare of the spaln alignment tool

Unknown said...

Very informative blog. I have an issue with exonerate tool. I am trying to map a short peptide seq (11 aa) to a whole genome but tool does not give any output without any error. only shows -- completed exonerate analysis

My command is

./exonerate-2.2.0/setup_exo_tool/bin/exonerate --model protein2genome --query aaa.fasta --target genome.fasta

is it short length of a peptide causes failure? Any suggestion from expert ?



Raj

Avril Coghlan said...

I just noticed a question from Adam Taranto above about jumping from exonerate protein-to-genome alignments to complete CDS predictions (with proper splice sites and start / stop codons). One way I've done this in the past is to use the gff file from exonerate to create a hints file for Augustus.