Friday, 26 April 2013

Using eval to get statistics for a gene set

The eval program from Michael Brent's lab is very useful for getting summary statistics and accuracy statistics for a gene set. It is described in a paper in BMC Bioinformatics in 2003. The paper says that eval can be run via a GUI or the command-line.

The paper describes these features of eval:
(i) It can produce histograms of a particular variable (eg. the number of exons per predicted gene), and can also categorise exons or genes by their length or GC content and show the accuracy (eg. exon sensitivity) for each category.
(ii) eval can also compare multiple gene prediction sets for a genome, for example, by building clusters of genes or exons that share some property (eg. are identical, or overlap), and then presenting the result as Venn diagrams.
(iii) eval can also be used to select sets of genes that match another set by certain criteria, such as exact match, genomic overlap, one or more introns match, one or more exons match, start codon match, etc.

The eval software and documentation are available for download from the Brent lab webpage. The current version is 2.2.8 (as of September 2013). The description of the command-line interface for eval starts on page 27 (section 2.4) of the documentation pdf.

Converting your gff file to the correct format for eval
eval requires that the input file is in gtf format. To convert a gff file of your gene predictions (which has the sequences of the scaffolds at the end of the gff file, after a line saying '##FASTA') into an eval-friendly gtf file, you type:
% /software/pathogen/external/apps/usr/local/maker-2.28/bin/maker2eval_gtf my.gff > my.gtf
where my.gff is your gff file of gene predictions (which must have your fasta-format sequences at the end, after a line saying '##FASTA' - don't forget this!), and my.gtf is the eval-friendly gtf file.
maker2eval_gtf is a script that comes with the Maker gene prediction software, for converting Maker output gffs to eval-friendly gtf files. It might also work for other gff files of gene predictions (I haven't checked yet).
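Since the conversion depends on the '##FASTA' section being present, it can be worth checking for it before converting. Here is a minimal sketch of such a check; the function name has_fasta_section is my own invention and is not part of Maker or eval, and it only checks that a '##FASTA' line is followed by at least one fasta header line.

```shell
# has_fasta_section FILE
# Hypothetical helper (not part of Maker or eval): succeeds if FILE has a
# '##FASTA' line followed somewhere by at least one '>' fasta header line.
has_fasta_section() {
    awk '
        /^##FASTA/       { in_fasta = 1; next }   # start of the sequence section
        in_fasta && /^>/ { found = 1; exit }      # first fasta header after it
        END              { exit(found ? 0 : 1) }  # exit status 0 means "present"
    ' "$1"
}
```

For example: has_fasta_section my.gff || echo "my.gff has no ##FASTA section" >&2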

Validating your gtf file:
You can check whether your gtf file is valid by using the script:
% perl -I ~tk6/softwares/eval-2.2.8 ~tk6/softwares/eval-2.2.8/ my.gtf > my.gtf_validation
This picks up problems with the file, such as features whose start position is after their stop position, CDS features that are after the stop codon in a transcript, transcripts with inconsistent phase values for the different CDS features, and so on.
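To illustrate the simplest of these checks (a start position after the stop position), here is a rough sketch using awk. It is not the eval validation script and covers only this one problem; the function name list_bad_coords is made up for the example, and it assumes a tab-separated gtf with the start in column 4 and the end in column 5.

```shell
# list_bad_coords FILE
# Hypothetical helper (not part of eval): prints every gtf feature line
# whose start (column 4) is greater than its end (column 5).
list_bad_coords() {
    # Skip comment lines and lines with fewer than 9 tab-separated fields;
    # "+ 0" forces a numeric rather than a string comparison.
    awk -F'\t' '!/^#/ && NF >= 9 && ($4 + 0) > ($5 + 0)' "$1"
}
```

For example: list_bad_coords my.gtf > my.gtf_bad_coords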

Summary statistics:
Then, run eval:
% perl -I ~tk6/softwares/eval-2.2.8 ~tk6/softwares/eval-2.2.8/ my.gtf > stats
where stats is the eval output.
[Note: 'perl -I' adds an extra directory to the Perl module search path when invoking the Perl interpreter.]

The eval output looks like this:
        Count                   81.00
        Total Transcripts       81.00
        Transcripts Per
        Count                   81.00
        Average Length          1802.38
        Median Length           1349.00
        Total Length            145993.00
        Average Coding Length   1538.30
        Median Coding Length    1146.00
        Total Coding Length     124602.00
        Average Score           0.00
        Total Score             0.00
        Ave Exons Per           2.75
        Med Exons Per           2.00
        Total Exons             223.00

There are lots more statistics on partial genes, single-exon genes, exons (subdivided into initial, internal, terminal, single exon genes), etc.

Accuracy statistics:
If you have an independent set of curated genes for your species (which you think are 100% or nearly 100% correct), you can also use eval to give you specificity and sensitivity statistics for your gene prediction set.

For example, say your training set of gene predictions is in a file called training.gff, which has the sequences of the scaffolds at the end, after a line saying '##FASTA', and your gene prediction set is in a file called predictions.gff (also with the sequences of the scaffolds at its end). You first make gtfs for eval:
% /software/pathogen/external/apps/usr/local/maker-2.28/bin/maker2eval_gtf training.gff > training.gtf 
% /software/pathogen/external/apps/usr/local/maker-2.28/bin/maker2eval_gtf predictions.gff > predictions.gtf

Then launch eval: [note: this requires the module, which isn't on farm3 yet]
% perl -I ~tk6/softwares/eval-2.2.8 ~tk6/softwares/eval-2.2.8/ training.gtf predictions.gtf
 [Note to self: I need to do 'ssh -Y pcs4' to run this, as eval brings up a GUI.]

 This should bring up a GUI. You need to select 'training.gtf' as the annotation set in the top pane, and 'predictions.gtf' as the prediction set in the bottom pane. You can then press the button 'Run eval', and should get an output like this:

This tells us that the sensitivity is 98.8% on the nucleotide level, 59.0% on the exon level, and 45.7% on the transcript (mRNA) level.

Instead of using the GUI version of eval, you can use the command-line script:
% perl -I ~tk6/softwares/eval-2.2.8 ~tk6/softwares/eval-2.2.8/ training.gtf predictions.gtf
This seems to need a lot of memory when I submit it to the compute farm; I requested 2000 Mbyte.

Note [6-Oct-2015]: My colleague Diogo Ribeiro has noticed that there is a problem when Eval uses a reference gene set to calculate the sensitivity/specificity of a new gene set. Eval does not take into account the scaffold/contig/chromosome on which a feature lies, so it matches CDSs/exons/transcripts/nucleotides that have the same coordinates even though they are on different scaffolds. Thanks Diogo for pointing this out! [Note to self: Diogo has a script that will calculate these statistics accurately.]
[16-Oct-2015: see further comment below.]
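To make the scaffold issue concrete, here is a minimal sketch of an exon-level comparison that includes the scaffold name in the match. This is not Diogo's script and not how Eval itself works; the function name exon_accuracy is made up, and I am assuming that each CDS feature line counts as one exon, that two exons match only if scaffold, start, end and strand are all identical, and that specificity means the fraction of predicted exons found in the reference (the usual gene-prediction usage).

```shell
# exon_accuracy REF.gtf PRED.gtf
# Hypothetical helper: prints exon-level sensitivity and specificity (in %),
# matching CDS features on scaffold + start + end + strand.
exon_accuracy() {
    awk -F'\t' '
        $3 != "CDS" { next }                          # also skips comment lines
        { key = $1 SUBSEP $4 SUBSEP $5 SUBSEP $7 }    # scaffold is part of the key
        FILENAME == ARGV[1] { ref[key] = 1; next }    # first file: reference set
        { pred[key] = 1 }                             # second file: predictions
        END {
            for (k in ref)  { nref++; if (k in pred) shared++ }
            for (k in pred) npred++
            # sensitivity: shared / reference exons; specificity: shared / predicted exons
            printf "exon sensitivity: %.1f%%\n", (nref  ? 100 * shared / nref  : 0)
            printf "exon specificity: %.1f%%\n", (npred ? 100 * shared / npred : 0)
        }
    ' "$1" "$2"
}
```

For example: exon_accuracy training.gtf predictions.gtf
Leaving $1 out of the key would reproduce the problem described above, since exons with the same coordinates on different scaffolds would then be counted as matches.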

Filtering gtf files:  
This takes a filter file, a gtf file for your annotation, a gtf file for your predictions, and then will filter the predictions according to the filter file. The eval documentation gives more information.

Making graphs:
This takes a graph file, an annotation gtf, a prediction gtf file, and makes some graphs. The eval documentation describes this further.

Finding overlaps between gene sets: 
This takes one or more sets of gtf files, and builds overlap clusters from them. See the eval documentation for more details.

Getting distributions for variables:
This will make histograms of different variables for you. See the eval documentation.

Thanks to Eleanor Stanley for showing me how to use eval.

Comment 16-Oct-2015: 
I've just looked again at the Eval documentation, and noticed it says:
"Although the GTF specification does not state that all genes in a gtf file must be from the same sequence or in the same coordinate system, this is a requirement for using the Eval software. Any GTF file used by any of the programs or libraries described below must contain annotation of a single sequence with all genes in the same coordinate system (that of the sequence they annotate)."

That is, Eval can only work on a gtf file that has just one scaffold. So Eval might not actually have a bug; we just didn't realise we had to run it this way. (I imagine lots of people make that mistake!)

For Eval's accuracy calculation for a gene set, I think this means that we would need to make separate 'reference' and 'prediction' gtf files for each scaffold, run Eval on each scaffold separately, and then write our own script to integrate the results.
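The splitting step could be sketched roughly as follows. This is an untested-on-real-data sketch, not something from the Eval distribution: the function name split_gtf_by_scaffold and the output naming scheme are my own, and it assumes a tab-separated gtf whose first column is the scaffold name, with no '/' characters in scaffold names.

```shell
# split_gtf_by_scaffold IN.gtf OUTDIR PREFIX
# Hypothetical helper: writes OUTDIR/PREFIX.<scaffold>.gtf for every
# scaffold named in column 1 of IN.gtf.
split_gtf_by_scaffold() {
    mkdir -p "$2"
    awk -F'\t' -v dir="$2" -v prefix="$3" '
        /^#/ || $1 == "" { next }                 # skip comments and blank lines
        {
            out = dir "/" prefix "." $1 ".gtf"
            if (out in seen) { print >> out }     # later lines: append
            else { print > out; seen[out] = 1 }   # first line: overwrite old file
            close(out)                            # avoid too many open files
        }
    ' "$1"
}
```

For example:
split_gtf_by_scaffold training.gtf split training
split_gtf_by_scaffold predictions.gtf split predictions
This would give files such as split/training.<scaffold>.gtf and split/predictions.<scaffold>.gtf, which could then be given to Eval one scaffold at a time. Scaffolds that have genes in only one of the two sets will have only one of the two files, so the integrating script would need to handle that case.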

For Eval's summary statistics (such as the number of exons, or the length of intergenic DNA), it might not make such a big difference, but I remember that Eval did seem to give some strange numbers for things such as the amount of intergenic DNA, and this might mean that we would need to use a separate gtf file for each scaffold.
