Monday 21 January 2013

Finding the highest-scoring non-overlapping gene predictions

If you have used GeneWise to make gene predictions for a species using different HMMs as input (for example, by using my Perl script run_genewise_after_blast.pl), GeneWise may make several overlapping gene predictions in the same region of a chromosome/scaffold.

In this case, we are probably only interested in the most convincing gene predictions, i.e. those that have been assigned the highest scores by GeneWise.

I've just written a script, find_best_genewise_genes.pl, that takes a GFF file of overlapping gene predictions (all with scores from the same source, e.g. GeneWise) and writes an output GFF file containing just the set of highest-scoring non-overlapping gene predictions.

To do this, my script uses a greedy algorithm:
(i) it sorts the genes on each scaffold in order of decreasing score
(ii) it takes the highest-scoring gene on the scaffold, keeps it, and removes all genes that overlap that gene
(iii) it takes the next highest-scoring gene that is still left on the scaffold, keeps it, and removes all genes that overlap that gene
and so on, until every gene on the scaffold has either been kept or removed.
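
The steps above can be sketched in a few lines of Python. This is only an illustration of the greedy idea, not the code from find_best_genewise_genes.pl: the Gene record, the function names and the example coordinates are all made up for this sketch, and I am assuming that two genes overlap if they share at least one base on the same scaffold, regardless of strand.

```python
from collections import defaultdict
from typing import NamedTuple


class Gene(NamedTuple):
    # Hypothetical record for this sketch; coordinates are 1-based and
    # inclusive, as in a GFF file.
    scaffold: str
    start: int
    end: int
    score: float
    name: str


def overlaps(a, b):
    # Assumption: genes overlap if they share at least one base on the
    # same scaffold (strand is ignored).
    return a.scaffold == b.scaffold and a.start <= b.end and b.start <= a.end


def best_nonoverlapping(genes):
    # Group the predictions by scaffold.
    by_scaffold = defaultdict(list)
    for gene in genes:
        by_scaffold[gene.scaffold].append(gene)

    kept = []
    for scaffold in sorted(by_scaffold):
        chosen = []
        # Visit genes from highest to lowest score. Skipping a gene that
        # overlaps an already chosen gene has the same effect as removing
        # it when that higher-scoring gene was picked.
        ranked = sorted(by_scaffold[scaffold], key=lambda g: g.score, reverse=True)
        for gene in ranked:
            if not any(overlaps(gene, c) for c in chosen):
                chosen.append(gene)
        # Report the survivors in order along the scaffold.
        kept.extend(sorted(chosen, key=lambda g: g.start))
    return kept


# Made-up example data, just to exercise the function.
example = [
    Gene("scaff1", 100, 500, 80.0, "g1"),
    Gene("scaff1", 400, 900, 120.0, "g2"),
    Gene("scaff1", 950, 1200, 60.0, "g3"),
    Gene("scaff1", 850, 1000, 70.0, "g4"),
    Gene("scaff2", 10, 200, 30.0, "g5"),
]

for gene in best_nonoverlapping(example):
    print(gene.scaffold, gene.start, gene.end, gene.score, gene.name)
```

In this toy example g2 is kept first, which knocks out g1 and g4, and g3 and g5 survive. Note that a greedy choice like this favours individual high-scoring genes; it is not guaranteed to give the set of non-overlapping genes with the largest summed score.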

This script could be used to thin out a GFF file that contains many overlapping gene predictions, and could be used for a GFF file of gene predictions from any source (not just GeneWise), as long as the gene scores in the file are all on a comparable scale.
