Monday, 5 November 2012

Training GeneWise for your species

Ewan Birney's GeneWise is a very nice gene-finding program that can find genes in a newly sequenced genome by comparing it to proteins from other species, or to HMMs of gene families from other species.

There are a couple of different ways that GeneWise can deal with introns. If you use the command:
% genewise -splice_gtag .... [other options]
then GeneWise will assume that the introns in your species all start with 'GT' and end with 'AG'. This is generally true, but a small proportion of introns in each species start and end with other sequences.
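
To get a feel for how many introns the -splice_gtag option would miss in your species, you could tally how many known introns really do start with 'GT' and end with 'AG'. Here is a minimal Python sketch of that check. It is not part of GeneWise, and the function names and toy intron sequences are made up for illustration:

```python
def is_canonical(intron_seq):
    """Return True if the intron starts with GT and ends with AG (any case)."""
    s = intron_seq.upper()
    return len(s) >= 4 and s.startswith("GT") and s.endswith("AG")


def canonical_fraction(introns):
    """Fraction of the given intron sequences that are GT...AG."""
    introns = list(introns)
    if not introns:
        return 0.0
    return sum(is_canonical(s) for s in introns) / len(introns)


# Toy intron sequences (invented, not real data).
examples = ["GTAAGTTTTCAG", "GCAAGTTTTCAG", "gtaagattttag", "ATATCCTTTTAC"]
print(canonical_fraction(examples))  # 0.5
```

With real data, the intron sequences would come from your training genes rather than a hard-coded list.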

Alternatively, you can use an intron splice site file, which is a parameter file that tells GeneWise what sequences to expect near the splice site of introns (including several bases upstream and downstream of the start and end of the intron).
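
As a rough illustration of the kind of information such a file summarises, the Python sketch below counts which bases occur at each position in a window around the start of an intron. This is only a sketch of the idea, not the format of the GeneWise parameter file. The window size (3 exon bases followed by 6 intron bases) and the example sequences are assumptions made up for this example:

```python
from collections import Counter


def position_counts(windows):
    """Count the bases seen at each position across equal-length windows."""
    windows = [w.upper() for w in windows]
    if not windows:
        return []
    length = len(windows[0])
    if any(len(w) != length for w in windows):
        raise ValueError("all windows must have the same length")
    return [Counter(w[i] for w in windows) for i in range(length)]


# Toy windows around the start of an intron: 3 exon bases + 6 intron bases.
# These sequences are invented for illustration.
donor_windows = ["AAGGTAAGT", "CAGGTGAGT", "AAGGTAAGA"]
counts = position_counts(donor_windows)
for pos, c in enumerate(counts):
    print(pos, dict(c))
```

In this toy example, positions 3 and 4 (the first two intron bases) are always 'G' and 'T', while the neighbouring positions vary.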

GeneWise comes with an intron splice site file for human (file "gene.stat"), which is the default parameter file used by GeneWise, if you just type:
% genewise ... [other options]

However, if you have training data, it is often worthwhile to train GeneWise for your species. To do this, you first need a set of training genes that you know are correctly predicted (e.g. genes confirmed by full-length mRNAs) or are probably correctly predicted (e.g. genes predicted by Ian Korf's CEGMA software).

If you have a GFF file of the coding exons in your training genes, you can then use my Perl script to make a splice site parameter file for GeneWise for your species.
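
To go from a GFF file of coding exons to splice sites, you first need to work out where the introns are: they are the gaps between consecutive coding exons of the same gene. Here is a minimal Python sketch of that one step. It is not the perl script itself. It assumes a tab-separated GFF with 9 columns, a feature type of 'CDS', and a ninth column that is identical for all exons of one gene, which may not match your file:

```python
from collections import defaultdict


def introns_from_gff(lines, feature="CDS"):
    """Return (seqid, start, end, strand, group) for each intron.

    Coordinates are 1-based and inclusive, as in GFF. Exons are grouped by
    sequence name, strand and the whole of column 9 (an assumption).
    """
    exons = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != feature:
            continue
        key = (cols[0], cols[6], cols[8])
        exons[key].append((int(cols[3]), int(cols[4])))

    introns = []
    for (seqid, strand, group), spans in exons.items():
        spans.sort()
        # An intron is the gap between one exon's end and the next exon's start.
        for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
            if next_start > prev_end + 1:
                introns.append((seqid, prev_end + 1, next_start - 1, strand, group))
    return introns


# Toy GFF lines (invented coordinates and names).
gff = [
    "scaffold1\ttoy\tCDS\t101\t200\t.\t+\t0\tgeneA",
    "scaffold1\ttoy\tCDS\t301\t400\t.\t+\t2\tgeneA",
    "scaffold1\ttoy\tCDS\t451\t500\t.\t+\t0\tgeneA",
]
for intron in introns_from_gff(gff):
    print(intron)
```

For genes on the '-' strand, the coordinates are the same, but the intron sequence has to be reverse-complemented before looking at its splice sites.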

You can then run GeneWise with your parameter file by typing:
% genewise -genestats <paramfile> -nosplice_gtag ... [other options]
where <paramfile> is your parameter file made using the perl script mentioned above.

Note that the GeneWise parameter file made by the perl script only has the splice site information trained from your training data; the other parameters in the file are copied from the human splice site file ("gene.stat") that comes with GeneWise.
