Monday 22 April 2013

Training the Augustus gene-finding software

I am currently learning how to train the Augustus gene-finding software developed by Mario Stanke. There is a nice tutorial on training Augustus here.

Why train Augustus? 
Augustus has already been trained for many different species, which are listed in the Augustus README.TXT file, eg. human, Drosophila melanogaster, Caenorhabditis elegans, etc. To see the list of species that Augustus has already been trained for, you can type:
% augustus --species=help

To run Augustus on a new species that it has not been trained for before, it is a good idea to train it first on a training set for that species, because Augustus uses parameters that are species-specific.

These include the Markov chain transition probabilities for coding and non-coding (intron or intergenic) regions. These are stored in the Augustus 'config' directory, in the files <species>_exon_probs.pbl, <species>_intron_probs.pbl, and <species>_igenic_probs.pbl, where <species> is the name of your species.

For each species there are also 'meta parameters' like the order of the Markov chain, or the size of the window used for the splice site models. These 'meta parameters' are stored in a file called <species>_parameters.cfg, where <species> is the name of your species.

In summary, Augustus trains features such as the intron and exon length distributions, splice site patterns, translation start codon patterns, branch point regions of introns, etc.

Preparing a training set to use for training Augustus
To train Augustus, you need to provide Augustus with a training set of gene models that you know to be 100% correct, or think are likely to be nearly 100% correct (based on other evidence).

In the retraining.html file that comes with Augustus, it is recommended to use a training set of at least ~200 gene predictions. It is also recommended that the number of multi-exon genes should be relatively large (in order to train introns); and that it is important that all the start codons are 100% correct, but less important to be confident that all the stop codons are 100% correct.

To create a training set for Augustus, I made an initial set of gene predictions by doing the following:
(i) I transferred curated genes from a closely related species to the assembly of my species of interest, using the RATT software;
(ii) I made gene predictions in my species using the exonerate software, based on alignments of ESTs for my species of interest,
(iii) I predicted conserved genes in the assembly of my species using the CEGMA software.

I then gave this initial set of gene predictions (as embl format files) to expert genome analysts at Sanger (Karen Brooks, Helen Beasley, and Alan Tracey), who manually curated (edited) ~200 gene predictions for my species in the Artemis software, using additional evidence from splice sites, BLAST matches, multiple alignments, and mapped RNA-seq data.

The genome analysts found that the CEGMA predictions were most useful as a source of initial gene predictions, which they then manually curated (edited). They saved the curations in embl format files.

This resulted in a high-confidence training set of genes, that could be used for training Augustus. It needs to be converted into a genbank-format file, to train Augustus (see below).

Preparing the genbank-format file for your training set
As mentioned above, the Sanger genome analysts manually curated a set of gene predictions for me to use as a training set. They gave them to me in embl format files. I converted these embl format files to gff format files, using my Perl script embl_to_gff.pl

Augustus has some requirements regarding the training set, which I then had to check:
(i) the training set genes must not overlap each other,
(ii) the training set genes should be non-redundant,
(iii) only one transcript per gene is allowed.

To check that the training set genes do not overlap each other (criterion (i) above), I took the gff-format files that I had made with my embl_to_gff.pl Perl script, and checked whether any of the genes in the gff file overlap, using the Bedtools software.
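The overlap check can also be sketched without Bedtools, using sort and awk. This is only an illustration, not the Bedtools command I used: the function name find_overlapping_genes is made up, and it assumes a tab-separated gff file with 'gene' features in column 3, and it ignores strand (so genes on opposite strands that overlap are reported too).

```shell
# Hypothetical helper: report genes that overlap an earlier gene on the same scaffold.
# Assumes a tab-separated gff file: scaffold in column 1, feature type in column 3,
# start in column 4, end in column 5, attributes (gene name) in column 9.
find_overlapping_genes() {
    gff="$1"
    awk -F'\t' '$3 == "gene" { print $1 "\t" $4 "\t" $5 "\t" $9 }' "$gff" |
    sort -t "$(printf '\t')" -k1,1 -k2,2n |
    awk -F'\t' '
        # First gene seen on a new scaffold: nothing to compare it with yet.
        $1 != scaffold { scaffold = $1; max_end = $3 + 0; max_id = $4; next }
        # A gene starting at or before the furthest end seen so far overlaps
        # the earlier gene that reaches that end.
        $2 + 0 <= max_end { print scaffold "\t" max_id "\t" $4 }
        $3 + 0 > max_end  { max_end = $3 + 0; max_id = $4 }
    '
}
```

Running find_overlapping_genes on a gff file prints one line per overlap found (scaffold, earlier gene, overlapping gene), and prints nothing if no genes overlap.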

To check that the training set genes are non-redundant (criterion (ii) above), I used my script get_spliced_transcripts_from_gff.pl to infer the spliced DNA sequences of transcripts from each gene. I then used my translate_spliced_dna.pl script to infer the amino acid sequences of the transcripts, from the DNA sequences. I then used my script calc_pc_id_between_seqs.pl to calculate the percent identity between each pair of protein sequences, based on a global pairwise alignment generated using ggsearch in the FASTA package by Bill Pearson. In the retraining.html file that comes with Augustus, it is recommended that the proteins in the training set should be less than 70% identical. In my case, I found that none of the proteins had percent identities of 70% or higher.
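As a sketch of the last step, here is one way to pull out the protein pairs that reach the 70% threshold from a table of pairwise percent identities. The table format is an assumption made for illustration (three tab-separated columns: first protein, second protein, percent identity); it is not necessarily what calc_pc_id_between_seqs.pl writes, and the function name is made up.

```shell
# Hypothetical helper: list protein pairs whose percent identity is at or above a threshold.
# Assumes a tab-separated table: protein 1, protein 2, percent identity.
# The threshold defaults to 70 (the value recommended in retraining.html).
list_redundant_pairs() {
    table="$1"
    threshold="${2:-70}"
    awk -F'\t' -v threshold="$threshold" \
        '$3 + 0 >= threshold + 0 { print $1 "\t" $2 "\t" $3 }' "$table"
}
```

If this prints nothing, no pair of proteins reaches the threshold, which is what I found for my training set.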

Augustus also requires just one transcript per gene (criterion (iii) above). In my case, the training set just had one transcript per gene, so that was fine.

Augustus expects that the region outside the training genes is intergenic DNA. Therefore, if you have just a few training genes in a whole genome assembly, you should just cut out about 1000 bp on either side of each training gene to give to Augustus. I can do this with my script get_flanking_regions_of_genes.pl.
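To illustrate the idea, the coordinates of each gene plus its flanking regions can be worked out from the gff file like this. This is a simplified sketch and not my get_flanking_regions_of_genes.pl script: the function name is made up, it assumes 'gene' features in column 3 of a tab-separated gff file, and it only stops the region running off the start of the scaffold (it does not know the scaffold lengths, so the end is not clipped).

```shell
# Hypothetical helper: print scaffold, start and end of each gene region including flanks.
# Assumes a tab-separated gff file with 'gene' in column 3, start in column 4, end in column 5.
# The flank size defaults to 1000 bp.
gene_regions_with_flanks() {
    gff="$1"
    flank="${2:-1000}"
    awk -F'\t' -v flank="$flank" '
        $3 == "gene" {
            start = $4 - flank
            if (start < 1) start = 1      # do not run off the start of the scaffold
            print $1 "\t" start "\t" ($5 + flank)
        }
    ' "$gff"
}
```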

Augustus needs to have the training set in a genbank format file. I converted the gff-format files that I had made (from my embl format files) to genbank format, using my Perl script gff_to_genbank.pl. This makes one genbank file per scaffold, and you can concatenate them together to make one genbank file for all scaffolds (this is what Augustus needs).
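The concatenation step is just 'cat', but it is easy to add a quick sanity check at the same time. The sketch below assumes one record (one 'LOCUS' line) per input file, as described above; the function name is made up for illustration.

```shell
# Hypothetical helper: concatenate per-scaffold genbank files into one file,
# and check that the number of LOCUS records equals the number of input files.
# Usage: concat_genbank all_scaffolds.gb scaffold1.gb scaffold2.gb ...
concat_genbank() {
    out="$1"
    shift
    cat "$@" > "$out"
    n_files=$#
    n_records=$(grep -c '^LOCUS' "$out" || true)
    if [ "$n_records" -ne "$n_files" ]; then
        echo "warning: $n_files input files but $n_records LOCUS records" >&2
        return 1
    fi
    echo "$n_records"
}
```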

Dividing your training set into training set and test set
If you want to get an idea of the accuracy of Augustus after you have trained it (see 'Calculating Augustus's prediction accuracy' below), you will need to divide your GenBank-format training set into a training set and a test set, eg. PTRK_training.gb and PTRK_test.gb. The script 'randomSplit.pl' that comes with Augustus in the scripts directory does this for you. You still need to have at least ~200 genes in your training set (PTRK_training.gb), so your training set might not be big enough to do this.
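To show what such a split involves, here is a rough stand-in written with awk. It is not randomSplit.pl and makes no claim to work the same way: the function name and the '.test.sketch' and '.train.sketch' output names are made up, and it assumes that every GenBank record ends with a line starting with '//'.

```shell
# Hypothetical stand-in for a random split: writes n_test randomly chosen records
# to <file>.test.sketch and all the other records to <file>.train.sketch.
# Assumes every GenBank record ends with a line starting with '//'.
split_genbank_records() {
    gb="$1"
    n_test="$2"
    seed="${3:-1}"
    total=$(grep -c '^//' "$gb" || true)
    awk -v total="$total" -v n_test="$n_test" -v seed="$seed" \
        -v test_out="$gb.test.sketch" -v train_out="$gb.train.sketch" '
        BEGIN {
            srand(seed)
            new_record = 1
            # Create both output files even if one ends up empty
            printf "" > test_out
            printf "" > train_out
        }
        new_record {
            # Selection sampling: pick this record with probability
            # (records still needed) / (records not yet seen)
            if (rand() * (total - seen) < (n_test - chosen)) {
                out = test_out; chosen++
            } else {
                out = train_out
            }
            seen++
            new_record = 0
        }
        { print > out }
        /^\/\// { new_record = 1 }
    ' "$gb"
}
```

For example, 'split_genbank_records PTRK_all.gb 20' (PTRK_all.gb being a made-up file name) would put 20 records in the test file and the rest in the training file, as long as the input has at least 20 records.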

Downloading Augustus
To train Augustus, you will first need to download the Augustus software from the Augustus downloads page. In my case, I was using Augustus 2.6.1. When you do the training, Augustus will write some files in the directory where you have installed it, so you will need to have write access to that directory.

Testing whether your genbank-format training file can be read by Augustus, using etraining
My colleague Magdalena Zarowiecki suggested that it is a good idea to first check whether Augustus can read your genbank-format training file.

To do this, in the directory where you have installed Augustus, in the subdirectory config/species make a new subdirectory 'Test1' (config/species/Test1). Then copy all the files from config/species/generic/ here, eg. generic_exon_probs.pbl, etc. Then rename the copies as Test1_..., eg. Test1_exon_probs.pbl, etc. Then edit the file Test1_parameters.cfg so it refers to the Test1 files instead of to the 'generic' files.

Now you can try running the Augustus training program, called 'etraining', on your genbank-format training file, to check if it can read it ok:
% setenv AUGUSTUS_CONFIG_PATH /nfs/users/nfs_a/alc/Documents/bin/augustus/config/
This sets the path to the 'config' directory in the Augustus installation.

Now run etraining on your genbank-format training file (eg. PTRK_training.gb):
% /nfs/users/nfs_a/alc/Documents/bin/augustus/bin/etraining --species=Test1 PTRK_training.gb

My training file PTRK_training.gb contained 90 different genes on 8 scaffolds. 
The output from etraining looks like this:
# Read in 8 genbank sequences.
...
Frequency of stop codons:
tag:    6 (0.0667)
taa:   77 (0.856)
tga:    7 (0.0778)
end *EXON*
Storing parameters to file...
Writing exon model parameters [1] to file /nfs/users/nfs_a/alc/Documents/bin/augustus/config/species/Test1/Test1_exon_probs.pbl.


Here etraining has told me that it finds 6 genes ending in a 'TAG' stop codon, 77 ending in a 'TAA' stop codon, and 7 ending in a 'TGA' stop codon. 6+77+7=90, so etraining is counting the expected number of genes (90 genes). It looks like etraining read in my training file fine.
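This check can be done automatically if you save the etraining output to a file. The sketch below assumes the stop codon lines look exactly like the ones shown above ('tag:', 'taa:' or 'tga:' followed by a count), and that no other lines in the output start that way; the function name is made up.

```shell
# Hypothetical helper: add up the stop codon counts in saved etraining output.
# Assumes lines of the form 'tag:    6 (0.0667)', as in the output shown above.
sum_stop_codons() {
    awk '$1 ~ /^(tag|taa|tga):$/ { total += $2 } END { print total + 0 }' "$1"
}
```

The number printed should equal the number of genes in your training file (90 in my case).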

Creating parameters files for your species 
In the directory where you installed Augustus, in the subdirectory config/species, you will find subdirectories for different species (eg. 'elegans', 'brugia', etc.), and a directory called 'generic'. In the directory 'generic', you will see 7 files:
generic_exon_probs.pbl  
generic_igenic_probs.pbl  
generic_intron_probs.pbl  
generic_metapars.cfg  
generic_metapars.utr.cfg  
generic_parameters.cfg  
generic_weightmatrix.txt
These are the files with the generic parameters. To make parameter files for your species, you should make a new subdirectory 'myspecies' in the directory where you installed Augustus (config/species/myspecies), eg. myspecies='ParastrongyloidesTrichosuri'. Then copy all the files from config/species/generic/ here, eg. generic_exon_probs.pbl, etc. Then rename the copies as myspecies_..., eg. myspecies_exon_probs.pbl, etc. Then edit the file myspecies_parameters.cfg so it refers to the 'myspecies' files instead of to the 'generic' files, eg. edit it so that it points to myspecies_intron_probs.pbl instead of to generic_intron_probs.pbl.
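The copy, rename and edit steps can be sketched as a small shell function. This is an illustration under some assumptions: the function name is made up, the species name is assumed to contain only letters, digits and underscores, and it assumes that generic_parameters.cfg refers to the other files by names starting with 'generic_'. Check the edited myspecies_parameters.cfg by eye afterwards.

```shell
# Hypothetical helper: make config/species/<species> from the 'generic' parameter files.
# Usage: make_species_dir /path/to/augustus/config myspecies
make_species_dir() {
    config_dir="$1"
    species="$2"
    src="$config_dir/species/generic"
    dest="$config_dir/species/$species"
    [ -d "$src" ] || { echo "cannot find $src" >&2; return 1; }
    mkdir -p "$dest" || return 1
    # Copy each generic_* file, renaming it to <species>_*
    for f in "$src"/generic_*; do
        [ -e "$f" ] || continue
        name=$(basename "$f")
        cp "$f" "$dest/${species}_${name#generic_}"
    done
    # In the parameters file, point to the <species>_* files instead of the generic_* files
    if [ -f "$src/generic_parameters.cfg" ]; then
        sed "s/generic_/${species}_/g" "$src/generic_parameters.cfg" \
            > "$dest/${species}_parameters.cfg"
    fi
}
```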

Optimise the parameters in your myspecies_parameters.cfg file, using optimize_augustus.pl
In Augustus, the parameters like the size of the window of the splice site models, and the order of the Markov model, are called 'meta parameters'. These parameters are stored in the myspecies_parameters.cfg file that you made just above.

To train Augustus for a new species, you need to optimise the values in the myspecies_parameters.cfg file. You can do this using the optimize_augustus.pl script that comes with Augustus. First you need to tell Augustus where the directory is with your parameter files:
% setenv AUGUSTUS_CONFIG_PATH /nfs/users/nfs_a/alc/Documents/bin/augustus/config/
Then run optimize_augustus.pl. You can specify the number of rounds of optimisation to do using the --rounds option (eg. --rounds=7). By default, it does 5 rounds of optimisation:
% /nfs/users/nfs_a/alc/Documents/bin/augustus/scripts/optimize_augustus.pl --species=myspecies  --metapars=/nfs/users/nfs_a/alc/Documents/bin/augustus/config/species/myspecies/myspecies_metapars.cfg --aug_exec_dir=/nfs/users/nfs_a/alc/Documents/bin/augustus/bin/ PTRK_training.gb
where PTRK_training.gb is my genbank-format training file,
--metapars gives the name of the metaparameters config file,
--aug_exec_dir gives the directory with the augustus and etraining executables.

On the Sanger compute farm, this needs to be submitted to the 'basement' queue (as it can take >48 hours), with about 1000 Mbyte of RAM.

It will say something like:
Splitting training file into 8 buckets...
Reading in the meta parameters used for optimization from /nfs/users/nfs_a/alc/Documents/bin/augustus/config/species/Test1/Test1_metapars.cfg...
Reading in the starting meta parameters from /nfs/users/nfs_a/alc/Documents/bin/augustus/config/species/Test1/Test1_parameters.cfg...

bucket 1 2 3 4 5 6 7 8
...
...
Making final training with the optimized parameters.
This took about 1 day and 2 hours to run one round of training, using a training set of 90 genes.
This made files myspecies_parameters.cfg.orig1, myspecies_parameters.cfg.orig2, myspecies_parameters.cfg.orig3 , myspecies_parameters.cfg.orig4, myspecies_parameters.cfg.orig5 in the AUGUSTUS_CONFIG_PATH directory. The final parameters are put into myspecies_parameters.cfg.

Training Augustus with the parameters in your myspecies_parameters.cfg file, using etraining
After running optimize_augustus.pl, you have to train Augustus with the values of the metaparameters in your myspecies_parameters.cfg file. 

You can do this using the 'etraining' program:
% setenv AUGUSTUS_CONFIG_PATH /nfs/users/nfs_a/alc/Documents/bin/augustus/config/ 
% /nfs/users/nfs_a/alc/Documents/bin/augustus/bin/etraining --species=myspecies PTRK_training.gb
where PTRK_training.gb is the GenBank-format training set.

This takes only a second or so to run. It writes exon, intergenic, and intronic model parameters to the files myspecies_exon_probs.pbl, myspecies_igenic_probs.pbl, and myspecies_intron_probs.pbl in the directory config/species/myspecies that contains your parameter files. For example, myspecies_exon_probs.pbl has the probabilities of different lengths for exons of different types (eg. exons of single-exon genes, initial exons, internal exons, terminal exons).

Calculating Augustus's prediction accuracy
Once you have trained Augustus using optimize_augustus.pl and etraining, if you have a test set that is separate from your training set, you can now check the prediction accuracy of your trained version of Augustus on the test set:
% setenv AUGUSTUS_CONFIG_PATH /nfs/users/nfs_a/alc/Documents/bin/augustus/config/ 
% /nfs/users/nfs_a/alc/Documents/bin/augustus/bin/augustus --species=myspecies PTRK_test.gb
where /nfs/users/nfs_a/alc/Documents/bin/augustus/bin/augustus is the path to the version of Augustus that you installed,
PTRK_test.gb is the GenBank-format file of test set sequences.

If you don't have a separate test set from your training set, you can try calculating the prediction accuracy using your training set. However, this will overestimate the prediction accuracy of Augustus for your species, since you have trained Augustus on the training set (so it should work well on those genes):
% setenv AUGUSTUS_CONFIG_PATH /nfs/users/nfs_a/alc/Documents/bin/augustus/config/ 
% /nfs/users/nfs_a/alc/Documents/bin/augustus/bin/augustus --species=myspecies PTRK_training.gb
The end of the output will then contain a summary of the accuracy of the prediction, eg.
nucleotide level sensitivity: 0.997
nucleotide level specificity: 0.0407
exon level sensitivity: 0.918
exon level specificity: 0.0439
gene level sensitivity: 0.8
gene level specificity: 0.0415 

The Augustus retraining.html says that if the gene level sensitivity is below 20%, it is likely that the training set is not large enough, that it is not of good quality, or that the species is somehow 'special'.
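If you save the accuracy summary to a file, the 20% check can be sketched like this. It assumes the summary has a line of the form 'gene level sensitivity: 0.8', as in the example above (the raw Augustus output may be laid out differently), and the function name is made up.

```shell
# Hypothetical helper: print the gene level sensitivity from a saved accuracy summary,
# and return a non-zero status if it is missing or below a minimum (default 0.2, ie. 20%).
# Assumes a line of the form 'gene level sensitivity: 0.8'.
check_gene_sensitivity() {
    summary="$1"
    min="${2:-0.2}"
    awk -F': *' -v min="$min" '
        $1 == "gene level sensitivity" {
            found = 1
            print $2 + 0
            if ($2 + 0 < min + 0) low = 1
        }
        END { if (found && !low) exit 0; exit 1 }
    ' "$summary"
}
```

With the example summary above, this prints 0.8 and returns a status of 0, since 0.8 is well above 0.2.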

Running your trained version of Augustus
If you want to run the version of Augustus that you have trained, you will need to tell Augustus where the directory with your parameter files is, eg.:
% setenv AUGUSTUS_CONFIG_PATH /nfs/users/nfs_a/alc/Documents/bin/augustus/config/  
% /nfs/users/nfs_a/alc/Documents/bin/augustus/bin/augustus --AUGUSTUS_CONFIG_PATH=/nfs/users/nfs_a/alc/Documents/bin/augustus/config/ --extrinsicCfgFile=/nfs/users/nfs_a/alc/Documents/bin/augustus/config/extrinsic/extrinsic.M.RM.E.W.cfg --species=myspecies
where /nfs/users/nfs_a/alc/Documents/bin/augustus/bin/augustus is the path to the version of Augustus that you installed,
--AUGUSTUS_CONFIG_PATH points to the directory with your parameter files,
--extrinsicCfgFile points to the extrinsic.cfg file (that contains parameters that Augustus uses for different types of hints),
--species specifies your species name.

Summary of steps for training Augustus
1) Make a set of curated genes, in a GenBank-format file.
2) Check if any of the training genes overlap each other, or have very similar protein sequences.
3) Make a subdirectory for your species in the augustus/config directory, copy the generic parameter files there, rename and edit them to say your species name.
4) Run etraining to check that your GenBank-format file is read ok by Augustus, and that it counts the correct number of genes.
5) Run optimize_augustus.pl to train the meta-parameters.
6) Run etraining to train the intron, exon, intergenic probability files.

Other training possibilities
There are also a few other things that you can train in Augustus:
(i) The file myspecies_weightmatrix.txt. The purpose of this file is described in the Augustus retraining.html file, which says that it usually isn't necessary to change the values in this file.
(ii) You can also provide a file of known splice site sequences for your species, that can be used for training. The name of this file is specified in the ..._parameters.cfg (eg. Test1_parameters.cfg) file. See the Augustus retraining.html file for details. This is optional: it's not necessary to provide this file.
(iii) You can train the parameters that Augustus uses as weights for different types of 'hints', ie. you can train the parameters in the file 'extrinsic.cfg'. When the process (program) that generates the hints is new, you need to then train the weights for the new type of hints. There is information on how to do this in the Augustus README file.

A big thanks to Magdalena Zarowiecki and Jason Tsai for help training Augustus.
Thank you also to Mario Stanke for helpful replies to my questions on preparing a training set.

7 comments:

Unknown said...

Hi
Thanks, that was very informative and helpful. I was just curious what you received after all that effort (in regard to specificity)?
Did you actually try to take a model which is related to yours and add hint files instead to compare the results ?

Cheers
Emanuel

Unknown said...

I was wondering how much this process improved the Augustus predictions. I've tried to do this a couple of times for specific crops, using RNASeq data, but didn't improve significantly compared to just using the basic Arabidopsis training set. In the end I reverted to improving my prediction afterwards with Trinity and PASA. But I'm still thinking of retraining Augustus again.

Unknown said...

Thanks Avril, great description.

@Marcel I trained it on a curated set and further down the line fed in introns and exonparts as hints from cufflinks for the final predictions, and it improved the accuracy quite a bit, but then protein and mRNA evidence was basically non-existent.

Unknown said...

If you use Parallel::ForkManager, the optimize_augustus will run in parallel, cutting down the runtime drastically (depending on the number of cores).

Unknown said...

another useful thing:

it seems like the training set shouldn't contain terminal stop codons split across a splice site.

Unknown said...

Hi, Avril. I really like your blog. I come here sometimes looking for solutions.
When you were calculating the percent identity of the proteins using your own script, did you get a lot of segmentation fault errors?

Ankit Hinsu said...

Hi Avril,

I am running optimise_augustus.pl script and everytime it gives following error.
Splitting training file into 8 buckets...
Genbank input file appears to have fewer records than expected.
This could be a consequence of using DOS (Windows) carriage return symbols at line breaks. at /data1/augustus-3.2.3/scripts/optimize_augustus2.pl line 388, chunk 11900026.

I manually checked my file, but there is no DOS carriage return (all returns are UNIX format). What could be missing in my file.