Monday 25 February 2013

Using RATT to transfer gene predictions between species

The RATT software by Thomas D. Otto (described in a paper by Otto et al. 2010) can be used to transfer gene predictions (or other annotations) from one species to another (or from one version of an assembly for a species, to a different assembly for the same species, or from one strain of a species to a different strain of that species). It is easy to run.

How does RATT work
RATT transfers annotations based upon conserved synteny between assemblies (of one strain), strains or species. First the two sequences (reference assembly with annotations, and the new assembly that you want to transfer annotations to) are compared using 'nucmer' from the MUMmer package, to identify syntenic regions. To be included, synteny blocks must have at least 40% sequence identity at the nucleotide level. Based on the nucmer alignments, RATT identifies a syntenic region in the new assembly for each region in the reference assembly (the one with annotations). It then recalibrates the alignment coordinates within syntenic regions by using SNPs (identified using show-snp from the MUMmer package) as unambiguous anchor points within synteny blocks.

RATT next proceeds to the annotation-mapping step, where it tries to transfer each gene prediction (or other annotation) from the reference assembly to the new assembly. A feature is not mapped if it spans a synteny break, if its coordinates match different chromosomes or different strands in the new assembly, or if the new mapped distance of its coordinates has increased by more than 20 kb. After transferring the annotations to the new assembly/genome, RATT refines the transferred gene predictions to make sure that the predicted start and stop codons look alright, to avoid internal stop codons, and to fix incorrect splice sites.

The RATT paper reported that RATT has good accuracy (shown to be better than the Glimmer gene prediction software), but occasionally makes errors in determining borders of features, as well as in transferring small features, gene families, genes in regions of limited synteny, or very divergent genes.

Running RATT
Say you want to transfer gene predictions from one species (species 1) to another (species 2). 
First, you need embl files of the original gene predictions that you want to transfer, with coordinates with respect to the genome assembly for species 1. If your original gene predictions are in a gff file, you can convert them to embl files (one per chromosome/scaffold) using my perl script Note that this script expects that (i) for each gene, the features in the input gff file for that gene are in the order gene line, mRNA line, CDS line(s); (ii) the scaffold/chromosome names used are the same between the input gff file and input genome fasta file; (iii) the columns in the gff file are separated by tabs and not spaces.

Secondly, you need an assembly fasta file for the genome of the species that you want to transfer gene predictions to (species 2).

The command to use RATT is:
% setenv RATT_HOME /nfs/users/nfs_t/tdo/Bin/ratt/
% $RATT_HOME/ ratt_input_embl species2.fa Transfer1 Species
where '/nfs/users/nfs_t/tdo/Bin/ratt/' is the directory where you have installed RATT,
'ratt_input_embl' is the directory with the embl files for species 1,
'species2.fa' is the fasta file for the genome assembly for species 2,
'Transfer1' is the prefix that will be given to the results file,
'Species' tells RATT that you are transferring annotations between species (if you are transferring between different assemblies for the same species, you can put 'Assembly' here).

I find that RATT requires quite a lot of memory (RAM). For example, to transfer gene predictions to a 135 Mbyte genome, I needed to request 5 Gbyte of memory when I was running it on the Sanger compute farm.

Interpreting RATT's output
In the RATT standard output (goes to the .o file if you are running RATT on the Sanger farm), you get a summary of what RATT has done, looking something like this:
Overview of transfer of annotation elements:
385     elements found.
385     Elements were transfered.
0       Elements could be transfered partially.

2       Elements split.
0       Parts of elements (i.e.exons tRNA) not transferred.
0       Elements couldn't be transferred.

385     Gene models to transfer.
385     Gene models transferred correctly.
0       Gene models partially transferred.
0       Exons not transferred from partial CDS matches.
0       Gene models not transferred.

803 Sequences where generated

For each chromosome/scaffold, RATT makes an output file in embl format called [...]final.embl (eg., with the gene predictions in embl format in your target species (species 2). It is possible to convert these to gff format, using my Perl script You can do this using a shell script like this:
#!/usr/bin/env tcsh
foreach file (*final.embl)
      echo "$file"
      perl -w $file $file.gff . yes

For each chromosome/scaffold, RATT makes an output file called [...]report.gff (eg., which records the differences between the reference annotation and the transferred annotation, including any changes made by RATT in the 'refinement' step, and any problems that RATT thinks are remaining in the final transferred gene predictions (eg. if a gene prediction is missing a start codon).

RATT will make [...]Report.txt files that tell you whether gene predictions were transferred without any potential errors in the transferred gene predictions (eg. the start codon or stop codon of the transferred gene prediction looks wrong). I found that some errors were reported for my transferred gene predictions in these report files. Thomas Otto suggested that at least 80% of the CDS should be transferred for a transferred gene prediction to be used with any confidence.

Problems transferring genes using RATTThere may be a few genes that RATT said it could not transfer. Sometimes I have found that it is possible to transfer these manually using exonerate:
% exonerate -t assembly.fa -q protein.fa --model protein2genome --bestn 1 --showtargetgff > output.exonerate
where assembly.fa is your new assembly, protein.fa is the translation of the gene you want to transfer, output.exonerate is the exonerate output file.

Another thing that I find is that in a few cases RATT transfers (or partially transfer) the gene, but the transferred gene model is not at all similar to the original one. It is possible to weed out cases like this by aligning the translation of the original gene model to the RATT-transferred one, using ggsearch (for global alignment) or ssearch (for local alignment) in the FASTA package.

Viewing RATT's output in artemis/act
You can view transferred annotations, as well as the information in the report.gff file for a chromosome/scaffold, in Artemis, by typing:
[Note to self: I first need to type: 'ssh -Y pcs4']
% art chr1.fa + +

You will see something like this - this example shows a gff tag where the report.gff has highlighted the fact that a gene prediction is missing a start codon (the translation starts with 'I'):


Bioinformatics@VR said...

how to set variable for ratt .

Bioinformatics@VR said...

how to set RATT_HOME variable ??????

Avril Coghlan said...

This is done with the line:

setenv RATT_HOME /nfs/users/nfs_t/tdo/Bin/ratt/