Augustus
Augustus is a gene-finding software based on
Hidden Markov Models (HMMs), described in papers by
Stanke and Waack (2003) and
Stanke et al (2006) and Stanke et al (2006b) and
Stanke et al (2008). It has several interesting features:
- a
Hidden Markov Model (HMM)
is used for gene prediction, containing states for introns, exons,
splice sites, etc. The emission probabilities for bases for each state,
and the transition probabilities between states are estimated based on a
training set of annotated sequences for a species. For a given DNA sequence, Augustus computes the sequence of states and emission lengths that is most likely.
- a way of
modelling intron lengths:
it models the distribution of short introns (which tend to cluster
around a certain length) very accurately, and uses a geometric
distribution only for lengths of long introns.
- uses the empirical distribution as the probabilistic model for
donor splice sites, and for
acceptor splice sites.
For the donor splice site, the empirical distribution is smoothed in a
way that takes into account that patterns 'similar' to a frequent splice
site pattern often also are frequent splice site patterns.
- a model for a short region directly upstream
(-8 to -4) of the donor splice site that takes the reading frame of the exon into account.
- a method to
train the model parameters that depends on the GC-content of the sequence.
Download Augustus
I have used Augustus 2.6.1 for the following. You should download the latest version from the
Augustus website.
Training Augustus
In order to run Augustus, it is a good idea to first train Augustus for your species. Once you have done this, you will then have the trained version of Augustus in a particular folder
[Note to self: I am using: /nfs/helminths/users/jit/bin/augustus.2.6.1/config/species/Sratti.final/].
Making hints files for Augustus using TopHat and Cufflinks
As explained in
Stanke et al (2008), there are 16 types of 'hints' that Augustus can take: start (translation start), stop (translation stop), tss (transcription start), tts (transcription termination), ass (acceptor splice site), dss (donor splice site), exon (exact exon), exonpart (part of exon), intron (exact intron, in CDS/UTR), intronpart (part of an intron), CDS (exact coding part of exon), CDSpart (part of a coding part of exon), UTR (exact UTR), UTRpart (part of a UTR), irpart (part of an intergenic region), and nonexonpart (part of an intergenic region or intron).
A 'hint' refers to a region, and can have strand and reading frame information, and a score, as well as information on the source of evidence. The file containing hints of all types are concatenated into one 'hints' file for Augustus. Hints are 'grouped' so that all hints from one group
are thought to belong to the same transcript.
There is a nice webpage on the
Augustus website about using RNA-seq data.
My colleague Jason Tsai suggested a protocol for making a 'hints' file for Augustus with 'intron' and 'exonpart' hints, based on ideas in a paper by
Trapnell et al (2012). These are the steps, assuming you start with a set of bam files of mapped RNA-seq data (one file for each RNA-seq sample):
(i) Merge your bam files into one bam, eg.
% samtools merge merged.bam 4982_8.bam 5287_6.bam 7007_1.bam
where merged.bam is the merged bam file, and the sample bam files are 4982_8.bam, 5287_6.bam & 7007_1.bam.
In my case, for 15 bam files, this took about 3 hours and used about 15 Mbyte of RAM.
(ii) Make intron 'hints' for Augustus, using the bam2hints program that comes with Augustus, eg.
% /nfs/users/nfs_a/alc/Documents/bin/augustus/auxprogs/bam2hints/bam2hints --intronsonly --minintronlen=15 --in=merged.bam --out=merged.intronhints.gff
where /nfs/users/nfs_a/alc/Documents/bin/augustus is where Augustus was installed, merged.bam is your merged bam file, merged.intronhints.gff is the output file of intron hints.
In my case, this took about 45 mins to run, and needed 400 Mbyte of RAM.
The merged.intronhints.gff file looks like this:
Sratt_Chr00_000001 b2h intron 2591 3102 0 . . pri=4;src=E
Sratt_Chr00_000001 b2h intron 4191 4247 0 . . pri=4;src=E
Sratt_Chr00_000001 b2h intron 4600 4635 0 . . mult=1691;pri=4;src=E
Sratt_Chr00_000001 b2h intron 4600 5030 0 . . mult=130;pri=4;src=E
Sratt_Chr00_000001 b2h intron 8601 8649 0 . . pri=4;src=E
Sratt_Chr00_000001 b2h intron 10729 10764 0 . . mult=182;pri=4;src=E
Sratt_Chr00_000001 b2h intron 10729 10781 0 . . pri=4;src=E
...
There seems to be 'pri=4' on every line of the file, this gives the priority of the hint (see the
augustus README.TXT). The 'mult=x' value gives the number of reads that supported the intron. 'src=E' means the data comes from ESTs (see
here).
(iii) Make transcript assemblies based on your RNA-seq data, using cufflinks. In the
Trapnell et al (2012) paper, it is recommended to make a transcript assembly for each sample (bam file), and then afterwards merge the assemblies from the different samples using cuffmerge. This is partly because each sample may have a different mix of transcripts (especially if they are for different conditions), but also because it is computationally intensive to make the transcriptome assembly with cufflinks. We can make a transcript assembly for a sample using a command like this:
% /nfs/helminths/users/jit/bin/cufflinks-2.0.2.Linux_x86_64/cufflinks --min-intron-length 15 -p 8 -o 4982_8.cuff 4982_8.bam
where -p8 tells cufflinks to run the job on 8 processors, 4982_8.bam is the input bam file, 4982_8.cuff is the name to give to the cufflinks output directory.
This needed about 5000 Mbyte of RAM to run on my bam files (which were each about 3-5 Gbyte), and took about 40 mins.
In the cufflinks output directory for a bam file, there will be output files: genes.fpkm_tracking, isoforms.fpkm_tracking, skipped.gtf, transcripts.gtf. The transcripts.gtf file looks like this:
Sratt_Chr00_000001 Cufflinks transcript 17410 17463 1000 . . gene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM "1511144.7530642648"; frac "1.000000"; conf_lo "555412.894285"; conf_hi "2466876.611843"; cov "2020888.597041";
Sratt_Chr00_000001 Cufflinks exon 17410 17463 1000 . . gene_id "CUFF.1"; transcript_id "CUFF.1.1"; exon_number "1"; FPKM "1511144.7530642648"; frac "1.000000"; conf_lo "555412.894285"; conf_hi "2466876.611843"; cov "2020888.597041
";
Sratt_Chr00_000001 Cufflinks transcript 20998 21381 1000 . . gene_id "CUFF.2"; transcript_id "CUFF.2.1"; FPKM "28.5663833221"; frac "1.000000"; conf_lo "20.404560"; conf_hi "36.728207"; cov "49.897118";
Sratt_Chr00_000001 Cufflinks exon 20998 21381 1000 . . gene_id "CUFF.2"; transcript_id "CUFF.2.1"; exon_number "1"; FPKM "28.5663833221"; frac "1.000000"; conf_lo "20.404560"; conf_hi "36.728207"; cov "49.897118";
Sratt_Chr00_000001 Cufflinks transcript 17740 18274 1000 - . gene_id "CUFF.3"; transcript_id "CUFF.3.1"; FPKM "104.2588346550"; frac "1.000000"; conf_lo "94.306829"; conf_hi "114.210840"; cov "272.185806";
Sratt_Chr00_000001 Cufflinks exon 17740 17854 1000 - . gene_id "CUFF.3"; transcript_id "CUFF.3.1"; exon_number "1"; FPKM "104.2588346550"; frac "1.000000"; conf_lo "94.306829"; conf_hi "114.210840"; cov "272.185806";
Sratt_Chr00_000001 Cufflinks exon 17905 18274 1000 - . gene_id "CUFF.3"; transcript_id "CUFF.3.1"; exon_number "2"; FPKM "104.2588346550"; frac "1.000000"; conf_lo "94.306829"; conf_hi "114.210840"; cov "272.185806";
...
(iv) Merge the transcript assemblies using cuffmerge, eg.
% /nfs/helminths/users/jit/bin/cufflinks-2.0.2.Linux_x86_64/cuffmerge -p 8 -s SRATT.V5.fa assemblies.txt
where -p8 tells cuffmerge to run the job on 8 processors, SRATT.V5.fa is the assembly file for the genome, and assemblies.txt is a file containing a list of the gtf files that are made by cufflinks in the previous step, eg.:
4982_8.cuff/transcripts.gtf
5287_6.cuff/transcripts.gtf
7007_1.cuff/transcripts.gtf
...
I found that cuffmerge took about 5 mins to run, and used 50 Mbyte of RAM. It made an output directory 'merged_asm'. This directory has a file 'merged.gtf' that looks like this:
Sratt_Chr00_000001 Cufflinks exon 10333 10728 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.7.1"; tss_id "TSS1";
Sratt_Chr00_000001 Cufflinks exon 10765 10837 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; oId "CUFF.7.1"; tss_id "TSS1";
Sratt_Chr00_000001 Cufflinks exon 10888 11226 . + . gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "3"; oId "CUFF.7.1"; tss_id "TSS1";
Sratt_Chr00_000001 Cufflinks exon 12003 12301 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.6.1"; tss_id "TSS2";
Sratt_Chr00_000001 Cufflinks exon 12341 12413 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; oId "CUFF.6.1"; tss_id "TSS2";
Sratt_Chr00_000001 Cufflinks exon 12463 12782 . + . gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "3"; oId "CUFF.6.1"; tss_id "TSS2";
Sratt_Chr00_000001 Cufflinks exon 14215 14561 . + . gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "1"; oId "CUFF.4.1"; tss_id "TSS3";
Sratt_Chr00_000001 Cufflinks exon 14608 14680 . + . gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "2"; oId "CUFF.4.1"; tss_id "TSS3";
Sratt_Chr00_000001 Cufflinks exon 14731 15042 . + . gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "3"; oId "CUFF.4.1"; tss_id "TSS3";
Sratt_Chr00_000001 Cufflinks exon 15538 15874 . + . gene_id "XLOC_000004"; transcript_id "TCONS_00000004"; exon_number "1"; oId "CUFF.3.1"; tss_id "TSS4";
Sratt_Chr00_000001 Cufflinks exon 15916 15988 . + . gene_id "XLOC_000004"; transcript_id "TCONS_00000004"; exon_number "2"; oId "CUFF.3.1"; tss_id "TSS4";
Sratt_Chr00_000001 Cufflinks exon 16037 16178 . + . gene_id "XLOC_000004"; transcript_id "TCONS_00000004"; exon_number "3"; oId "CUFF.3.1"; tss_id "TSS4";
...
This file has isoforms merged from the different input files, as explained in the
cufflinks manual.
[Note: (An aside) My colleague Bhavana Harsha pointed out to me that cuffmerge produces an intersection of the transcripts in the input gtf files. In contrast, there is another program, cuffcompare, that produces the union of the transcripts in the input gtf file, ie. a non-redundant set of transcripts across all input files with a single representative transcript chosen for each clique of matching transcripts across samples. Bhavana pointed me to
this discussion of the differences between cuffmerge and cuffcompare - thanks Bhavana!]
(v) Use the cuffmerge output to make exon 'hints' for Augustus, eg.
% cd merged_asm
% /nfs/users/nfs_j/jit/repository/pathogen/user/jit/Gene_models/cufflinkGTF2augustusExonParthints.pl merged.gtf
where the script cufflinkGTF2augustusExonParthints.pl was written by my colleague Jason Tsai.
This made an output file 'merged.gtf.augustusexonpart.gff'.
This took about 5 sec to run, and needed 2 Mbyte of RAM.
The output file looks like this:
Sratt_Chr00_000001 b2h ep 10333 10728 . + . grp=TCONS_00000001;pri=4;src=W
Sratt_Chr00_000001 b2h ep 10765 10837 . + . grp=TCONS_00000001;pri=4;src=W
Sratt_Chr00_000001 b2h ep 10888 11226 . + . grp=TCONS_00000001;pri=4;src=W
Sratt_Chr00_000001 b2h ep 12003 12301 . + . grp=TCONS_00000002;pri=4;src=W
Sratt_Chr00_000001 b2h ep 12341 12413 . + . grp=TCONS_00000002;pri=4;src=W
Sratt_Chr00_000001 b2h ep 12463 12782 . + . grp=TCONS_00000002;pri=4;src=W
Sratt_Chr00_000001 b2h ep 14215 14561 . + . grp=TCONS_00000003;pri=4;src=W
Sratt_Chr00_000001 b2h ep 14608 14680 . + . grp=TCONS_00000003;pri=4;src=W
Sratt_Chr00_000001 b2h ep 14731 15042 . + . grp=TCONS_00000003;pri=4;src=W
Sratt_Chr00_000001 b2h ep 15538 15874 . + . grp=TCONS_00000004;pri=4;src=W
Sratt_Chr00_000001 b2h ep 15916 15988 . + . grp=TCONS_00000004;pri=4;src=W
Sratt_Chr00_000001 b2h ep 16037 16178 . + . grp=TCONS_00000004;pri=4;src=W
Sratt_Chr00_000001 b2h ep 32665 35439 . + . grp=TCONS_00000005;pri=4;src=W
Sratt_Chr00_000001 b2h ep 35486 35691 . + . grp=TCONS_00000005;pri=4;src=W
Sratt_Chr00_000001 b2h ep 35741 36281 . + . grp=TCONS_00000005;pri=4;src=W
Sratt_Chr00_000001 b2h ep 36646 37662 . + . grp=TCONS_00000006;pri=4;src=W
Sratt_Chr00_000001 b2h ep 37717 38464 . + . grp=TCONS_00000006;pri=4;src=W
Sratt_Chr00_000001 b2h ep 38505 38668 . + . grp=TCONS_00000006;pri=4;src=W
...
Again, 'pri=4' is the priority of the hint, but 'src=W' means the hint comes from RNA-seq data. The 'ep' in the third column mean that it is an exonpart hint. The 'grp=' is used to show that certain exonpart hints were generated by alignment of the same cufflinks transcript, and so probably come from the same transcript (see the
Augustus README.TXT).
(vi) Merge the intron and exon 'hints' files, eg.
% cat merged.intronhints.gff merged_asm/merged.gtf.augustusexonpart.gff >
merged.intronhints.cuffmergeEPhint.gff
Making hints files for Augustus using exonerate alignments of ESTs and proteins
As well as using RNA-seq alignments to make hints for Augustus, we can also use alignments of ESTs and proteins. I have a script run_exonerate_after_blast.pl (described
in my blog here) that can be used to align a file of proteins, or a file of ESTs, to the regions of a genome assembly that have BLAST matches to the proteins/ESTs.
If you used this to align proteins to a genome, you can then convert the output into hints for Augustus using my script make_exonerate_hints_for_augustus.pl:
% perl -w make_exonerate_hints_for_augustus.pl exonerate1.gff exonerate1.gff_hints prot
where exonerate1.gff was the output of run_exonerate_after_blast.pl, and exonerate1.gff_hints is the output file.
Likewise, if you used run_exonerate_after_blast.pl to align ESTs to a genome, you can convert the output into hints for Augustus using:
% perl -w make_exonerate_hints_for_augustus.pl exonerate2.gff exonerate2.gff_hints est
You can then concatenate the different hints files to give to Augustus:
% cat exonerate1.gff_hints exonerate2.gff_hints > all_hints
You can concatenate this with RNA-seq hints too, if you have them.
Note that the make_exonerate_hints_for_augustus.pl script labels the protein hints as source=P, and the EST hints as source=E. It infers 'exonpart' hints and 'intron' hints from the EST alignments, and 'CDSpart' and 'intron' hints from the protein alignments. The whole exon/CDS from exonerate is not used for an exonpart/CDSpart hint; 9-bp is subtracted from each end, to take into account that the ends of the boundaries of the exonerate exon/CDS features might not be 100% accurate. If you want to use these hints in Augustus, you will have to have weights for bonuses and maluses for these sources of evidence in the Augustus configuration file (.cfg file, see below).
Running Augustus
I used the following steps to run Augustus, based on helpful advice from Jason Tsai and Magdalena Zarowiecki:
(i) Edit the parameters settings file
There is a lot of useful information about this file
here. By default, a file config/extrinsic/extrinsic.M.RM.E.W.cfg comes with Augustus, that defines what are the 'bonuses' and 'maluses' to give to each type of 'hint'. You can edit this file if you like. At the top of the file it gives the data sources that are being used, for example:
[SOURCES]
M RM E W
where 'M' = manual annotation, 'R' = retroposed genes, 'E' = EST/cDNA database hit, 'W' = RNA-Seq coverage information. If you wanted to you could add extra source types to the '[SOURCES]' part of your configuration file, eg. 'P' for protein database hit. (Note: I think 'RM' means repeat-masking here.)
In my case I used the file /nfs/helminths/users/jit/bin/augustus.2.6.1/extrinsic.M.RM.E.W.cfg that was edited by my colleague Jason Tsai. The difference between this file and the default one that comes with Augustus is the line that says what is the bonus/malus for an 'exonpart' hint. Jason's file says:
exonpart 1 .992 M 1 1e+100 RM 1 1 E 1 1 W 1 1e5
The default file that comes with Augustus says:
exonpart 1 .992 M 1 1e+100 RM 1 1 E 1 1 W 1 1.005
Note the second and third columns give the bonus and malus for an 'exonpart' hint, 1 is the bonus and 0.992 is the malus. The bonus is given to a predicted feature that obeys the hint, while the malus is a penalty given to a predicted feature not supported by any hint. A bonus of 1 effectively means no bonus, while a malus of 1 effectively means no penalty. An 'exonpart' bonus only applies to the exons that contain the interval from the hint (just overlapping means no bonus is given). The Augustus
README.TXT suggests that a malus of close to 1 should be used, eg. 0.99 (here 0.992 is used).
Starting in column 4, you can tell Augustus how to modify the bonus depending on the source of the hint. The format for this part is:
source malus_factor bonus_factor
where malus_factor is the factor a malus will be multiplied by, for hints of this source,
bonus_factor is the factor a bonus will be multiplied by, for hints of this source.
This allows us to multiply the 'exonpart' bonuses for exonpart hints from one source (eg. RNA-seq) by a different factor compared to exonpart hints from another source (eg. protein alignment).
The two files differ in the bonus and malus for the RNA-seq coverage information (the 'W' columns). In Jason's file, he has 1e5, which means that the bonus for 'exonpart' hints from RNA-seq data are multipled by 1e5. This means that they are given a lot of weight. That is, predicted exons that contain the interval from an RNA-seq hint are given a lot of weight, and likely to be used in an Augustus gene prediction.
There is more information about this parameter settings file
here.
To add another source of information, eg. 'P' for protein alignments, we can change the top of the config file to be:
[SOURCES]
M RM E W P
Then we have to add weights for the bonuses and maluses for these types of alignments to the '[GENERAL]' part of the file:
exonpart 1 .992 M 1 1e+100 RM 1 1 E 1 1 W 1 1e5 P 1 1
intron 1 .34 M 1 1e+100 RM 1 1 E 1 1e5 W 1 1 P 1 1e3
CDSpart 1 1 0.985 M 1 1e+100 RM 1 1 E 1 1 W 1 1 P 1 1e3
This will give factor 1e5 to introns from RNA-seq or EST alignments, and 1e3 to introns from protein alignment.
(ii) Run Augustus using the 'hints', eg.
% /nfs/users/nfs_a/alc/Documents/bin/augustus/bin/augustus SRATT.V5.fa --AUGUSTUS_CONFIG_PATH=/nfs/helminths/users/jit/bin/augustus.2.6.1/config/ --extrinsicCfgFile=/nfs/helminths/users/jit/bin/augustus.2.6.1/extrinsic.M.RM.E.W.cfg --species=Sratti.final --alternatives-from-evidence=true --hintsfile=merged.intronhints.cuffmergeEPhint.gff --allow_hinted_splicesites=atac --min_intron_len=15
where /nfs/users/nfs_a/alc/Documents/bin/augustus/bin/augustus is the installed version of Augustus (2.6.1 here), SRATT.V5.fa is the genome assembly fasta file, /nfs/helminths/users/jit/bin/augustus.2.6.1/config/ is the directory containing the trained versions of Augustus, /nfs/helminths/users/jit/bin/augustus.2.6.1/extrinsic.M.RM.E.W.cfg is the file with the bonuses and maluses for hints, Sratti.final is the name of the trained version of Augustus, --alternatives-from-evidence=true means that alternatives splices are predicted if they are supported by evidence, merged.intronhints.cuffmergeEPhint.gff is the hints file, --allow_hinted_splicesites=atac lets Augustus predicted AC--AC introns (as well as GT--AG introns), -min_intron_len=15 specifies that introns should be at least 15 bp.
In my case this took about 5 hours to run, and needed about 2000 Mbyte of RAM.
The output from Augustus is written to standard output [note to self: this goes in the bsub .o output].
(iii) Run Augustus without 'hints'. This was suggested by my colleagues Magdalena Zarowiecki and Jason Tsai, who found that Augustus predicts few genes that are not supported by hints when you run it in the 'hints' mode. You can run Augustus without hints like this:
% /nfs/users/nfs_a/alc/Documents/bin/augustus/bin/augustus SRATT.V5.fa --AUGUSTUS_CONFIG_PATH=/nfs/helminths/users/jit/bin/augustus.2.6.1/config/ --extrinsicCfgFile=/nfs/helminths/users/jit/bin/augustus.2.6.1/extrinsic.M.RM.E.W.cfg --species=Sratti.final --alternatives-from-evidence=true --allow_hinted_splicesites=atac --min_intron_len=15
In my case this took about 1.5 hours and needed 1000 Mbyte of RAM.
Post-processing the Augustus output:
(i) Make a gff file from the Augustus output file run using hints, eg.
% grep AUGUSTUS bsub_hints.o > Sratti.hints.augustus.gff
where bsub_hints.o is the standard output file from the Augustus job with hints [run on the Sanger farm in my case].
(ii) Make a gff file from the Augustus output file run without hints, eg.
% grep AUGUSTUS bsub_nohints.o > Sratti.nohints.augustus.gff
where bsub_nohints.o is the standard output from the Augustus job without hints.
(iii) Combine the predictions with and without hints:
To do this, we first make files with just the coordinates of predicted genes, eg.
% grep AUGUSTUS bsub.o | grep -v transcript > Sratti.hints.augustus.genes.gff
% grep AUGUSTUS bsub.o | grep -v transcript > Sratti.nohints.augustus.genes.gff
Then we can find genes in Sratti.nohints.augustus.genes.gff that don't overlap any gene in Sratti.hints.augustus.genes.gff using
BEDTools, eg.:
% /nfs/users/nfs_a/alc/Documents/bin/bedtools-2.17.0/bin/intersectBed -v -a Sratti.nohints.augustus.gff -b Sratti.hints.augustus.gff > Sratti.nonoverlapping.gff
Then we need to add back the 'CDS', 'gene', 'intron', 'start_codon', 'stop_codon' and 'transcript' lines:
% perl -w add_back_lines.pl bsub_hints.o Sratti.hints.augustus.gff > Sratti.hints.augustus2.gff
% perl -w add_back_lines.pl bsub_nohints.o Sratti.nonoverlapping.gff > Sratti.nonoverlapping2.gff
where add_back_lines.pl is a script that I wrote to do this.
Then we need to renumber the genes in Sratti.nonoverlapping2.gff. Gene numbers are up to 12154 in Sratti.hints.augustus2.gff. So need to number genes from 12155 upwards in Sratti.nonoverlapping2.gff:
% perl -w renumber_genes.pl Sratti.nonoverlapping2.gff 12154 > Sratti.nonoverlapping3.gff
where renumber_genes.pl is a script that I wrote to do this.
Now add together the files Sratti.hints.augustus2.gff and Sratti.nonoverlapping3.gff:
% cat Sratti.hints.augustus2.gff Sratti.nonoverlapping3.gff > Sratti.gff
(iv) Infer the spliced transcript sequences, and protein sequences from the gff:
We can infer the spliced transcript sequences from the gff using:
% perl -w get_spliced_transcripts_from_gff.pl Sratti.genes.gff SRATT.V5.fa Sratti.spliced.fa . no no yes
where
get_spliced_transcripts_from_gff.pl is a Perl script that I wrote to do this, SRATT.V5.fa is the assembly fasta file, and Sratti.spliced.fa is the output file of spliced transcript sequences.
Then we can translate the transcripts:
% perl -w translate_spliced_dna.pl Sratti.spliced.fa Sratti.translated.fa .
where
translate_spliced_dna.pl is a Perl script that I wrote, and Sratti.translated.fa is the output file of protein sequences.
(v) Make embl files from the Augustus gff:
If you want to view the Augustus predictions in
Artemis, you can do this by first converting the embl gff file to embl files (one per scaffold):
% perl -w gff_to_embl.pl Sratti.genes.gff SRATT.V5.fa . no no no
where
gff_to_embl.pl is a script that I wrote to do this, and SRATT.V5.fa is the genome fasta file.
This will make one embl file per scaffold. You can then open one of the embl files in Artemis by typing:
% art Sratt_Chr00_000001.embl [note to self: first need to do: 'ssh -Y pcs4']
If there are two alternative splice forms of the same gene, these will appear as two separate 'genes' in Artemis, for example:
and
Other Augustus options
% augustus --paramlist : give full list of Augustus options
% augustus --genemodel=partial : allow prediction of partial genes [default]
% augustus --complete : only predict complete genes
% augustus --singlestrand=true : predicts genes on each strand independently, allowing overlapping genes between the two strands [by default, does not allow overlapping genes on opposite strands].
% augustus --noInFrameStop=true : do not report transcripts that contain internal stop codons. By default, Augustus doesn't have this option turned on, and this means that it sometimes reports transcripts that contain internal stop codons, that occur across exon-exon boundaries.
Thank you to Magdalena Zarowiecki and Jason Tsai, who have both helped me a lot with Augustus.