Thursday 16 May 2013

Running the Augustus gene-finding software

Augustus
Augustus is a gene-finding software based on Hidden Markov Models (HMMs), described in papers by Stanke and Waack (2003) and Stanke et al (2006) and Stanke et al (2006b) and Stanke et al (2008). It has several interesting features:
- a Hidden Markov Model (HMM) is used for gene prediction, containing states for introns, exons, splice sites, etc. The emission probabilities for bases for each state, and the transition probabilities between states are estimated based on a training set of annotated sequences for a species. For a given DNA sequence, Augustus computes the sequence of states and emission lengths that is most likely.
- a way of modelling intron lengths: it models the distribution of short introns (which tend to cluster around a certain length) very accurately, and uses a geometric distribution only for lengths of long introns.
- uses the empirical distribution as the probabilistic model for donor splice sites, and for acceptor splice sites. For the donor splice site, the empirical distribution is smoothed in a way that takes into account that patterns 'similar' to a frequent splice site pattern often also are frequent splice site patterns.
- a model for a short region directly upstream (-8 to -4) of the donor splice site that takes the reading frame of the exon into account.
- a method to train the model parameters that depends on the GC-content of the sequence.

Download Augustus
I have used Augustus 2.6.1 for the following. You should download the latest version from the Augustus website.
 
Training Augustus
In order to run Augustus, it is a good idea to first train Augustus for your species. Once you have done this, you will then have the trained version of Augustus in a particular folder 
[Note to self: I am using: /nfs/helminths/users/jit/bin/augustus.2.6.1/config/species/Sratti.final/].


Making hints files for Augustus using TopHat and Cufflinks
As explained in Stanke et al (2008), there are 16 types of 'hints' that Augustus can take: start (translation start), stop (translation stop), tss (transcription start), tts (transcription termination), ass (acceptor splice site), dss (donor splice site), exon (exact exon), exonpart (part of exon), intron (exact intron, in CDS/UTR), intronpart (part of an intron), CDS (exact coding part of exon), CDSpart (part of a coding part of exon), UTR (exact UTR), UTRpart (part of a UTR), irpart (part of an intergenic region), and nonexonpart (part of an intergenic region or intron).

A 'hint' refers to a region, and can have strand and reading frame information, and a score, as well as information on the source of evidence. The file containing hints of all types are concatenated into one 'hints' file for Augustus. Hints are 'grouped' so that all hints from one group are thought to belong to the same transcript. 

There is a nice webpage on the Augustus website about using RNA-seq data.

My colleague Jason Tsai suggested a protocol for making a 'hints' file for Augustus with 'intron' and 'exonpart' hints, based on ideas in a paper by Trapnell et al (2012). These are the steps, assuming you start with a set of bam files of mapped RNA-seq data (one file for each RNA-seq sample):

(i) Merge your bam files into one bam, eg.
% samtools merge merged.bam 4982_8.bam  5287_6.bam  7007_1.bam
where merged.bam is the merged bam file, and the sample bam files are 4982_8.bam, 5287_6.bam & 7007_1.bam.
In my case, for 15 bam files, this took about 3 hours and used about 15 Mbyte of RAM.

(ii) Make intron 'hints' for Augustus, using the bam2hints program that comes with Augustus, eg.
% /nfs/users/nfs_a/alc/Documents/bin/augustus/auxprogs/bam2hints/bam2hints --intronsonly --minintronlen=15 --in=merged.bam --out=merged.intronhints.gff
where /nfs/users/nfs_a/alc/Documents/bin/augustus is where Augustus was installed, merged.bam is your merged bam file,  merged.intronhints.gff is the output file of intron hints.
In my case, this took about 45 mins to run, and needed 400 Mbyte of RAM.

The merged.intronhints.gff file looks like this:
Sratt_Chr00_000001      b2h     intron  2591    3102    0       .       .       pri=4;src=E
Sratt_Chr00_000001      b2h     intron  4191    4247    0       .       .       pri=4;src=E
Sratt_Chr00_000001      b2h     intron  4600    4635    0       .       .       mult=1691;pri=4;src=E
Sratt_Chr00_000001      b2h     intron  4600    5030    0       .       .       mult=130;pri=4;src=E
Sratt_Chr00_000001      b2h     intron  8601    8649    0       .       .       pri=4;src=E
Sratt_Chr00_000001      b2h     intron  10729   10764   0       .       .       mult=182;pri=4;src=E
Sratt_Chr00_000001      b2h     intron  10729   10781   0       .       .       pri=4;src=E

...
There seems to be 'pri=4' on every line of the file, this gives the priority of the hint (see the augustus README.TXT). The 'mult=x' value gives the number of reads that supported the intron. 'src=E' means the data comes from ESTs (see here).

(iii) Make transcript assemblies based on your RNA-seq data, using cufflinks. In the Trapnell et al (2012) paper, it is recommended to make a transcript assembly for each sample (bam file), and then afterwards merge the assemblies from the different samples using cuffmerge. This is partly because each sample may have a different mix of transcripts (especially if they are for different conditions), but also because it is computationally intensive to make the transcriptome assembly with cufflinks. We can make a transcript assembly for a sample using a command like this:
% /nfs/helminths/users/jit/bin/cufflinks-2.0.2.Linux_x86_64/cufflinks --min-intron-length 15 -p 8 -o 4982_8.cuff 4982_8.bam
where -p8 tells cufflinks to run the job on 8 processors, 4982_8.bam is the input bam file, 4982_8.cuff is the name to give to the cufflinks output directory.

This needed about 5000 Mbyte of RAM to run on my bam files (which were each about 3-5 Gbyte), and took about 40 mins.

In the cufflinks output directory for a bam file, there will be output files: genes.fpkm_tracking, isoforms.fpkm_tracking, skipped.gtf, transcripts.gtf. The transcripts.gtf file looks like this:
Sratt_Chr00_000001      Cufflinks       transcript      17410   17463   1000    .       .       gene_id "CUFF.1"; transcript_id "CUFF.1.1"; FPKM "1511144.7530642648"; frac "1.000000"; conf_lo "555412.894285"; conf_hi "2466876.611843"; cov "2020888.597041";
Sratt_Chr00_000001      Cufflinks       exon    17410   17463   1000    .       .       gene_id "CUFF.1"; transcript_id "CUFF.1.1"; exon_number "1"; FPKM "1511144.7530642648"; frac "1.000000"; conf_lo "555412.894285"; conf_hi "2466876.611843"; cov "2020888.597041
";
Sratt_Chr00_000001      Cufflinks       transcript      20998   21381   1000    .       .       gene_id "CUFF.2"; transcript_id "CUFF.2.1"; FPKM "28.5663833221"; frac "1.000000"; conf_lo "20.404560"; conf_hi "36.728207"; cov "49.897118";
Sratt_Chr00_000001      Cufflinks       exon    20998   21381   1000    .       .       gene_id "CUFF.2"; transcript_id "CUFF.2.1"; exon_number "1"; FPKM "28.5663833221"; frac "1.000000"; conf_lo "20.404560"; conf_hi "36.728207"; cov "49.897118";
Sratt_Chr00_000001      Cufflinks       transcript      17740   18274   1000    -       .       gene_id "CUFF.3"; transcript_id "CUFF.3.1"; FPKM "104.2588346550"; frac "1.000000"; conf_lo "94.306829"; conf_hi "114.210840"; cov "272.185806";
Sratt_Chr00_000001      Cufflinks       exon    17740   17854   1000    -       .       gene_id "CUFF.3"; transcript_id "CUFF.3.1"; exon_number "1"; FPKM "104.2588346550"; frac "1.000000"; conf_lo "94.306829"; conf_hi "114.210840"; cov "272.185806";
Sratt_Chr00_000001      Cufflinks       exon    17905   18274   1000    -       .       gene_id "CUFF.3"; transcript_id "CUFF.3.1"; exon_number "2"; FPKM "104.2588346550"; frac "1.000000"; conf_lo "94.306829"; conf_hi "114.210840"; cov "272.185806";
...


(iv) Merge the transcript assemblies using cuffmerge, eg.
% /nfs/helminths/users/jit/bin/cufflinks-2.0.2.Linux_x86_64/cuffmerge -p 8 -s SRATT.V5.fa assemblies.txt
where -p8 tells cuffmerge to run the job on 8 processors, SRATT.V5.fa is the assembly file for the genome, and assemblies.txt is a file containing a list of the gtf files that are made by cufflinks in the previous step, eg.:
4982_8.cuff/transcripts.gtf
5287_6.cuff/transcripts.gtf
7007_1.cuff/transcripts.gtf

...
I found that cuffmerge took about 5 mins to run, and used 50 Mbyte of RAM. It made an output directory 'merged_asm'. This directory has a file 'merged.gtf' that looks like this:
Sratt_Chr00_000001      Cufflinks       exon    10333   10728   .       +       .       gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "1"; oId "CUFF.7.1"; tss_id "TSS1";
Sratt_Chr00_000001      Cufflinks       exon    10765   10837   .       +       .       gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "2"; oId "CUFF.7.1"; tss_id "TSS1";
Sratt_Chr00_000001      Cufflinks       exon    10888   11226   .       +       .       gene_id "XLOC_000001"; transcript_id "TCONS_00000001"; exon_number "3"; oId "CUFF.7.1"; tss_id "TSS1";
Sratt_Chr00_000001      Cufflinks       exon    12003   12301   .       +       .       gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "1"; oId "CUFF.6.1"; tss_id "TSS2";
Sratt_Chr00_000001      Cufflinks       exon    12341   12413   .       +       .       gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "2"; oId "CUFF.6.1"; tss_id "TSS2";
Sratt_Chr00_000001      Cufflinks       exon    12463   12782   .       +       .       gene_id "XLOC_000002"; transcript_id "TCONS_00000002"; exon_number "3"; oId "CUFF.6.1"; tss_id "TSS2";
Sratt_Chr00_000001      Cufflinks       exon    14215   14561   .       +       .       gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "1"; oId "CUFF.4.1"; tss_id "TSS3";
Sratt_Chr00_000001      Cufflinks       exon    14608   14680   .       +       .       gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "2"; oId "CUFF.4.1"; tss_id "TSS3";
Sratt_Chr00_000001      Cufflinks       exon    14731   15042   .       +       .       gene_id "XLOC_000003"; transcript_id "TCONS_00000003"; exon_number "3"; oId "CUFF.4.1"; tss_id "TSS3";
Sratt_Chr00_000001      Cufflinks       exon    15538   15874   .       +       .       gene_id "XLOC_000004"; transcript_id "TCONS_00000004"; exon_number "1"; oId "CUFF.3.1"; tss_id "TSS4";
Sratt_Chr00_000001      Cufflinks       exon    15916   15988   .       +       .       gene_id "XLOC_000004"; transcript_id "TCONS_00000004"; exon_number "2"; oId "CUFF.3.1"; tss_id "TSS4";
Sratt_Chr00_000001      Cufflinks       exon    16037   16178   .       +       .       gene_id "XLOC_000004"; transcript_id "TCONS_00000004"; exon_number "3"; oId "CUFF.3.1"; tss_id "TSS4";
...

This file has isoforms merged from the different input files, as explained in the cufflinks manual.

[Note: (An aside) My colleague Bhavana Harsha pointed out to me that cuffmerge produces an intersection of the transcripts in the input gtf files. In contrast, there is another program, cuffcompare, that produces the union of the transcripts in the input gtf file, ie. a non-redundant set of transcripts across all input files with a single representative transcript chosen for each clique of matching transcripts across samples. Bhavana pointed me to this discussion of the differences between cuffmerge and cuffcompare - thanks Bhavana!]

(v) Use the cuffmerge output to make exon 'hints' for Augustus, eg.
% cd merged_asm
% /nfs/users/nfs_j/jit/repository/pathogen/user/jit/Gene_models/cufflinkGTF2augustusExonParthints.pl merged.gtf
where the script cufflinkGTF2augustusExonParthints.pl was written by my colleague Jason Tsai.
This made an output file 'merged.gtf.augustusexonpart.gff'.
This took about 5 sec to run, and needed 2 Mbyte of RAM.
The output file looks like this:
Sratt_Chr00_000001      b2h     ep      10333   10728   .       +       .       grp=TCONS_00000001;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      10765   10837   .       +       .       grp=TCONS_00000001;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      10888   11226   .       +       .       grp=TCONS_00000001;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      12003   12301   .       +       .       grp=TCONS_00000002;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      12341   12413   .       +       .       grp=TCONS_00000002;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      12463   12782   .       +       .       grp=TCONS_00000002;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      14215   14561   .       +       .       grp=TCONS_00000003;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      14608   14680   .       +       .       grp=TCONS_00000003;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      14731   15042   .       +       .       grp=TCONS_00000003;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      15538   15874   .       +       .       grp=TCONS_00000004;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      15916   15988   .       +       .       grp=TCONS_00000004;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      16037   16178   .       +       .       grp=TCONS_00000004;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      32665   35439   .       +       .       grp=TCONS_00000005;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      35486   35691   .       +       .       grp=TCONS_00000005;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      35741   36281   .       +       .       grp=TCONS_00000005;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      36646   37662   .       +       .       grp=TCONS_00000006;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      37717   38464   .       +       .       grp=TCONS_00000006;pri=4;src=W
Sratt_Chr00_000001      b2h     ep      38505   38668   .       +       .       grp=TCONS_00000006;pri=4;src=W 

...
Again, 'pri=4' is the priority of the hint, but 'src=W' means the hint comes from RNA-seq data. The 'ep' in the third column mean that it is an exonpart hint. The 'grp=' is used to show that certain exonpart hints were generated by alignment of the same cufflinks transcript, and so probably come from the same transcript (see the Augustus README.TXT).

(vi) Merge the intron and exon 'hints' files, eg.
% cat merged.intronhints.gff merged_asm/merged.gtf.augustusexonpart.gff >
merged.intronhints.cuffmergeEPhint.gff

Making hints files for Augustus using exonerate alignments of ESTs and proteins
As well as using RNA-seq alignments to make hints for Augustus, we can also use alignments of ESTs and proteins. I have a script run_exonerate_after_blast.pl (described in my blog here) that can be used to align a file of proteins, or a file of ESTs, to the regions of a genome assembly that have BLAST matches to the proteins/ESTs. 

If you used this to align proteins to a genome, you can then convert the output into hints for Augustus using my script make_exonerate_hints_for_augustus.pl:
% perl -w make_exonerate_hints_for_augustus.pl exonerate1.gff exonerate1.gff_hints prot
where exonerate1.gff was the output of run_exonerate_after_blast.pl, and exonerate1.gff_hints is the output file.

Likewise, if you used run_exonerate_after_blast.pl to align ESTs to a genome, you can convert the output into hints for Augustus using:
% perl -w make_exonerate_hints_for_augustus.pl exonerate2.gff exonerate2.gff_hints est

You can then concatenate the different hints files to give to Augustus:
% cat exonerate1.gff_hints exonerate2.gff_hints > all_hints
You can concatenate this with RNA-seq hints too, if you have them. 

Note that the make_exonerate_hints_for_augustus.pl script labels the protein hints as source=P, and the EST hints as source=E. It infers 'exonpart' hints and 'intron' hints from the EST alignments, and 'CDSpart' and 'intron' hints from the protein alignments. The whole exon/CDS from exonerate is not used for an exonpart/CDSpart hint; 9-bp is subtracted from each end, to take into account that the ends of the boundaries of the exonerate exon/CDS features might not be 100% accurate. If you want to use these hints in Augustus, you will have to have weights for bonuses and maluses for these sources of evidence in the Augustus configuration file (.cfg file, see below).

Running Augustus
I used the following steps to run Augustus, based on helpful advice from Jason Tsai and Magdalena Zarowiecki:

(i) Edit the parameters settings file
There is a lot of useful information about this file here. By default, a file config/extrinsic/extrinsic.M.RM.E.W.cfg comes with Augustus, that defines what are the 'bonuses' and 'maluses' to give to each type of 'hint'. You can edit this file if you like. At the top of the file it gives the data sources that are being used, for example:
 [SOURCES]
M RM E W

where 'M' = manual annotation, 'R' = retroposed genes, 'E' = EST/cDNA database hit, 'W' = RNA-Seq coverage information. If you wanted to you could add extra source types to the '[SOURCES]' part of your configuration file, eg. 'P' for protein database hit. (Note: I think 'RM' means repeat-masking here.)

In my case I used the file /nfs/helminths/users/jit/bin/augustus.2.6.1/extrinsic.M.RM.E.W.cfg that was edited by my colleague Jason Tsai. The difference between this file and the default one that comes with Augustus is the line that says what is the bonus/malus for an 'exonpart' hint. Jason's file says:
exonpart        1     .992  M    1  1e+100  RM  1     1    E 1    1    W 1  1e5
The default file that comes with Augustus says:
exonpart        1     .992  M    1  1e+100  RM  1     1    E 1    1    W 1  1.005
Note the second and third columns give the bonus and malus for an 'exonpart' hint, 1 is the bonus and 0.992 is the malus. The bonus is given to a predicted feature that obeys the hint, while the malus is a penalty given to a predicted feature not supported by any hint. A bonus of 1 effectively means no bonus, while a malus of 1 effectively means no penalty. An 'exonpart' bonus only applies to the exons that contain the interval from the hint (just overlapping means no bonus is given). The Augustus README.TXT suggests that a malus of close to 1 should be used, eg. 0.99 (here 0.992 is used).

Starting in column 4, you can tell Augustus how to modify the bonus depending on the source of the hint. The format for this part is:
source      malus_factor      bonus_factor
where malus_factor is the factor a malus will be multiplied by, for hints of this source,
bonus_factor is the factor a bonus will be multiplied by, for hints of this source.
This allows us to multiply the 'exonpart' bonuses for exonpart hints from one source (eg. RNA-seq) by a different factor compared to exonpart hints from another source (eg. protein alignment).

The two files differ in the bonus and malus for the RNA-seq coverage information (the 'W' columns). In Jason's file, he has 1e5, which means that the bonus for 'exonpart' hints from RNA-seq data are multipled by 1e5. This means that they are given a lot of weight. That is, predicted exons that contain the interval from an RNA-seq hint are given a lot of weight, and likely to be used in an Augustus gene prediction.

There is more information about this parameter settings file here.

To add another source of information, eg. 'P' for protein alignments, we can change the top of the config file to be:
[SOURCES]
M RM E W P

Then we have to add weights for the bonuses and maluses for these types of alignments to the '[GENERAL]' part of the file:
   exonpart      1     .992   M    1  1e+100  RM  1     1    E 1    1      W 1  1e5   P 1   1
     intron        1      .34    M     1  1e+100  RM  1     1    E 1    1e5  W 1    1    P 1   1e3
    CDSpart    1  1 0.985  M    1  1e+100  RM  1     1    E 1    1      W 1    1    P 1   1e3

This will give factor 1e5 to introns from RNA-seq or EST alignments, and 1e3 to introns from protein alignment.

(ii) Run Augustus using the 'hints', eg.
% /nfs/users/nfs_a/alc/Documents/bin/augustus/bin/augustus SRATT.V5.fa --AUGUSTUS_CONFIG_PATH=/nfs/helminths/users/jit/bin/augustus.2.6.1/config/ --extrinsicCfgFile=/nfs/helminths/users/jit/bin/augustus.2.6.1/extrinsic.M.RM.E.W.cfg --species=Sratti.final --alternatives-from-evidence=true --hintsfile=merged.intronhints.cuffmergeEPhint.gff --allow_hinted_splicesites=atac --min_intron_len=15
where /nfs/users/nfs_a/alc/Documents/bin/augustus/bin/augustus is the installed version of Augustus (2.6.1 here), SRATT.V5.fa is the genome assembly fasta file, /nfs/helminths/users/jit/bin/augustus.2.6.1/config/ is the directory containing the trained versions of Augustus, /nfs/helminths/users/jit/bin/augustus.2.6.1/extrinsic.M.RM.E.W.cfg is the file with the bonuses and maluses for hints, Sratti.final is the name of the trained version of Augustus, --alternatives-from-evidence=true means that alternatives splices are predicted if they are supported by evidence, merged.intronhints.cuffmergeEPhint.gff is the hints file, --allow_hinted_splicesites=atac lets Augustus predicted AC--AC introns (as well as GT--AG introns), -min_intron_len=15 specifies that introns should be at least 15 bp.

In my case this took about 5 hours to run, and needed about 2000 Mbyte of RAM.
The output from Augustus is written to standard output [note to self: this goes in the bsub .o output].

(iii) Run Augustus without 'hints'. This was suggested by my colleagues Magdalena Zarowiecki and Jason Tsai, who found that Augustus predicts few genes that are not supported by hints when you run it in the 'hints' mode. You can run Augustus without hints like this:
% /nfs/users/nfs_a/alc/Documents/bin/augustus/bin/augustus SRATT.V5.fa --AUGUSTUS_CONFIG_PATH=/nfs/helminths/users/jit/bin/augustus.2.6.1/config/ --extrinsicCfgFile=/nfs/helminths/users/jit/bin/augustus.2.6.1/extrinsic.M.RM.E.W.cfg --species=Sratti.final --alternatives-from-evidence=true --allow_hinted_splicesites=atac --min_intron_len=15
In my case this took about 1.5 hours and needed 1000 Mbyte of RAM.

Post-processing the Augustus output:
(i) Make a gff file from the Augustus output file run using hints, eg.
% grep AUGUSTUS bsub_hints.o > Sratti.hints.augustus.gff
where bsub_hints.o is the standard output file from the Augustus job with hints [run on the Sanger farm in my case].

(ii) Make a gff file from the Augustus output file run without hints, eg.
% grep AUGUSTUS bsub_nohints.o > Sratti.nohints.augustus.gff
where bsub_nohints.o is the standard output from the Augustus job without hints.

(iii) Combine the predictions with and without hints:
To do this, we first make files with just the coordinates of predicted genes, eg.
% grep AUGUSTUS bsub.o | grep -v transcript > Sratti.hints.augustus.genes.gff
% grep AUGUSTUS bsub.o | grep -v transcript > Sratti.nohints.augustus.genes.gff

Then we can find genes in Sratti.nohints.augustus.genes.gff that don't overlap any gene in Sratti.hints.augustus.genes.gff using BEDTools, eg.:
% /nfs/users/nfs_a/alc/Documents/bin/bedtools-2.17.0/bin/intersectBed -v -a Sratti.nohints.augustus.gff -b Sratti.hints.augustus.gff > Sratti.nonoverlapping.gff

Then we need to add back the 'CDS', 'gene', 'intron', 'start_codon', 'stop_codon' and 'transcript' lines:
% perl -w add_back_lines.pl bsub_hints.o Sratti.hints.augustus.gff > Sratti.hints.augustus2.gff
% perl -w add_back_lines.pl bsub_nohints.o Sratti.nonoverlapping.gff > Sratti.nonoverlapping2.gff
where add_back_lines.pl is a script that I wrote to do this.

Then we need to renumber the genes in Sratti.nonoverlapping2.gff. Gene numbers are up to 12154 in Sratti.hints.augustus2.gff. So need to number genes from 12155 upwards in Sratti.nonoverlapping2.gff:
% perl -w renumber_genes.pl Sratti.nonoverlapping2.gff 12154 > Sratti.nonoverlapping3.gff
where renumber_genes.pl is a script that I wrote to do this.

Now add together the files Sratti.hints.augustus2.gff and Sratti.nonoverlapping3.gff:
% cat Sratti.hints.augustus2.gff Sratti.nonoverlapping3.gff > Sratti.gff

(iv) Infer the spliced transcript sequences, and protein sequences from the gff:
We can infer the spliced transcript sequences from the gff using:
% perl -w get_spliced_transcripts_from_gff.pl Sratti.genes.gff SRATT.V5.fa Sratti.spliced.fa . no no yes
where get_spliced_transcripts_from_gff.pl is a Perl script that I wrote to do this, SRATT.V5.fa is the assembly fasta file, and Sratti.spliced.fa is the output file of spliced transcript sequences.

Then we can translate the transcripts:
% perl -w translate_spliced_dna.pl Sratti.spliced.fa Sratti.translated.fa .
where translate_spliced_dna.pl is a Perl script that I wrote, and Sratti.translated.fa is the output file of protein sequences.

(v) Make embl files from the Augustus gff:
If you want to view the Augustus predictions in Artemis, you can do this by first converting the embl gff file to embl files (one per scaffold):
% perl -w gff_to_embl.pl Sratti.genes.gff SRATT.V5.fa . no no no
where gff_to_embl.pl is a script that I wrote to do this, and SRATT.V5.fa is the genome fasta file.

This will make one embl file per scaffold. You can then open one of the embl files in Artemis by typing:
% art Sratt_Chr00_000001.embl [note to self: first need to do: 'ssh -Y pcs4']
If there are two alternative splice forms of the same gene, these will appear as two separate 'genes' in Artemis, for example:







and








Other Augustus options
% augustus --paramlist  : give full list of Augustus options
% augustus --genemodel=partial : allow prediction of partial genes [default]
% augustus --complete : only predict complete genes
% augustus --singlestrand=true : predicts genes on each strand independently, allowing overlapping genes between the two strands [by default, does not allow overlapping genes on opposite strands].
% augustus --noInFrameStop=true : do not report transcripts that contain internal stop codons. By default, Augustus doesn't have this option turned on, and this means that it sometimes reports transcripts that contain internal stop codons, that occur across exon-exon boundaries.

Thank you to Magdalena Zarowiecki and Jason Tsai, who have both helped me a lot with Augustus.

3 comments:

Anonymous said...

Hi Avril, I am following your steps to run Augustus, I find your blog really helpful.
May I know if you can share the perl script "cufflinkGTF2augustusExonParthints.pl"?
Thanks

Anand said...

Is it possible to change malus value when one is NOT using RNA-Seq as hints?

If yes, in which file would it be that I would be changing malus value? I ask because under /share/apps/augustus-3.0/config/extrinsic
there are multiple files that each contain a place for malus value.

Thank you!

Unknown said...

there is an --outfile=FILE option to specify an output file