Friday, 2 December 2016

GTF versus GFF

The GFF format is described here.
The entry for one gene is something like this:
# Gene gene:HmN_000672600
pathogens_HYM_scaffold_0001     WormBase_imported       gene    599     742     .       +       .       ID=gene:HmN_000672600;Name=HmN_000672600;biotype=protein_coding
pathogens_HYM_scaffold_0001     WormBase_imported       mRNA    599     742     .       +       .       ID=transcript:HmN_000672600.1;Parent=gene:HmN_000672600;Name=HmN_000672600.1
pathogens_HYM_scaffold_0001     WormBase_imported       exon    599     742     .       +       .       ID=exon:HmN_000672600.1.1;Parent=transcript:HmN_000672600.1
pathogens_HYM_scaffold_0001     WormBase_imported       CDS     599     742     .       +       0       ID=cds:HmN_000672600.1;Parent=transcript:HmN_000672600.1

You can see that the mRNA, exon and CDS lines have a 'Parent=' tag in the last column, which says which gene (for the mRNA line) or which transcript (for the exon and CDS lines) they belong to. There are separate 'gene', 'mRNA', 'exon' and 'CDS' lines.
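
To make the 'Parent=' link concrete, here is a minimal Python sketch that splits the last column of a GFF line into tag=value pairs. The function name parse_gff3_attributes is my own (hypothetical), the sketch assumes the columns are tab-separated as in a real GFF3 file, and it does not handle escaped characters or multiple comma-separated parents.

```python
def parse_gff3_attributes(column9):
    """Split a GFF3 attribute column (tag=value pairs separated by ';') into a dict."""
    attributes = {}
    for pair in column9.strip().split(";"):
        if not pair:
            continue  # skip empty pieces, e.g. after a trailing ';'
        tag, _, value = pair.partition("=")
        attributes[tag.strip()] = value.strip()
    return attributes


# The exon line from the example above, with tab-separated columns.
exon_line = (
    "pathogens_HYM_scaffold_0001\tWormBase_imported\texon\t599\t742\t.\t+\t.\t"
    "ID=exon:HmN_000672600.1.1;Parent=transcript:HmN_000672600.1"
)
fields = exon_line.split("\t")
exon_attributes = parse_gff3_attributes(fields[8])
print(fields[2], "belongs to", exon_attributes["Parent"])
```

Following the 'Parent' value of the exon gives the transcript, and following the 'Parent' value of that transcript's mRNA line gives the gene.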

The GTF format is slightly different. Here is an example for the same gene as above:
pathogens_HYM_scaffold_0001     WormBase_imported       transcript      599     742     .       +       .       gene_id "gene:HmN_000672600"; transcript_id "transcript:HmN_000672600.1"
pathogens_HYM_scaffold_0001     WormBase_imported       exon    599     742     .       +       .       gene_id "HmN_000672600.1"; transcript_id "transcript:HmN_000672600.1"; exon_number "1"

Here we just have 'transcript' and 'exon' lines, and the tag at the end gives the gene_id and transcript_id, so there is no need for 'Parent=' tags.
Here is a description of GTF: here.
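
For comparison, the last column of a GTF line can be read with a small Python sketch like the one below. The function name parse_gtf_attributes is hypothetical, and the sketch assumes every value is in double quotes, as in the example lines above; unquoted values would be missed.

```python
import re


def parse_gtf_attributes(column9):
    """Parse GTF attributes of the form tag "value"; tag "value" into a dict."""
    # Each match is a (tag, value) pair: a run of non-space characters
    # followed by a double-quoted value.
    return dict(re.findall(r'(\S+)\s+"([^"]*)"', column9))


gtf_column9 = 'gene_id "gene:HmN_000672600"; transcript_id "transcript:HmN_000672600.1"'
gtf_attributes = parse_gtf_attributes(gtf_column9)
print(gtf_attributes["gene_id"], gtf_attributes["transcript_id"])
```

Each line carries its own gene_id and transcript_id, so there is no need to look up a parent line as in GFF.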

Converting GFF to GTF:
My colleague Adam Reid has written a nice Perl script for converting GFF to GTF, e.g.:
% perl /nfs/users/nfs_a/ar11/scripts/gff3wormbase2gtf.pl  emultilocularis.gff3 > emultilocularis.gtf
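
I have not reproduced Adam's script here, but the general idea of such a conversion can be sketched in Python as below. This is only an illustration under my own assumptions: the function name gff3_to_gtf_lines is hypothetical, columns are tab-separated, each exon has a single parent, only 'mRNA' and 'exon' lines are converted, and exons are numbered in file order. The output of the real script may differ in detail.

```python
def gff3_to_gtf_lines(gff3_lines):
    """Convert GFF3 mRNA/exon lines to GTF transcript/exon lines (illustrative only)."""

    def attributes_of(column9):
        pairs = column9.strip().split(";")
        return dict(pair.split("=", 1) for pair in pairs if "=" in pair)

    # Keep only 9-column feature lines; skip comments and blank lines.
    rows = [line.rstrip("\n").split("\t") for line in gff3_lines
            if line.strip() and not line.startswith("#")]
    rows = [row for row in rows if len(row) == 9]

    # First pass: map each transcript ID to its parent gene ID.
    gene_of = {}
    for row in rows:
        if row[2] == "mRNA":
            attrs = attributes_of(row[8])
            gene_of[attrs["ID"]] = attrs["Parent"]

    # Second pass: write transcript and exon lines with GTF-style tags.
    exon_count = {}
    gtf_lines = []
    for row in rows:
        attrs = attributes_of(row[8])
        if row[2] == "mRNA":
            transcript_id = attrs["ID"]
            tags = f'gene_id "{gene_of[transcript_id]}"; transcript_id "{transcript_id}"'
            gtf_lines.append("\t".join(row[:2] + ["transcript"] + row[3:8] + [tags]))
        elif row[2] == "exon":
            transcript_id = attrs.get("Parent")
            if transcript_id not in gene_of:
                continue  # exon whose transcript was not seen; skipped in this sketch
            exon_count[transcript_id] = exon_count.get(transcript_id, 0) + 1
            tags = (f'gene_id "{gene_of[transcript_id]}"; '
                    f'transcript_id "{transcript_id}"; '
                    f'exon_number "{exon_count[transcript_id]}"')
            gtf_lines.append("\t".join(row[:8] + [tags]))
    return gtf_lines


# The GFF entry from the top of this post, with tab-separated columns.
prefix = "pathogens_HYM_scaffold_0001\tWormBase_imported\t"
example_gff3 = [
    "# Gene gene:HmN_000672600",
    prefix + "gene\t599\t742\t.\t+\t.\tID=gene:HmN_000672600;Name=HmN_000672600;biotype=protein_coding",
    prefix + "mRNA\t599\t742\t.\t+\t.\tID=transcript:HmN_000672600.1;Parent=gene:HmN_000672600;Name=HmN_000672600.1",
    prefix + "exon\t599\t742\t.\t+\t.\tID=exon:HmN_000672600.1.1;Parent=transcript:HmN_000672600.1",
    prefix + "CDS\t599\t742\t.\t+\t0\tID=cds:HmN_000672600.1;Parent=transcript:HmN_000672600.1",
]
example_gtf = gff3_to_gtf_lines(example_gff3)
for gtf_line in example_gtf:
    print(gtf_line)
```

The two passes are needed because an exon line only names its transcript, so the gene has to be looked up through the transcript's own 'Parent=' tag.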
