In a GFF3 file, the entry for one gene looks something like this:
# Gene gene:HmN_000672600
pathogens_HYM_scaffold_0001 WormBase_imported gene 599 742 . + . ID=gene:HmN_000672600;Name=HmN_000672600;biotype=protein_coding
pathogens_HYM_scaffold_0001 WormBase_imported mRNA 599 742 . + . ID=transcript:HmN_000672600.1;Parent=gene:HmN_000672600;Name=HmN_000672600.1
pathogens_HYM_scaffold_0001 WormBase_imported exon 599 742 . + . ID=exon:HmN_000672600.1.1;Parent=transcript:HmN_000672600.1
pathogens_HYM_scaffold_0001 WormBase_imported CDS 599 742 . + 0 ID=cds:HmN_000672600.1;Parent=transcript:HmN_000672600.1
You can see that the mRNA, exon and CDS lines each have a 'Parent=' tag in the last column, which says which gene (for the mRNA line) or transcript (for the exon and CDS lines) they belong to. The entry has four kinds of line: 'gene', 'mRNA', 'exon' and 'CDS'.
The GTF format is slightly different. Here is an example for the same gene as above:
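As a rough illustration of how the 'Parent=' tag links lines together, here is a minimal Python sketch that splits the last column of a GFF3 line into its tags. The function name parse_gff3_attributes is made up for this example, and the sketch assumes the columns are tab-separated (as in a real GFF3 file, even though they show up as spaces above).

```python
def parse_gff3_attributes(column9):
    """Split a GFF3 attribute column such as 'ID=x;Parent=y' into a dict."""
    attributes = {}
    for field in column9.strip().split(";"):
        if field:  # skip empty fields left by a trailing ';'
            key, value = field.split("=", 1)
            attributes[key] = value
    return attributes


# The exon line from the example above, written with explicit tabs.
exon_line = "\t".join([
    "pathogens_HYM_scaffold_0001", "WormBase_imported", "exon",
    "599", "742", ".", "+", ".",
    "ID=exon:HmN_000672600.1.1;Parent=transcript:HmN_000672600.1",
])

columns = exon_line.split("\t")
attributes = parse_gff3_attributes(columns[8])
print(columns[2], "belongs to", attributes["Parent"])
# prints: exon belongs to transcript:HmN_000672600.1
```

Following the 'Parent=' value of the mRNA line in the same way leads from the transcript up to the gene.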
pathogens_HYM_scaffold_0001 WormBase_imported transcript 599 742 . + . gene_id "gene:HmN_000672600"; transcript_id "transcript:HmN_000672600.1"
pathogens_HYM_scaffold_0001 WormBase_imported exon 599 742 . + . gene_id "HmN_000672600.1"; transcript_id "transcript:HmN_000672600.1"; exon_number "1"
Here we just have 'transcript' and 'exon' lines, and the tags at the end of each line give the gene_id and transcript_id directly, so there is no need for 'Parent=' tags.
There is a description of the GTF format here.
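To show the difference in the last column, here is a small Python sketch that reads the GTF-style tags. The names GTF_PAIR and parse_gtf_attributes are invented for this example, and the sketch assumes every value is in double quotes, as in the lines above; unquoted values would be skipped.

```python
import re

# Matches one tag such as: gene_id "gene:HmN_000672600"
GTF_PAIR = re.compile(r'(\S+)\s+"([^"]*)"')


def parse_gtf_attributes(column9):
    """Turn 'gene_id "x"; transcript_id "y"' into {'gene_id': 'x', ...}."""
    return dict(GTF_PAIR.findall(column9))


# Last column of the exon line from the GTF example above.
attrs = parse_gtf_attributes(
    'gene_id "HmN_000672600.1"; '
    'transcript_id "transcript:HmN_000672600.1"; exon_number "1"'
)
print(attrs["gene_id"], attrs["transcript_id"], attrs["exon_number"])
```

Each line carries its own gene_id and transcript_id, so a line can be assigned to its gene and transcript without looking at any other line in the file.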
Converting GFF to GTF:
My colleague Adam Reid has written a nice Perl script for converting GFF to GTF, e.g.:
% perl /nfs/users/nfs_a/ar11/scripts/gff3wormbase2gtf.pl emultilocularis.gff3 > emultilocularis.gtf
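For readers without access to that script, the general idea can be sketched in Python as below. This is not Adam Reid's script and its output may differ from what that script produces; it is only a minimal sketch, assuming tab-separated GFF3 input in which each mRNA line comes before its exon and CDS lines and every feature has a single parent. The function names are made up for the example, and exon_number tags are not generated.

```python
def parse_attributes(column9):
    """Turn 'ID=x;Parent=y' into {'ID': 'x', 'Parent': 'y'}."""
    fields = [f for f in column9.strip().split(";") if f]
    return dict(field.split("=", 1) for field in fields)


def gff3_to_gtf(gff3_lines):
    """Yield GTF lines for the mRNA, exon and CDS features in gff3_lines."""
    gene_of = {}  # transcript ID -> gene ID, filled in from mRNA lines
    for line in gff3_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a feature line
        attributes = parse_attributes(columns[8])
        if columns[2] == "mRNA":
            transcript_id = attributes["ID"]
            gene_of[transcript_id] = attributes["Parent"]
            columns[2] = "transcript"
        elif columns[2] in ("exon", "CDS"):
            transcript_id = attributes["Parent"]
        else:
            continue  # 'gene' lines are dropped in this sketch
        # Assumes the mRNA line was seen earlier (see the lead-in).
        gene_id = gene_of[transcript_id]
        columns[8] = f'gene_id "{gene_id}"; transcript_id "{transcript_id}";'
        yield "\t".join(columns)


# The GFF3 entry from the top of this post, written with explicit tabs.
prefix = "pathogens_HYM_scaffold_0001\tWormBase_imported\t"
example = [
    "# Gene gene:HmN_000672600",
    prefix + "gene\t599\t742\t.\t+\t.\tID=gene:HmN_000672600;Name=HmN_000672600",
    prefix + "mRNA\t599\t742\t.\t+\t.\tID=transcript:HmN_000672600.1;Parent=gene:HmN_000672600",
    prefix + "exon\t599\t742\t.\t+\t.\tID=exon:HmN_000672600.1.1;Parent=transcript:HmN_000672600.1",
    prefix + "CDS\t599\t742\t.\t+\t0\tID=cds:HmN_000672600.1;Parent=transcript:HmN_000672600.1",
]
for gtf_line in gff3_to_gtf(example):
    print(gtf_line)
```

The key step is the gene_of lookup: the 'Parent=' chain (exon to transcript to gene) is resolved once, and the result is written out on every line as gene_id and transcript_id.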