Friday 17 June 2016

Scripts for gene statistics based on a gff

I've written some scripts to calculate gene statistics based on a gff file, here's how they work (to remind myself)..

(1) Given a protein fasta file, find the mean and median protein length:
First you get the longest transcript for each gene:

This needs a list of the transcript names:
eg.
% grep mRNA augustus3.c.gff3 | grep 'Parent=g' | tr ':' ' ' | tr ';' ' ' | cut -f9 | tr "=" ' ' | awk '{ print $2"\t"$4 }' > augustus3.transcripts 
16,767 transcripts

Then:
[script uses python2 because of biopython version]
% python2 /nfs/users/nfs_a/alc/Documents/git/Python/take_longest_spliceform_only.py aug.c3.aa augustus3.transcripts augustus3.longest.pep
12,405 transcripts

Then:
[script uses python2 because of biopython version]
% python2 ~alc/Documents/git/Python/calc_mean_median_protein_len.py augustus3.longest.pep
Mean length=510.306812, median length=336.000000

(2) Given the gff file, calculate the intron statistics:
First make a gff file with the longest transcript:
% python2 /nfs/users/nfs_a/alc/Documents/git/Python/make_gff_for_spliceforms.py augustus3.longest.pep augustus3.c.gff3 augustus3.c.longest.gff3

Then calculate the total length of intron DNA, number of introns, mean and median intron length: % python3 ~alc/Documents/git/Python/calc_total_intron_len.py augustus3.c.longest.gff3
Num_scaffolds=483
Num_genes=12405
Num_cds_regions (coding exons)=74902
Total intron bases=156226973
Number of introns=62497
Median length of intron regions=1610.000000 bp
Mean length of intron regions=2501.492744 bp


(3) Given the gff file, calculate the coding exon statistics (total amount of bases in coding exons, number of coding exons, median and mean length of coding exons, mean and median coding exons per gene):

% python3 ~alc/Documents/git/Python/calc_total_exon_len.py augustus3.c.longest.gff3
Num_scaffolds=483
Num_genes=12405
Num_cds_regions (coding exons)=74902
Total coding exon (CDS) bases=19024352
Number of CDS regions=74902
Median length of CDS regions=165.000000 bp
Mean length of CDS regions=254.041868 bp
Mean number of CDS per gene=6.038049
Median number of CDS per gene=4.000000

No comments: