Tuesday 27 August 2013

Cleaning up maker output

My colleague Eleanor Stanley has been using a variety of scripts to clean up the output files from the Maker gene prediction software. She has bundled them together in a shell script (/nfs/users/nfs_e/es9/pipeline/clean_maker_genes.sh).

This script runs several scripts written by me and others, to clean up Maker's final output gff file (round3.nofasta.gff.gz).

The scripts it runs are:
(i) ~es9/pipeline/contaminants.sh - splits the gff file into contaminated and non-contaminated gff files [input: round3.nofasta.gff, output: no_contaminants.gff]
(ii) ~es9/pipeline/remove_contaminant_from_gff.py - removes contaminated contigs from the gff file
(iii) remove_modelgff_genes_from_gff.pl(my script) - removes genes that are remnants from round3 gene training [input: no_contaminants.gff, output: A.gff]
(iv) rename_genes_in_maker_gff.pl (my script) - renames remaining genes [input: A.gff, output: B.gff]
(v) remove_tiny_genes_from_gff.pl (my script) - removes tiny genes that encode proteins of less than 30 residues [input: B.gff, output: C.gff]
(vi) find_best_nonoverlapping_genes.pl (my script) - remove the lowest scoring (nearest to 1) gene in an overlapping set (0 is the best score) [input: C.gff, output: D.gff]
(vii) merge_overlapping_exons.pl (my script) - merge overlapping/consecutive exons [input: E.gff, output: E.gff]
(viii) rename_genes_in_maker_gff.pl (my script) - renames remaining genes [input: E.gff, output: final.gff]
(ix) ~es9/pipeline/make_embedded_gff.sh - run eval
(x) get_spliced_transcripts_from_gff.pl (my script) - make protein sequences [input: final.gff, output: transcript.fa]
(xi) translate_spliced_dna.pl (my script) - make protein sequences [input: transcript.fa, output: protein.fa]
(xii) ~es9/pipeline/intstop.sh - final internal stop codons, if there are any
(xiii) maker_gff_find_incorrect_gene_merges_splits.pl - report genes that are potential splits and merges

No comments: