Friday, 1 March 2013

Perl scripts for retrieving data from the TreeFam database

Fabian Schreiber and Mateus Patricio at the EBI are now in charge of the TreeFam database, and are currently building TreeFam-9. They are going to provide new tools that make the TreeFam database easier to access.

Up until now, the easiest way to retrieve data from TreeFam has been to use Perl scripts. These scripts can either use the TreeFam Perl API, or connect directly to the mysql database and query it (using the Perl DBI module for mysql). They work with all versions of the TreeFam database up to and including version 8 (but will not work with versions 9 and later, for which there will be a new Perl interface).

Here are some of the scripts that I've written in the past to retrieve data from the TreeFam database in this way. Note that some of them haven't been tested for a while (I've marked these with *):
[Note also that it is likely that new scripts will be available soon for analysing the TreeFam-9 database and later releases; keep an eye on the TreeFam website.]

TreeFam families
list_treefam_families.pl (*): makes a list of all families in the TreeFam mysql database
treefam_release2.pl (*): prints out the total number of genes in families, and the total number of families, in a particular TreeFam release

TreeFam families for genes
find_treefam_for_schisto_gene2.pl (*): given a list of Schistosoma mansoni genes, connects to the TreeFam database to find out which families they are in
find_treefam_with_Ce_Bm.pl  (*): finds TreeFam families that have Caenorhabditis elegans and Brugia malayi genes, and prints out the number of genes from each species in the trees for each of those families
list_treefam_genes3.pl (*): connects to the TreeFam mysql database, and prints out a list of Caenorhabditis elegans and C. briggsae genes in TreeFam families
find_simple_treefam_families4.pl  (*): connects to the TreeFam mysql database, and retrieves all families that have just one human, one rat, one chicken, one Caenorhabditis elegans, and one Drosophila melanogaster gene (as well as possible additional genes from other species)
treefam_4_genes.pl (*): prints out all the genes in a particular TreeFam family

TreeFam species
store_treefam_species.pl (*): retrieves a list of all the fully sequenced species that are in the TreeFam database, and stores them in a Perl pickle

Protein sequences for TreeFam families
get_treefam_family_seqs.pl : get protein sequences for all families in a particular version of the database
get_treefam_family_seqs2.pl  (*): prints out all the protein sequences in a particular TreeFam family

Protein alignments for TreeFam families
get_treefam_alns.pl : get protein alignments (in cigar format) for all families in a particular version of the database
translate_treefam_cigars_to_alns.pl : translate cigar-format alignments for families to fasta-format alignments
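
To illustrate what translating a cigar-format alignment involves, here is a minimal Python sketch (not the Perl script itself). It assumes cigar strings in the Ensembl Compara style, where "M" consumes residues from the ungapped protein sequence, "D" stands for gap columns, and a missing count means 1; the function names are hypothetical.

```python
import re


def expand_cigar(seq, cigar):
    """Return the gapped alignment row for an ungapped sequence and its cigar string.

    Assumed convention: 'M' = residues taken from seq, 'D' = gap columns,
    and an omitted count means 1 (for example 'MD2M').
    """
    if not re.fullmatch(r"(?:\d*[MD])+", cigar):
        raise ValueError("unexpected cigar string: %r" % cigar)
    row = []
    pos = 0
    for count, op in re.findall(r"(\d*)([MD])", cigar):
        n = int(count) if count else 1
        if op == "M":
            row.append(seq[pos:pos + n])
            pos += n
        else:
            row.append("-" * n)
    # The cigar string must use up the sequence exactly.
    if pos != len(seq):
        raise ValueError("cigar consumes %d residues, sequence has %d" % (pos, len(seq)))
    return "".join(row)


def cigars_to_fasta(records):
    """records: iterable of (name, ungapped_sequence, cigar) tuples."""
    return "".join(">%s\n%s\n" % (name, expand_cigar(seq, cigar))
                   for name, seq, cigar in records)
```

For example, expand_cigar("MKVLA", "2M3D3M") gives the row "MK---VLA".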

Nucleotide alignments for TreeFam families
store_treefam_full_nt_alns.pl (*) : retrieves the full nucleotide alignments for TreeFam families, and stores them in a Perl pickle
find_human_paralog_alns.pl (*): for a particular TreeFam family of interest (that has human paralogs), prints out the DNA alignment for the family, with the positions of introns shown with respect to the DNA alignment [note that this script happens to be written for the case of a family that has human paralogs, but could in fact be used for any family]

HMMs for TreeFam families
make_aln_and_hmm_for_treefam_family.pl : for a particular family, retrieves the protein sequences from the database, aligns them with muscle, and builds an HMM using hmmer
translate_treefam_cigars_to_hmms.pl : for a particular family, reads in a cigar-format alignment for the family, and makes an HMM for the family using hmmer
map_introns_to_HMM.pl (*): reads in an alignment file corresponding to an HMM, retrieves the positions of introns in the genes from the TreeFam database, and figures out the positions of introns with respect to the HMM columns

Conservation of intron-exon structure 
find_intron_cons_treefam_ortholog.pl (*): given a gff file for Caenorhabditis elegans, finds the fraction of introns in each Caenorhabditis elegans gene that are shared in position in the ortholog (in TreeFam) in C. briggsae, human or yeast.

Finding potential gene prediction errors using TreeFam 
find_gene_pred_errors1.pl (*): uses TreeFam to find cases where two adjacent genes in a species (eg. Caenorhabditis elegans) should probably be merged, as one of them has its best match to the first part of a TreeFam family alignment, and the second has its best match to the second part of the same TreeFam family's alignment
badgenes_in_alns2.pl (*): reads in a fasta-format alignment, and finds sequences that align to <x% of the alignment length
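
The check for poorly aligned sequences can be sketched in a few lines of Python. This is only an illustration: it assumes that "aligns to <x% of the alignment length" means the number of non-gap characters in a row divided by the number of alignment columns, and the function names are hypothetical.

```python
def read_fasta(lines):
    """Parse FASTA-format lines into a dict of name -> sequence."""
    seqs = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            seqs[name] = []
        elif name is not None:
            seqs[name].append(line)
    return {n: "".join(parts) for n, parts in seqs.items()}


def poorly_aligned(aln, min_percent):
    """Return (name, percent) for rows covering less than min_percent of the columns."""
    lengths = {len(s) for s in aln.values()}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("rows must be non-empty and all of the same length")
    aln_len = lengths.pop()
    flagged = []
    for name, row in aln.items():
        covered = sum(1 for c in row if c not in "-.")  # count non-gap columns
        percent = 100.0 * covered / aln_len
        if percent < min_percent:
            flagged.append((name, percent))
    return flagged
```

With the two rows "MKVL" and "M---" and a cutoff of 50, only the second row is reported, at 25%.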

Retrieving trees for TreeFam families
store_treefam_trees.pl (*): retrieves all trees from the TreeFam database, and stores them in a Perl pickle
get_trees2.pl (*): gets the TreeFam clean tree for a family
parse_treefam_bioperl.pl (*): connects to the TreeFam mysql database, and parses the trees using Bioperl

Topologies of TreeFam trees
treefam_dog_man_mouse.pl (*): connects to the TreeFam mysql database, and prints out a list of the TreeFam trees that contain the different possible topologies with respect to the relationship between man, dog, and mouse
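
As a rough illustration of classifying a tree by topology, the Python sketch below finds which two of the three species group together in a rooted tree. It assumes the tree is given as nested tuples whose leaves are species names, with one gene per species of interest; both the representation and the function names are hypothetical and are not taken from the Perl script.

```python
def leaf_set(tree):
    """Return the set of leaf labels below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return {tree}
    labels = set()
    for child in tree:
        labels |= leaf_set(child)
    return labels


def sister_pair(tree, taxa=("man", "dog", "mouse")):
    """Return the two taxa that form a clade excluding the third, or None.

    None is returned when no clade contains exactly two of the three taxa,
    for example when they sit in an unresolved trifurcation.
    """
    wanted = set(taxa)

    def visit(node):
        if isinstance(node, str):
            return None
        for child in node:
            found = visit(child)
            if found:
                return found
        present = leaf_set(node) & wanted
        if len(present) == 2:
            return tuple(sorted(present))
        return None

    return visit(tree)
```

For the tree (("man", "dog"), "mouse") this reports that dog and man group together; for (("man", "fly"), ("dog", "mouse")) it reports dog and mouse.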

Orthologs for TreeFam genes
get_orthologs8.pl (*): retrieves all TreeFam trees, and infers orthologs between a pair of species based on the tree, by finding cases where the last common ancestor node of two genes from different species is a speciation node [note that this does not take orthologs from the 'ortholog' table of the TreeFam database, but instead infers them from the TreeFam trees themselves]
store_treefam_orthostrap.pl (*): retrieves orthology bootstrap values for ortholog pairs for a particular pair of species from the TreeFam database, and stores the orthology bootstrap values in a Perl pickle
check_if_have_treefam_ortholog.pl (*): given a gff file of Caenorhabditis elegans genes, finds their C. briggsae, human and yeast orthologs from TreeFam
find_pc_id_to_treefam_ortholog.pl  (*): given a gff file of Caenorhabditis elegans genes, retrieves their orthologs in C. briggsae, human and yeast from the TreeFam database, and finds the percent identity between each C. elegans gene and each of its orthologs
get_orths_from_newick_tree.pl  (*): read a Newick tree file from TreeFam, and print out the orthologs of a particular input gene based on the tree
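
The rule used for tree-based ortholog inference (two genes from different species are orthologs if their last common ancestor is a speciation node) can be sketched in Python as follows. The Node class and function names are hypothetical, and the sketch assumes each internal node already carries a duplication/speciation label, as NHX-format trees do.

```python
from dataclasses import dataclass, field
from itertools import combinations


@dataclass
class Node:
    name: str = ""
    species: str = ""
    duplication: bool = False      # True for duplication nodes, False for speciation
    children: list = field(default_factory=list)


def leaves(node):
    """Return all leaf nodes below (or equal to) node."""
    if not node.children:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def infer_orthologs(root, species_a, species_b):
    """Return sorted (gene_a, gene_b) pairs whose last common ancestor is a speciation node."""
    pairs = set()
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        if not node.children or node.duplication:
            continue
        # Two leaves have this node as last common ancestor exactly when
        # they lie in different child subtrees of it.
        for left, right in combinations([leaves(c) for c in node.children], 2):
            for x in left:
                for y in right:
                    if x.species == species_a and y.species == species_b:
                        pairs.add((x.name, y.name))
                    elif x.species == species_b and y.species == species_a:
                        pairs.add((y.name, x.name))
    return sorted(pairs)
```

A duplication at the root followed by a speciation in each copy gives two separate ortholog pairs, whereas a speciation at the root followed by a duplication in one species gives one-to-many orthologs.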

Paralogs for TreeFam genes
find_schisto_paralogs.pl  (*): given a nhx-format tree file for a tree for a TreeFam family, finds Schistosoma mansoni/S. japonicum/Nematostella vectensis paralog pairs, and gives the ancestral taxon in which the duplication giving rise to the paralogs occurred
find_human_paralogs.pl  (*): retrieves trees from the TreeFam database, and infers human within-species paralogs from the trees
find_closest_worm_paralogs3.pl (*): given a file of Caenorhabditis elegans paralog pairs, analyses TreeFam trees to find the pairs of C. elegans paralogs in families that are separated by the least number of edges in the trees
find_closest_worm_paralogs4.pl (*): given a file of Caenorhabditis elegans paralog pairs, finds the bootstrap value for the clade defined by the last common ancestor of the two paralogs
count_worm_paralogs2c.pl (*): given a list of Caenorhabditis elegans paralog pairs, uses the TreeFam tree that they are in to calculate information about the paralogs and the tree
check_if_adjacent_genes_are_paralogs.pl  (*): given a gff file of Caenorhabditis elegans genes, uses TreeFam to check whether adjacent genes are paralogs
treefam_flatworm.pl (*): connects to the TreeFam mysql database, and finds Schistosoma mansoni genes that are single-copy, that have multiple orthologs in most other animals
treefam_flatworm2.pl (*): connects to the TreeFam mysql database, and finds Schistosoma mansoni genes that are multi-copy, but that have just one or two orthologs in most other animals
treefam_flatworm3.pl  (*): connects to the TreeFam mysql database, and retrieves within-species Schistosoma mansoni paralogs from the 'ortholog' table of the database.
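
Finding the paralog pairs separated by the fewest edges in a tree comes down to measuring the path length between two leaves. Here is a minimal Python sketch; it assumes the tree is available as a child-to-parent dictionary, which is a hypothetical representation chosen for brevity.

```python
def path_to_root(node, parent):
    """Return the list of nodes from node up to the root, inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path


def edge_distance(a, b, parent):
    """Number of edges on the path between nodes a and b."""
    steps_from_a = {n: i for i, n in enumerate(path_to_root(a, parent))}
    for j, n in enumerate(path_to_root(b, parent)):
        if n in steps_from_a:          # first shared ancestor = last common ancestor
            return steps_from_a[n] + j
    raise ValueError("nodes are not in the same tree")


def closest_pairs(pairs, parent):
    """Return the (a, b, distance) tuples with the smallest edge distance."""
    scored = [(edge_distance(a, b, parent), a, b) for a, b in pairs]
    if not scored:
        return []
    best = min(d for d, _, _ in scored)
    return [(a, b, d) for d, a, b in scored if d == best]
```

In a tree where w1 and w2 share a parent that joins w3 at the root, w1 and w2 are 2 edges apart and w1 and w3 are 3 edges apart, so the first pair is returned.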

Singleton genes in TreeFam families
get_singletons.pl (*): identifies singleton genes in a species, by finding genes from that species that appear in TreeFam families that do not have any other genes from that species
find_simple_families3.pl (*): connects to the TreeFam mysql database, and retrieves all families that have just one human, one rat, one chicken, one Caenorhabditis elegans, and one Drosophila melanogaster gene (as well as possible additional genes from other species) 
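
The definition of a singleton used here (a gene whose family contains no other gene from the same species) is easy to express in Python. The sketch below assumes the family membership has already been fetched as (family, gene, species) rows; the row layout and function name are hypothetical.

```python
from collections import defaultdict


def find_singletons(rows, species):
    """Return sorted (family, gene) pairs where the family has exactly one gene from species.

    rows: iterable of (family, gene, species) tuples.
    """
    by_family = defaultdict(list)
    for family, gene, sp in rows:
        if sp == species:
            by_family[family].append(gene)
    return sorted((family, genes[0])
                  for family, genes in by_family.items()
                  if len(genes) == 1)
```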

Location of orthologs for TreeFam genes
treefam_synteny3.pl (*): retrieves orthologs for a particular pair of species from the TreeFam database, and checks whether the ortholog pair in the two species is flanked by left-hand and right-hand neighbours that are also orthologous
get_chroms_from_treefam.pl (*): reads in a list of 2-C.elegans-to-1-C.briggsae orthologs, and finds cases where the two Caenorhabditis elegans genes are on different chromosomes, with one on an autosome and one X-linked
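
The neighbour test for conserved gene order can be sketched in Python as below. The sketch assumes that gene order is supplied as a dictionary from chromosome to an ordered list of genes, and that orthologs are a set of (gene_a, gene_b) pairs. It also accepts an inverted arrangement, where the left-hand neighbour in one species matches the right-hand neighbour in the other; that is my assumption and may not be what the Perl script does.

```python
def neighbours(gene, order):
    """Return (left, right) neighbours of gene; None where there is no neighbour.

    order: dict of chromosome -> list of genes in positional order.
    """
    for genes in order.values():
        if gene in genes:
            i = genes.index(gene)
            left = genes[i - 1] if i > 0 else None
            right = genes[i + 1] if i + 1 < len(genes) else None
            return left, right
    return None, None


def is_syntenic(a, b, order_a, order_b, orthologs):
    """True if the ortholog pair (a, b) is flanked on both sides by orthologous genes."""
    left_a, right_a = neighbours(a, order_a)
    left_b, right_b = neighbours(b, order_b)

    def orth(x, y):
        return x is not None and y is not None and (x, y) in orthologs

    same_orientation = orth(left_a, left_b) and orth(right_a, right_b)
    inverted = orth(left_a, right_b) and orth(right_a, left_b)
    return same_orientation or inverted
```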

Identifying gene losses in TreeFam trees
treefam_gene_losses.pl (*): identifies gene losses in human since divergence from chimp, in the trees for TreeFam-A families (does not analyse TreeFam-B families at present)
treefam_4_losses.pl (*): prints out all the gene losses identified in a particular TreeFam family based on the tree for the family

Inferring features of ancestral nodes in trees
treefam_infer_ancestral_features.pl (*): given a list of TreeFam families, and a file of features of the sequences in trees (eg. Pfam domains or GO terms), infers the likely features of the ancestral nodes in the trees for the families
treefam_infer_ancestral_GOids.pl (*): given a file with GO annotations for sequences in TreeFam families, and a list of families, infers the GO annotations for ancestral nodes in the trees for those families
treefam_infer_ancestral_GOids3.pl (*): given a file with GO annotations for sequences in TreeFam families, and a list of families, infers the GO annotations for ancestral nodes in the trees for those families [uses a different algorithm than treefam_infer_ancestral_GOids.pl]

Checks on the TreeFam database (of most use to TreeFam developers)

treefam_overlaps2.pl (*): identifies genes that appear in more than one TreeFam-A seed tree.
treefam_QC1.pl (*): finds TreeFam proteins that have a strong match to a family in the TreeFam mysql hmmer_matches table, but where the gene was not added to the 'fam_genes' table for the family or to any family
treefam_QC2.pl (*): finds TreeFam families that are lacking a tree in the 'trees' table of the TreeFam mysql database
treefam_QC3.pl  (*): finds cases where a TreeFam family is listed in the TreeFam mysql database, but has no genes listed in the 'fam_genes' table
treefam_QC4.pl  (*): finds TreeFam proteins that have a match to a family in the TreeFam mysql 'hmmer_matches' table, but where the gene does not appear in the 'genes' table
treefam_QC5.pl (*): finds cases where a full gene set for a species was loaded into the TreeFam mysql database, but no genes from that species were added to families
treefam_QC6.pl (*): finds cases where more than one alternative splice form from the same gene was added to a family
treefam_QC7.pl (*): finds cases where different alternative splice forms of the same gene do not have unique transcript ids in the 'genes' table of the TreeFam mysql database
treefam_QC8.pl (*): finds cases where a transcript listed in the 'genes' table of the TreeFam mysql database lacks any amino acid sequence in the 'aa_seq' table, or lacks a DNA sequence in the 'nt_seq' table
treefam_QC9.pl (*): finds TreeFam transcripts that appear in the 'fam_genes' table of the TreeFam mysql database, but do not appear in the 'genes' table
treefam_QC10.pl  (*): finds TreeFam proteins that were added to a particular family, but actually have a stronger hmmer match to a different family
treefam_QC11.pl  (*): finds cases where different alternative splice forms of the same gene were put into different families, but those alternative splice forms overlap a lot at the DNA level
treefam_QC12.pl (*): checks for cases where a TreeFam family seems to have disappeared from a particular version of TreeFam, even though it was present in the previous version of TreeFam and has not been curated since
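
Several of these checks come down to finding rows in one table that have no partner in another, which is a LEFT JOIN followed by an IS NULL filter. The Python sketch below shows this pattern for the treefam_QC9.pl case, using an in-memory SQLite database so that it runs without a TreeFam server. The table names 'fam_genes' and 'genes' come from the descriptions above, but the column names (transcript_id, gene_id, family_ac) and the toy data are assumptions for illustration only, not the real TreeFam schema.

```python
import sqlite3


def transcripts_missing_from_genes(conn):
    """Return transcript ids that appear in fam_genes but not in genes."""
    sql = """
        SELECT DISTINCT fg.transcript_id
        FROM fam_genes AS fg
        LEFT JOIN genes AS g ON g.transcript_id = fg.transcript_id
        WHERE g.transcript_id IS NULL
        ORDER BY fg.transcript_id
    """
    return [row[0] for row in conn.execute(sql)]


def demo_connection():
    """Build a toy database with hypothetical columns, for illustration only."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE genes (transcript_id TEXT, gene_id TEXT);
        CREATE TABLE fam_genes (family_ac TEXT, transcript_id TEXT);
        INSERT INTO genes VALUES ('T1', 'G1');
        INSERT INTO genes VALUES ('T2', 'G2');
        INSERT INTO fam_genes VALUES ('TF1', 'T1');
        INSERT INTO fam_genes VALUES ('TF1', 'T3');
        INSERT INTO fam_genes VALUES ('TF2', 'T2');
    """)
    return conn
```

In the toy data, transcript T3 is listed in a family but is absent from the 'genes' table, so it is the only one reported.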
