Fabian Schreiber and Mateus Patricio at the EBI are now in charge of the TreeFam database, and are currently building TreeFam-9. They plan to provide new tools that make the TreeFam database easier to access.
Up until now, the easiest way to retrieve data from TreeFam has been to use Perl scripts. These scripts can either use the TreeFam Perl API, or connect directly to the MySQL database and query it (using the Perl DBI module with its MySQL driver). They work with all versions of the TreeFam database up to and including version 8 (but will not work with versions 9 and later, for which there will be a new Perl interface).
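As a rough illustration of the "connect directly and query" route, here is a minimal sketch using DBI. It is not taken from any of the scripts listed below. The 'fam_genes' table is mentioned later in this post, but the column names ID and AC, the empty password, and all of the connection details are assumptions that you would need to replace with the real values for your TreeFam installation.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build a DBI data source name for the MySQL driver (DBD::mysql).
sub build_dsn {
    my ($database, $host, $port) = @_;
    return "dbi:mysql:database=$database;host=$host;port=$port";
}

# Return the gene identifiers listed for one family.
# ASSUMPTION: the column names ID (gene) and AC (family accession) are
# guesses for illustration; check them against the real schema.
sub genes_in_family {
    my ($dbh, $family) = @_;
    my $sth = $dbh->prepare('SELECT ID FROM fam_genes WHERE AC = ?');
    $sth->execute($family);
    my @genes;
    while (my ($id) = $sth->fetchrow_array) {
        push @genes, $id;
    }
    return @genes;
}

# Only connect when all five connection details are given on the command line:
#   perl sketch.pl <database> <host> <port> <user> <family>
if (@ARGV == 5) {
    require DBI;    # loaded here so the subs above can be read without DBI installed
    my ($database, $host, $port, $user, $family) = @ARGV;
    # ASSUMPTION: an account with an empty password
    my $dbh = DBI->connect(build_dsn($database, $host, $port), $user, '',
                           { RaiseError => 1 });
    print "$_\n" for genes_in_family($dbh, $family);
    $dbh->disconnect;
}
```

Using a placeholder (`?`) in the query rather than pasting the family name into the SQL string avoids quoting problems if an identifier contains unusual characters.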
Here are some of the scripts that I've written in the past to retrieve data from the TreeFam database in this way. Note that some of them haven't been tested for a while (I've marked these with *):
[Note also that it is likely that new scripts will be available soon for analysing the TreeFam-9 database and later releases; keep an eye on the TreeFam website.]
TreeFam families
list_treefam_families.pl (*): makes a list of all families in the TreeFam mysql database
treefam_release2.pl (*): prints out the total number of genes in families, and the total number of families, in a particular TreeFam release
TreeFam families for genes
find_treefam_for_schisto_gene2.pl (*): given a list of Schistosoma mansoni genes, connects to the TreeFam database to find out which families they are in
find_treefam_with_Ce_Bm.pl (*): finds TreeFam families that have Caenorhabditis elegans and Brugia
malayi genes, and prints out the number of genes from each species in
the trees for each of those families
list_treefam_genes3.pl (*): connects to the TreeFam mysql database, and prints out a list of
Caenorhabditis elegans and C. briggsae genes in TreeFam
families
find_simple_treefam_families4.pl (*): connects to the TreeFam mysql database, and retrieves all families that
have just one human, one rat, one chicken, one Caenorhabditis elegans,
and one Drosophila melanogaster gene (as well as possible additional
genes from other species)
treefam_4_genes.pl (*): prints out all the genes in a particular TreeFam family
TreeFam species
store_treefam_species.pl (*): retrieves a list of all the fully sequenced species that are in the TreeFam database, and stores them in a Perl pickle
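Several of the scripts on this page store their results in a "Perl pickle", that is, a Perl data structure saved to disk so that a later script can reload it without querying the database again. One way to do this in Perl is the core Storable module; the sketch below assumes Storable, which may not be what the scripts themselves use, and the file name and species list are made up for illustration.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempdir);

# Illustrative data only: a short list of species names.
my @species = ('Caenorhabditis elegans', 'Caenorhabditis briggsae', 'Homo sapiens');

# Write the list to disk (a temporary directory here, removed on exit).
my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/treefam_species.sto";    # hypothetical file name
store(\@species, $file);

# A later script could reload the same structure like this.
my $reloaded = retrieve($file);
print "$_\n" for @$reloaded;
```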
Protein sequences for TreeFam families
get_treefam_family_seqs.pl: gets protein sequences for all families in a particular version of the database
get_treefam_family_seqs2.pl (*): prints out all the protein sequences in a particular TreeFam family
Protein alignments for TreeFam families
get_treefam_alns.pl: gets protein alignments (in cigar format) for all families in a particular version of the database
translate_treefam_cigars_to_alns.pl: translates cigar-format alignments for families to fasta-format alignments
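To give an idea of what translating a cigar string into an alignment row involves, here is a minimal sketch (not the code of translate_treefam_cigars_to_alns.pl itself). It assumes a run-length cigar in which 'M' means "copy this many residues from the sequence", 'D' means "insert this many gap characters", and a missing count means 1; the sequence and cigar in the example are invented.

```perl
use strict;
use warnings;

# Expand one cigar string into a gapped (aligned) sequence.
# ASSUMPTION: only 'M' (residues) and 'D' (gaps) operations occur.
sub cigar_to_aligned {
    my ($cigar, $seq) = @_;
    my ($aligned, $pos) = ('', 0);
    while ($cigar =~ /(\d*)([MD])/g) {
        my $len = $1 eq '' ? 1 : $1;
        if ($2 eq 'M') {
            $aligned .= substr($seq, $pos, $len);   # copy residues
            $pos += $len;
        }
        else {
            $aligned .= '-' x $len;                 # insert gaps
        }
    }
    # The cigar should account for every residue of the sequence.
    die "cigar uses $pos residues but sequence has " . length($seq) . "\n"
        unless $pos == length $seq;
    return $aligned;
}

# Print one fasta-format alignment row.
print ">gene1\n", cigar_to_aligned('3M2D2M', 'MKLVA'), "\n";   # row is MKL--VA
```

Applying this to every sequence in a family gives rows of equal length, which together make up the fasta-format alignment.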
Nucleotide alignments for TreeFam families
store_treefam_full_nt_alns.pl (*) : retrieves the full nucleotide alignments for TreeFam families, and stores them in a Perl pickle
find_human_paralog_alns.pl (*): for a particular TreeFam family of interest (that has human paralogs), prints out the DNA alignment for the family, with the positions of introns shown with respect to the DNA alignment [note that this script happens to be written for the case of a family that has human paralogs, but could in fact be used for any family]
HMMs for TreeFam families
make_aln_and_hmm_for_treefam_family.pl: for a particular family, retrieves the protein sequences from the database, aligns them with MUSCLE, and builds an HMM using HMMER
translate_treefam_cigars_to_hmms.pl: for a particular family, reads in a cigar-format alignment for the family, and makes an HMM for the family using HMMER
map_introns_to_HMM.pl (*): reads in an alignment file corresponding to an HMM, retrieves the positions of introns in the genes from the TreeFam database, and figures out the positions of the introns with respect to the HMM columns
Conservation of intron-exon structure
find_intron_cons_treefam_ortholog.pl (*): given a gff file for Caenorhabditis elegans, finds the fraction of
introns in each Caenorhabditis elegans gene that are shared in position
in the ortholog (in TreeFam) in C. briggsae, human or yeast.
Finding potential gene prediction errors using TreeFam
find_gene_pred_errors1.pl (*): uses TreeFam to find cases where two adjacent genes in a species (e.g. Caenorhabditis elegans) should probably be merged, because one of them has its best match to the first part of a TreeFam family alignment, and the other has its best match to the second part of the same family's alignment
badgenes_in_alns2.pl (*): reads in a fasta-format alignment, and finds sequences that align to <x% of the alignment length
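The check described for badgenes_in_alns2.pl can be sketched roughly as follows. This is an illustrative reimplementation rather than the script itself, and it assumes that gaps are written as '-' or '.' and that "aligns to less than x%" means that a sequence's non-gap residues make up less than x% of the alignment length.

```perl
use strict;
use warnings;

# Parse fasta-format text into (name => sequence), joining wrapped lines.
sub parse_fasta {
    my ($text) = @_;
    my (%seq, $name);
    for my $line (split /\n/, $text) {
        $line =~ s/\s+$//;
        if ($line =~ /^>(\S+)/) { $name = $1; $seq{$name} = ''; }
        elsif (defined $name)   { $seq{$name} .= $line; }
    }
    return %seq;
}

# Names of sequences whose non-gap residues cover less than $min_pc
# percent of the alignment length.
sub short_sequences {
    my ($min_pc, %seq) = @_;
    my @short;
    for my $name (sort keys %seq) {
        my $aln_len = length $seq{$name};
        next if $aln_len == 0;
        my $residues = ($seq{$name} =~ tr/-.//c);   # count non-gap characters
        push @short, $name if 100 * $residues / $aln_len < $min_pc;
    }
    return @short;
}

# Invented three-sequence alignment; g2 has 2 residues in 6 columns.
my %aln = parse_fasta(">g1\nMKLVAT\n>g2\nM----T\n>g3\nMK-VAT\n");
print join(',', short_sequences(50, %aln)), "\n";   # prints g2
```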
Retrieving trees for TreeFam families
store_treefam_trees.pl (*): retrieves all trees from the TreeFam database, and stores them in a Perl pickle
get_trees2.pl (*): gets the TreeFam clean tree for a family
parse_treefam_bioperl.pl (*): connects to the TreeFam mysql database, and parses the trees using Bioperl
Topologies of TreeFam trees
treefam_dog_man_mouse.pl (*): connects to the TreeFam mysql database, and prints out a list of the TreeFam trees that have each of the possible topologies with respect to the relationship between man, dog, and mouse
Orthologs for TreeFam genes
get_orthologs8.pl (*): retrieves all TreeFam trees, and infers orthologs between a pair of species based on the trees, by finding cases where the last common ancestor node of two genes from different species is a speciation node [note that this does not take orthologs from the 'ortholog' table of the TreeFam database, but instead infers them from the TreeFam trees themselves]
store_treefam_orthostrap.pl (*): retrieves orthology bootstrap values for ortholog pairs for a particular
pair of species from the TreeFam database, and stores the orthology
bootstrap values in a Perl pickle
check_if_have_treefam_ortholog.pl (*): given a gff file of Caenorhabditis elegans genes, finds their C. briggsae, human and yeast orthologs from TreeFam
find_pc_id_to_treefam_ortholog.pl (*): given a gff file of Caenorhabditis elegans genes, retrieves their
orthologs in C. briggsae, human and yeast from the TreeFam database, and
finds the percent identity between each C. elegans gene and each of its
orthologs
get_orths_from_newick_tree.pl (*): reads a Newick tree file from TreeFam, and prints out the orthologs of a particular input gene based on the tree
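The tree-based rule used by get_orthologs8.pl (two genes from different species are orthologs if their last common ancestor is a speciation node) can be sketched as below. This is a simplified illustration rather than the script's own code: the tree is built by hand as nested hashes instead of being parsed from a tree file, and the 'dup', 'children', 'gene' and 'species' keys are names invented for this example.

```perl
use strict;
use warnings;

# Walk the tree bottom-up. Returns the leaves under $node, and appends
# "gene1<TAB>gene2" to @$pairs for every pair of leaves (one from $sp1,
# one from $sp2) whose last common ancestor is a speciation node.
sub collect_orthologs {
    my ($node, $sp1, $sp2, $pairs) = @_;
    return ($node) unless $node->{children};    # a leaf
    my @child_leaves =
        map { [ collect_orthologs($_, $sp1, $sp2, $pairs) ] } @{ $node->{children} };
    unless ($node->{dup}) {                     # speciation node
        # Leaves under different children have this node as their
        # last common ancestor.
        for my $i (0 .. $#child_leaves) {
            for my $j (0 .. $#child_leaves) {
                next if $i == $j;
                for my $x (@{ $child_leaves[$i] }) {
                    next unless $x->{species} eq $sp1;
                    for my $y (@{ $child_leaves[$j] }) {
                        push @$pairs, "$x->{gene}\t$y->{gene}"
                            if $y->{species} eq $sp2;
                    }
                }
            }
        }
    }
    return map { @$_ } @child_leaves;
}

# Invented tree: a duplication followed by a speciation in each copy.
my $tree = {
    dup      => 1,
    children => [
        { dup => 0, children => [ { gene => 'humanA', species => 'HUMAN' },
                                  { gene => 'mouseA', species => 'MOUSE' } ] },
        { dup => 0, children => [ { gene => 'humanB', species => 'HUMAN' },
                                  { gene => 'mouseB', species => 'MOUSE' } ] },
    ],
};
my @pairs;
collect_orthologs($tree, 'HUMAN', 'MOUSE', \@pairs);
print "$_\n" for @pairs;    # humanA/mouseA and humanB/mouseB
```

In this example humanA and mouseB are not reported, because their last common ancestor is the duplication node at the root.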
Paralogs for TreeFam genes
find_schisto_paralogs.pl (*): given an nhx-format tree file for a TreeFam family, finds Schistosoma mansoni/S. japonicum/Nematostella vectensis paralog pairs, and gives the ancestral taxon in which the duplication giving rise to the paralogs occurred
find_human_paralogs.pl (*): retrieves trees from the TreeFam database, and infers human within-species paralogs from the trees
find_closest_worm_paralogs3.pl (*): given a file of Caenorhabditis elegans paralog pairs, analyses TreeFam trees to find the pairs of C. elegans paralogs in families that are separated by the fewest edges in the trees
find_closest_worm_paralogs4.pl (*): given a file of Caenorhabditis elegans paralog pairs, finds the
bootstrap value for the clade defined by the last common ancestor of the
two paralogs
count_worm_paralogs2c.pl (*): given a list of Caenorhabditis elegans paralog pairs, uses the TreeFam
tree that they are in to calculate information about the paralogs and
the tree
check_if_adjacent_genes_are_paralogs.pl (*): given a gff file of Caenorhabditis elegans genes, uses TreeFam to check whether adjacent genes are paralogs
treefam_flatworm.pl (*): connects to the TreeFam mysql database, and finds Schistosoma mansoni genes that are single-copy, but that have multiple orthologs in most other animals
treefam_flatworm2.pl (*): connects to the TreeFam mysql database, and finds Schistosoma mansoni
genes that are multi-copy, but that have just one or two orthologs in
most other animals
treefam_flatworm3.pl (*): connects to the TreeFam mysql database, and retrieves within-species
Schistosoma mansoni paralogs from the 'ortholog' table of the database.
Singleton genes in TreeFam families
get_singletons.pl (*): identifies singleton genes in a species, by finding genes from that
species that appear in TreeFam families that do not have any other genes
from that species
find_simple_families3.pl (*): connects to the TreeFam mysql database, and retrieves all families that
have just one human, one rat, one chicken, one Caenorhabditis elegans,
and one Drosophila melanogaster gene (as well as possible additional
genes from other species)
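Both scripts in this section come down to counting genes per species within each family. A minimal sketch of that counting, assuming the (family, gene, species) rows have already been fetched from the database, might look like this; the family names, gene names and species codes are invented for illustration.

```perl
use strict;
use warnings;

# Count genes per species in each family. Each row is [family, gene, species].
sub species_counts {
    my (@rows) = @_;
    my %count;
    $count{ $_->[0] }{ $_->[2] }++ for @rows;
    return \%count;
}

# Families with exactly one gene from every species in @required
# (genes from other species are allowed).
sub simple_families {
    my ($count, @required) = @_;
    my @fams = grep {
        my $fam = $_;
        !grep { ($count->{$fam}{$_} // 0) != 1 } @required;
    } keys %$count;
    return sort @fams;
}

# Genes from $species that are the only gene from that species in their family.
sub singletons {
    my ($species, $count, @rows) = @_;
    return map { $_->[1] }
           grep { $_->[2] eq $species && $count->{ $_->[0] }{$species} == 1 } @rows;
}

# Invented example rows.
my @rows = (
    [ 'famA', 'h1', 'HUMAN' ], [ 'famA', 'w1', 'CAEEL' ], [ 'famA', 'w2', 'CAEEL' ],
    [ 'famB', 'h2', 'HUMAN' ], [ 'famB', 'w3', 'CAEEL' ],
);
my $count = species_counts(@rows);
print join(',', simple_families($count, 'HUMAN', 'CAEEL')), "\n";   # prints famB
print join(',', singletons('CAEEL', $count, @rows)), "\n";          # prints w3
```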
Location of orthologs for TreeFam genes
treefam_synteny3.pl (*): retrieves orthologs for a particular pair of species from the TreeFam
database, and checks whether the ortholog pair in the two species is
flanked by left-hand and right-hand neighbours that are also orthologous
get_chroms_from_treefam.pl (*): reads in a list of 2-C.elegans-to-1-C.briggsae orthologs, and finds
cases where the two Caenorhabditis elegans genes are on different
chromosomes, with one on an autosome and one X-linked
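The neighbour check described for treefam_synteny3.pl can be sketched along these lines. This is an illustration under simplifying assumptions rather than the script's own logic: orthology is treated as a one-to-one lookup from species 1 genes to species 2 genes, each species' genes are given as a single ordered list, and an inverted arrangement (left neighbour matching right neighbour) is also counted as conserved.

```perl
use strict;
use warnings;

# Return the (left, right) neighbours of $gene in an ordered gene list,
# or an empty list if the gene is missing or at either end.
sub neighbours {
    my ($order, $gene) = @_;
    my ($i) = grep { $order->[$_] eq $gene } 0 .. $#$order;
    return unless defined $i && $i > 0 && $i < $#$order;
    return ($order->[$i - 1], $order->[$i + 1]);
}

# True (1) if the ortholog pair ($g1, $g2) is flanked on both sides by
# genes that are also orthologous to each other.
# ASSUMPTION: $orth maps each species 1 gene to a single species 2 gene.
sub is_syntenic {
    my ($orth, $order1, $order2, $g1, $g2) = @_;
    my @n1 = neighbours($order1, $g1);
    my @n2 = neighbours($order2, $g2);
    return 0 unless @n1 == 2 && @n2 == 2;
    my $same = sub { my ($x, $y) = @_; return ($orth->{$x} // '') eq $y; };
    return ( ($same->($n1[0], $n2[0]) && $same->($n1[1], $n2[1]))      # same orientation
          || ($same->($n1[0], $n2[1]) && $same->($n1[1], $n2[0])) )    # inverted
        ? 1 : 0;
}

# Invented example: the b/c/a block is inverted in species 2.
my %orth   = (a1 => 'a2', b1 => 'b2', c1 => 'c2', d1 => 'd2');
my @order1 = qw(a1 b1 c1 d1);
my @order2 = qw(c2 b2 a2 x2);
print is_syntenic(\%orth, \@order1, \@order2, 'b1', 'b2'), "\n";   # prints 1
print is_syntenic(\%orth, \@order1, \@order2, 'c1', 'c2'), "\n";   # prints 0 (c2 has no left neighbour)
```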
Identifying gene losses in TreeFam trees
treefam_gene_losses.pl (*): identifies gene losses in human since divergence from chimp, in the trees for TreeFam-A families (does not analyse TreeFam-B families at present)
treefam_4_losses.pl (*): prints out all the gene losses identified in a particular TreeFam family based on the tree for the family
Inferring features of ancestral nodes in trees
treefam_infer_ancestral_features.pl (*): given a list of TreeFam families, and a file of features of the sequences in the trees (e.g. Pfam domains or GO terms), infers the likely features of the ancestral nodes in the trees for the families
treefam_infer_ancestral_GOids.pl (*): given a file with GO annotations for sequences in TreeFam families, and a
list of families, infers the GO annotations for ancestral nodes in the
trees for those families
treefam_infer_ancestral_GOids3.pl (*): given a file with GO annotations for sequences in TreeFam families, and a list of families, infers the GO annotations for ancestral nodes in the trees for those families [uses a different algorithm from treefam_infer_ancestral_GOids.pl]
Checks on the TreeFam database (of most use to TreeFam developers)
treefam_overlaps2.pl (*): identifies genes that appear in more than one TreeFam-A seed tree.
treefam_QC1.pl (*): finds TreeFam proteins that have a strong match to a family in the
TreeFam mysql hmmer_matches table, but where the gene was not added to
the 'fam_genes' table for the family or to any family
treefam_QC2.pl (*): finds TreeFam families that are lacking a tree in the 'trees' table of the TreeFam mysql database
treefam_QC3.pl (*): finds cases where a TreeFam family is listed in the TreeFam mysql database, but has no genes listed in the 'fam_genes' table
treefam_QC4.pl (*): finds TreeFam proteins that have a match to a family in the TreeFam
mysql 'hmmer_matches' table, but where the gene does not appear in the
'genes' table
treefam_QC5.pl (*): finds cases where a full gene set for a species was loaded into the
TreeFam mysql database, but no genes from that species were added to
families
treefam_QC6.pl (*): finds cases where more than one alternative splice form from the same gene was added to a family
treefam_QC7.pl (*): finds cases where different alternative splice forms of the same gene do not have unique transcript ids in the 'genes' table of the TreeFam mysql database
treefam_QC8.pl (*): finds cases where a transcript listed in the 'genes' table of the
TreeFam mysql database lacks any amino acid sequence in the 'aa_seq'
table, or lacks a DNA sequence in the 'nt_seq' table
treefam_QC9.pl (*): finds TreeFam transcripts that appear in the 'fam_genes' table of the
TreeFam mysql database, but do not appear in the 'genes' table
treefam_QC10.pl (*): finds TreeFam proteins that were added to a particular family, but actually have a stronger hmmer match to a different family
treefam_QC11.pl (*): finds cases where different alternative splice forms of the same gene were put into different families, even though those splice forms overlap substantially at the DNA level
treefam_QC12.pl (*): checks for cases where a TreeFam family seems to have disappeared from a
particular version of TreeFam, even though it was present in the
previous version of TreeFam and has not been curated since