Friday, 9 November 2012

Getting alignments for all animal gene families from the TreeFam database

The TreeFam database includes multiple alignments (and phylogenetic trees based on those alignments) for all animal gene families, about 16,000 gene families in total. This can be a useful starting point for an evolutionary analysis of which gene families are present in which species, and how those gene families have evolved over time.

The data in TreeFam is stored in a MySQL database. You can either connect to the MySQL database and search it directly, or use Perl scripts to query it.

I've recently written a Perl script that you can use to retrieve the multiple alignments for the proteins in all ~16,000 TreeFam families from the TreeFam MySQL database, for a particular version of the TreeFam database (the script works for versions up to the latest version, TreeFam-8). The multiple alignments are retrieved in cigar format, which is the format in which they are stored in the database.

To convert the cigar-format alignments to fasta-format multiple alignments, you will then need to run a second Perl script of mine on the output of the first script. This will give you nice fasta-format multiple alignments for all ~16,000 TreeFam families.
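To give an idea of what the cigar-to-fasta conversion involves, here is a minimal sketch in Python. It is not my Perl script, just an illustration. It assumes a cigar string made up of 'M' (a run of residues taken from the unaligned protein sequence) and 'D' (a run of gap columns), where a missing count means 1; please check this against the cigar strings in the TreeFam version you are using. The function names cigar_to_aligned and to_fasta, and the example sequences, are made up for this illustration.

```python
import re


def cigar_to_aligned(seq, cigar):
    """Expand a cigar string against an unaligned sequence.

    Assumed convention (check it against your data):
      'M' = take the next residues from the sequence,
      'D' = insert gap characters ('-'),
      a missing count means 1 (so 'MD2M' is the same as '1M1D2M').
    """
    if not re.fullmatch(r"(?:\d*[MD])+", cigar):
        raise ValueError("unexpected cigar string: %r" % cigar)
    parts = []
    pos = 0  # position of the next unused residue in seq
    for count, op in re.findall(r"(\d*)([MD])", cigar):
        n = int(count) if count else 1
        if op == "M":
            parts.append(seq[pos:pos + n])
            pos += n
        else:
            parts.append("-" * n)
    # The 'M' runs should use up the sequence exactly.
    if pos != len(seq):
        raise ValueError("cigar string does not match the sequence length")
    return "".join(parts)


def to_fasta(records, width=60):
    """Format (identifier, sequence, cigar) tuples as a fasta-format alignment."""
    lines = []
    for identifier, seq, cigar in records:
        aligned = cigar_to_aligned(seq, cigar)
        lines.append(">" + identifier)
        for start in range(0, len(aligned), width):
            lines.append(aligned[start:start + width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up example records, not real TreeFam data.
    example = [
        ("protein1", "MKTAYL", "2M3D4M"),
        ("protein2", "MKTGGAYLA", "9M"),
    ]
    print(to_fasta(example), end="")
```

Every protein in a family should come out at the same aligned length. If one does not, that is a sign that the cigar convention assumed here does not match the data.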

