The TreeFam database includes multiple alignments (and phylogenetic trees based on those alignments) for all animal gene families, about 16,000 families in total. This can be a useful starting point for an evolutionary analysis of which gene families are present in which species, and how those families have evolved over time.
The data in TreeFam is stored in a MySQL database. You can either connect to the MySQL database and search it directly, or use Perl scripts to query it.
I've recently written a Perl script, get_treefam_alns.pl, that retrieves the multiple alignments for the proteins in all ~16,000 TreeFam families from the TreeFam MySQL database, for a particular version of the database (e.g. the script works for versions up to the latest version, TreeFam-8). The alignments are retrieved in cigar format, which is the format in which they are stored in the database.
To convert the cigar-format alignments to fasta-format multiple alignments, you then need to run my Perl script translate_treefam_cigars_to_alns.pl on the output of get_treefam_alns.pl. This gives you fasta-format multiple alignments for all ~16,000 TreeFam families.
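To give a rough idea of what the cigar-to-fasta conversion involves, here is a minimal Python sketch. It is not the Perl script itself, and it rests on an assumption about the format: that each cigar string is a run of 'M' (aligned residues) and 'D' (gap columns) operations, each with an optional count where a missing count means 1. The function names cigar_to_aligned and to_fasta are hypothetical, chosen just for this illustration.

```python
import re

# Assumption: cigar strings contain only 'M' (residues taken from the
# sequence) and 'D' (gap columns), each optionally preceded by a count.
CIGAR_TOKEN = re.compile(r"(\d*)([MD])")


def cigar_to_aligned(seq, cigar, gap="-"):
    """Expand an unaligned sequence into one gapped alignment row."""
    # Reject anything that is not a clean run of <count><M|D> tokens.
    if CIGAR_TOKEN.sub("", cigar) != "":
        raise ValueError("unexpected characters in cigar: %r" % cigar)
    pieces = []
    pos = 0  # index of the next unused residue in seq
    for count, op in CIGAR_TOKEN.findall(cigar):
        n = int(count) if count else 1
        if op == "M":
            if pos + n > len(seq):
                raise ValueError("cigar needs more residues than seq has")
            pieces.append(seq[pos:pos + n])
            pos += n
        else:  # 'D': insert n gap characters, consume no residues
            pieces.append(gap * n)
    if pos != len(seq):
        raise ValueError("cigar does not use the whole sequence")
    return "".join(pieces)


def to_fasta(rows, width=60):
    """Format a {name: aligned_sequence} dict as fasta text."""
    lines = []
    for name, aligned in rows.items():
        lines.append(">" + name)
        # Wrap each aligned sequence at `width` characters per line.
        lines.extend(aligned[i:i + width] for i in range(0, len(aligned), width))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    row = cigar_to_aligned("MKTAYL", "2M3D4M")
    print(to_fasta({"example_protein": row}))
```

With these assumptions, the sequence MKTAYL with the cigar string 2M3D4M expands to MK---TAYL. Every row of a family should come out the same length, which makes a quick sanity check on the output.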