Thursday, 20 February 2014

Retrieving the protein sequences for a species from an Ensembl Compara database

Today I wanted to retrieve the protein sequences for Schistosoma mansoni from an Ensembl Compara database that my colleagues Alessandra Traini and Eleanor Stanley built for a project in our group.

I knew that the S. mansoni genome had been loaded into the database as an assembly ('schistosoma_mansoni') and a gff file of gene predictions.


It took me a while to figure out how to retrieve all the sequences of S. mansoni proteins from the database, ie. the S. mansoni proteome. Here goes (using the Ensembl 74 Compara Perl API):
(I have replaced the details of our local database with 'xxx's, to keep it private)

BEGIN {
        unshift (@INC, qw(/nfs/users/nfs_b/bh4/apps/ensembl-74/src/ensembl/modules /nfs/users/nfs_b/bh4/apps/ensembl-74/src/ensembl-compara/modules));
}

use strict;
use warnings;
use Bio::EnsEMBL::Compara::DBSQL::DBAdaptor;

my $num_args               = $#ARGV + 1;
if ($num_args != 1)
{
    print "Usage of get_proteome_for_cestode_species.pl\n\n";
    print "perl get_proteome_for_cestode_species.pl <species>\n";
    print "where <species> is the species name (eg. schistosoma_mansoni)\n";
    print "For example:\n";
    print "perl -w get_proteome_for_cestode_species.pl schistosoma_mansoni\n";
    exit;
}

# FIND THE NAME OF THE SPECIES OF INTEREST:                    

my $species                = $ARGV[0];

#------------------------------------------------------------------#

my $comparadb = new Bio::EnsEMBL::Compara::DBSQL::DBAdaptor (-host => 'xxx', -port => 3378, -user => 'xxx', -pass => 'xxx', -dbname => 'xxx');

# first get the taxon_id for species $species (eg. schistosoma_mansoni)
my $species_taxon_id = 'none';
my $gda = $comparadb->get_adaptor("GenomeDB");
my @genomes = @{$gda->fetch_all()};
foreach my $genome (@genomes){
   my $genome_name = $genome->name(); # eg. homo_sapiens
   my $taxon = $genome->taxon()->name(); # eg. Homo sapiens
   my $taxon_id = $genome->taxon_id();
   if ($genome_name eq $species)
   {
      if ($species_taxon_id ne 'none') { print STDERR "ERROR: already have species_taxon_id $species_taxon_id\n"; exit;}
      $species_taxon_id = $taxon_id;
   }
}
if ($species_taxon_id eq 'none') { print STDERR "ERROR: did not find species_taxon_id for $species\n"; exit;}

# get the Bio::EnsEMBL::Compara::SeqMember objects for the species $species
my $pma = $comparadb->get_adaptor("SeqMember");
my @proteins = @{$pma->fetch_all_by_source_taxon("Ensemblpep", $species_taxon_id)};

foreach my $protein (@proteins) {
    my $stable_id = $protein->stable_id();
    if (defined($protein->sequence()))
    {
       my $sequence = $protein->sequence();
       print ">$stable_id\n";
       print "$sequence\n";
    }
    else { print STDERR "ERROR: do not know sequence for $stable_id\n"; exit;}
}



Getting all protein sequences, versus the 'canonical' sequences
In Compara, if there are multiple splice-forms for a gene, one is chosen randomly as the 'canonical' protein to represent that gene. To get all the protein sequences for all splice-forms for all genes, you need to use (see above):
my $pma = $comparadb->get_adaptor("SeqMember");
my @proteins = @{$pma->fetch_all_by_source_taxon("Ensemblpep", $species_taxon_id)};

foreach my $protein (@proteins) {
    my $stable_id = $protein->stable_id();
    if (defined($protein->sequence()))
    {
       my $sequence = $protein->sequence();
       print ">$stable_id\n";
       print "$sequence\n";
    }
    else { print STDERR "ERROR: do not know sequence for $stable_id\n"; exit;}
}

To just get the protein sequences for the canonical splice-forms for all genes (one per gene), you need to use:
my $gma = $comparadb->get_adaptor("GeneMember");
my @genes = @ { $gma->fetch_all_by_source_taxon("ENSEMBLGENE", $species_taxon_id) } ;

foreach my $gene (@genes) {
   my $stable_id = $gene->stable_id();
   my $canonical = $gene->get_canonical_SeqMember();
   if (defined($canonical->sequence()))
   {
      my $sequence = $canonical->sequence();
      print ">$stable_id\n";
      print "$sequence\n";
   }
   else { print "ERROR: no sequence for gene $stable_id\n"; exit;}
}

Note that the second version gets a 'GeneMember' object rather than a 'SeqMember' object (this is because a GeneMember object has a 'canonical Seqmember' attached to it, but a SeqMember object doesn't.