Friday 28 June 2019

Retrieving data from WormBase using the REST API

I recently wrote a blog post on using the WormBase ParaSite REST API to retrieve data from their database. For C. elegans data, it's best to use the main WormBase website, and there is also a WormBase REST API.

My problem today was to retrieve phenotype information (from RNAi, mutants) for C. elegans genes. Ages ago, I wrote a blog post doing this by querying the WormBase website through their standard webpage. But, let's modernise and use the REST API instead!

I found that it's possible to do a REST API query to get the phenotypes for a gene, by typing for example:
for gene WBGene00000079.

Python scripts to retrieve phenotypes for a gene, or a list of genes
I also wrote a Python script to do the same query using Python, see here. When you run it you should see:
% python3
id= WBPhenotype:0001952 label= germline nuclear positioning variant
id= WBPhenotype:0001969 label= germ cell compartment morphology variant
id= WBPhenotype:0000186 label= oogenesis variant
id= WBPhenotype:0000640 label= egg laying variant
id= WBPhenotype:0001940 label= rachis morphology variant
id= WBPhenotype:0001980 label= germ cell compartment expansion variant
id= WBPhenotype:0000638 label= molt defect
id= WBPhenotype:0000154 label= reduced brood size
id= WBPhenotype:0001973 label= germ cell compartment size variant
id= WBPhenotype:0000059 label= larval arrest
id= WBPhenotype:0000697 label= protruding vulva

Then I wrote another Python script that retrieves all the phenotypes for an input list of genes: see here.
For example, for an input list of C.elegans genes:
1   WBGene00000079
2   WBGene00001484
3   WBGene00001948
you can run it by typing:
% python3 mytmp mytmp.out
and the output is:
WBGene00000079  WBPhenotype:0001952     germline nuclear positioning variant
WBGene00000079  WBPhenotype:0001969     germ cell compartment morphology variant
WBGene00000079  WBPhenotype:0000186     oogenesis variant
WBGene00000079  WBPhenotype:0000640     egg laying variant
WBGene00000079  WBPhenotype:0001940     rachis morphology variant
WBGene00000079  WBPhenotype:0001980     germ cell compartment expansion variant
WBGene00000079  WBPhenotype:0000638     molt defect
WBGene00000079  WBPhenotype:0000154     reduced brood size
WBGene00000079  WBPhenotype:0001973     germ cell compartment size variant
WBGene00000079  WBPhenotype:0000059     larval arrest
WBGene00000079  WBPhenotype:0000697     protruding vulva
WBGene00001484  WBPhenotype:0000640     egg laying variant
WBGene00001948  WBPhenotype:0000062     lethal
WBGene00001948  WBPhenotype:0000054     larval lethal
WBGene00001948  WBPhenotype:0000867     embryonic arrest
WBGene00001948  WBPhenotype:0000643     locomotion variant
WBGene00001948  WBPhenotype:0000050     embryonic lethal
WBGene00001948  WBPhenotype:0000053     paralyzed arrested elongation two fold
WBGene00001948  WBPhenotype:0000535     organism morphology variant
WBGene00001948  WBPhenotype:0000406     lumpy
WBGene00001948  WBPhenotype:0000583     dumpy
WBGene00001948  WBPhenotype:0000688     sterile
WBGene00001948  WBPhenotype:0000669     sex muscle development variant
WBGene00001948  WBPhenotype:0000154     reduced brood size
WBGene00001948  WBPhenotype:0000861     body wall muscle development variant
WBGene00001948  WBPhenotype:0000861     body wall muscle development variant
WBGene00001948  WBPhenotype:0000861     body wall muscle development variant
WBGene00001948  WBPhenotype:0001913     excess coelomocytes
WBGene00001948  WBPhenotype:0000095     M lineage variant
WBGene00001948  WBPhenotype:0000031     slow growth

Python script to retrieve references for a gene

Here is a little script to retrieve the references (e.g. papers) for a gene:
import requests, sys

server = ""
ext = "/rest/field/gene/WBGene00000079/references"

r = requests.get(server+ext, headers={ "Content-Type" : "application/json", "Accept" : ""})

if not r.ok:

decoded = r.json()
# print(decoded)

# based on looking at the example
refs = decoded["references"]
refs = refs["data"]
refs = refs["results"]
refcnt = 0
for ref in refs:
    refcnt += 1
    title = ref["title"]
    title = title[0]
    abstract = ref["abstract"]
    abstract = abstract[0]
    title_words = title.split()
    abstract_words = abstract.split()
    if 'development' in title_words:


Retrieving the gene name for a gene

You can retrieve the gene name for a gene by using the REST API e.g. .

Wednesday 26 June 2019

Target prediction in ChEMBL

I recently wrote a blogpost on using the ChEMBL REST API to retrieve compounds with bioactivities for particular targets, as well as compound information (structure, logP etc.) for those compounds, from the ChEMBL database.

What about retrieving information on which (ChEMBL) targets are predicted for a particular ChEMBL compound? If you look at the ChEMBL page for a compound (e.g. imatinib),  at the bottom of the page you can see a 'Target predictions' section, containing predicted protein targets in ChEMBL. These are based on the prediction approach described here and here. These predictions give a good first approximation of whether there is a predicted binding between a compound and a ChEMBL target.

Simple REST API queries through the ChEMBL website
It turns out this is easy to retrieve this target prediction info. using the ChEMBL REST API (just for compounds and targets already in ChEMBL). For example, to get target predictions for imatinib, we can type into a web browser:
and we should see:

The fields shown here are:
in_training: was the predicted compound in the training set? yes: 1 / no: 0.
molecule_chembl_id: id of the predicted compound (presumably a parent chembl id)
pred_id: id of the prediction for the target - compound pair
probability: probability that the compound is active on the target. It ranges from 0 (predicted not active) to 1 (predicted active)
target_accession: uniprot_id of the target
target_chembl_id: the ChEMBL id for the target
target_organism: the organism of the target
target_tax_id: NCBI taxonomy id of target
value: 1 or 10, because 2 models were built. One with an activity threshold of 1 uM and the other at 10 uM

Simple REST API queries using Python
I wrote a Python script to retrieve all the predicted targets from ChEMBL, for an input list of ChEMBL compounds. You can see it here.

Thank you to Fiona Hunter and Nicolas Bosc from ChEMBL for advice about the ChEMBL target prediction REST API!

Tuesday 25 June 2019


The TRANSFAC database contains experimentally determined transcription factor binding sites, and is a rich source of information for those interested in promoters and gene regulation.

You can create a free account on their website, and get transcription factor binding sites for a transcription factor by searching.

Finding binding sites of muscle-specific transcription factors in TRANSFAC
For example, I am interested in muscle-specific transcription factors such as MyoD. I searched for them in TRANSFAC like this: (at the search page)

1. Once you log in, you can then click on 'Search the TRANSFAC Public database', and on the search page you can choose to search in the 'Factor' table, and then you can choose search term 'muscle', and select 'Cell specificity (positive)' in the 'Table field to search in' menu:

2. and this will bring up a list of muscle-specific transcription factors. If you click on a particular transcription factor (e.g. MyoD, or hlh1 (the C.elegans ortholog of MyoD)), it will give you details such as 'BS' (binding site) which tells you the binding site sequence. )

(I can't show the MyoD entry here as it's copyright info, but make yourself a free account and you can view it too!)

Searching for binding sites in a sequence using TRANSFAC
You can also use the TRANSFAC website to search for binding sites in your sequence of interest. Go to MATCH webpage, and paste in your sequence there, and press 'Submit' to search.

C'est belle...

Wednesday 19 June 2019

Parsing an XML-format phylogenetic tree file

In a previous post, I wrote about retrieving data from WormBase ParaSite using their REST API.

I found that when you ask WormBase ParaSite for a phylogenetic tree, you can get it in JSON format or XML format. XML format seems a good idea as it is easy to parse phylogenetic trees in XML format. Here's an example: 

Retrieving an XML format phylogenetic tree from WormBase
First I retrieved the gene tree containing the gene WBGene00221255 from WormBase ParaSite by using the following Python script:
% python3 > example.xml
where the script is:

# script to retrieve the gene tree containing a particular gene from WormBase ParaSite
# example script taken from

import requests, sys

server = ""
ext = "/rest-13/genetree/member/id/WBGene00221255?"

r = requests.get(server+ext, headers={ "Content-Type" : "text/x-phyloxml+xml", "Accept" : ""})

if not r.ok:


This gave me an output xml file example.xml, that contains an XML-format phylogenetic tree:
 <?xml version="1.0" encoding="UTF-8"?>

<phyloxml xsi:schemaLocation="" xmlns:xsi="" x
  <phylogeny rooted="true" type="gene tree">
    <clade branch_length="0">
      <clade branch_length="0">
        <confidence type="bootstrap">46</confidence>
        <clade branch_length="0.056919">
          <confidence type="bootstrap">59</confidence>
          <clade branch_length="0.140118">
            <confidence type="bootstrap">35</confidence>
            <clade branch_length="0.048399">
              <confidence type="bootstrap">35</confidence>
              <clade branch_length="0.005286">
                  <scientific_name>Brugia malayi strain FR3</scientific_name>
                  <common_name>Brugia malayi (PRJNA10729)</common_name>
                  <accession source="Ensembl">Bm994.1</accession>

                <property datatype="xsd:string" ref="Compara:genome_db_name" applies_to="clade">brugia_malayi_prjna10729</property>
              <clade branch_length="0.108384">
                  <scientific_name>Brugia pahangi strain Scotland/Glasgow</scientific_name>
                  <common_name>Brugia pahangi (PRJEB497)</common_name>
                  <accession source="Ensembl">BPAG_0000765501-mRNA-1</accession>
                <property datatype="xsd:string" ref="Compara:genome_db_name" applies_to="clade">brugia_pahangi_prjeb497</property>
            <clade branch_length="0.046778">
                <scientific_name>Wuchereria bancrofti W_bancrofti_Jakarta_v2_0_4</scientific_name>
                <common_name>Wuchereria bancrofti (PRJEB536)</common_name>
                <accession source="Ensembl">WBA_0000831701-mRNA-1</accession>
              <property datatype="xsd:string" ref="Compara:genome_db_name" applies_to="clade">wuchereria_bancrofti_prjeb536</property>
          <clade branch_length="0.158289">
              <scientific_name>Loa loa strain L. loa Cameroon isolate</scientific_name>
              <common_name>Loa loa (PRJNA37757)</common_name>
              <accession source="Ensembl">EFO17702.1</accession>
            <property datatype="xsd:string" ref="Compara:genome_db_name" applies_to="clade">loa_loa_prjna37757</property>
        <clade branch_length="0.514023">
            <scientific_name>Litomosoides sigmodontis strain Bain lab strain</scientific_name>
            <common_name>Litomosoides sigmodontis (PRJEB3075)</common_name>
            <accession source="Ensembl">nLs.2.1.2.t06925-RA</accession>
          <property datatype="xsd:string" ref="Compara:genome_db_name" applies_to="clade">litomosoides_sigmodontis_prjeb3075</property>
      <clade branch_length="0.109525">
        <confidence type="bootstrap">85</confidence>
        <clade branch_length="0.207304">
            <scientific_name>Onchocerca volvulus strain O. volvulus Cameroon isolate</scientific_name>
            <common_name>Onchocerca volvulus (PRJEB513)</common_name>
            <accession source="Ensembl">OVOC2199.2</accession>
          <property datatype="xsd:string" ref="Compara:genome_db_name" applies_to="clade">onchocerca_volvulus_prjeb513</property>
        <clade branch_length="0.254764">
            <scientific_name>Dirofilaria immitis</scientific_name>
            <common_name>Dirofilaria immitis (PRJEB1797)</common_name>
            <accession source="Ensembl">nDi.2.2.2.t04666</accession>
          <property datatype="xsd:string" ref="Compara:genome_db_name" applies_to="clade">dirofilaria_immitis_prjeb1797</property>
    <property datatype="xsd:string" ref="Compara:gene_tree_stable_id" applies_to="phylogeny">WBGT00000000028368</property>

When you look at the gene tree in WormBase ParaSite, it looks like this:

Parsing an XML format phylogenetic tree using ETE
The next step was to parse the phylogenetic tree. The tree is in the PhyloXML format. Luckily, it turns out that the ETE toolkit (a lovely toolkit for analysing and plotting phylogenetic trees) can parse PhyloXML format, and they have some examples here. 
I also found some useful info. in a blog post by Connor Skennerton here.
To parse my file 'example.xml', I opened up python by typing: (using python2)
(Note to self: ete3 does not seem to work with Python3 for me on the Sanger farm)
% python2
Then in the Python command prompt:
>>> from ete3 import Phyloxml
>>> project = Phyloxml()
>>> project.build_from_file("example.xml")
>>> trees = project.get_phylogeny()
>>> for tree in trees:
...     print tree
...     for node in tree:
...         print "Node name:",

...         print "Species for node:", node.phyloxml_clade.taxonomy[0].get_common_name()
...         for seq in node.phyloxml_clade.get_sequence():
...             mol_seq = seq.get_mol_seq()
...             print "Sequence:", mol_seq.valueOf_

This gave output:

         /-|   \-BPAG_0000765501
        |  |
      /-|   \-WBA_0000831701
     |  |
   /-|   \-LOAG_10794
  |  |
--|   \-nLs.2.1.2.g06925
  |   /-WBGene00239008
Node name: WBGene00221255
Species for node: ['Brugia malayi (PRJNA10729)']
Node name: BPAG_0000765501
Species for node: ['Brugia pahangi (PRJEB497)']
Node name: WBA_0000831701
Species for node: ['Wuchereria bancrofti (PRJEB536)']
Node name: LOAG_10794
Species for node: ['Loa loa (PRJNA37757)']
Node name: nLs.2.1.2.g06925
Species for node: ['Litomosoides sigmodontis (PRJEB3075)']
Node name: WBGene00239008
Species for node: ['Onchocerca volvulus (PRJEB513)']
Node name: nDi.2.2.2.g04666
Species for node: ['Dirofilaria immitis (PRJEB1797)']

Note, that as Conor Skennerton points out, the documentation for the PhyloXML in ete3 is a bit sketchy, but you can find out the methods available for an object using the Python 'dir' function, e.g. to find the methods for a node object:
>>> dir(node)
['_PhyloNode__get_speciation_trees_recursive', '__add__', '__and__', '__bool__', '__class__', '__contains__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__iter__', '__len__', '__module__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_asciiArt', '_children', '_diff', '_dist', '_get_children', '_get_dist', '_get_face_areas', '_get_farthest_and_closest_leaves', '_get_name', '_get_species', '_get_style', '_get_support', '_get_up', '_img_style', '_iter_descendants_levelorder', '_iter_descendants_postorder', '_iter_descendants_preorder', '_name', '_set_children', '_set_dist', '_set_face_areas', '_set_name', '_set_species', '_set_style', '_set_support', '_set_up', '_species', '_speciesFunction', '_support', '_up', 'add_child', 'add_face', 'add_feature', 'add_features', 'add_sister', 'annotate_ncbi_taxa', 'build', 'buildChildren', 'check_monophyly', 'children', 'collapse_lineage_specific_expansions', 'compare', 'convert_to_ultrametric', 'copy', 'del_feature', 'delete', 'describe', 'detach', 'dist', 'expand_polytomies', 'export', 'faces', 'features', 'from_parent_child_table', 'from_skbio', 'get_age', 'get_age_balanced_outgroup', 'get_ancestors', 'get_ascii', 'get_cached_content', 'get_children', 'get_closest_leaf', 'get_common_ancestor', 'get_descendant_evol_events', 'get_descendants', 'get_distance', 'get_edges', 'get_farthest_leaf', 'get_farthest_node', 'get_farthest_oldest_leaf', 'get_farthest_oldest_node', 'get_leaf_names', 'get_leaves', 'get_leaves_by_name', 'get_midpoint_outgroup', 'get_monophyletic', 'get_my_evol_events', 'get_sisters', 'get_speciation_trees', 'get_species', 'get_topology_id', 'get_tree_root', 'img_style', 'is_leaf', 'is_root', 'iter_ancestors', 'iter_descendants', 'iter_edges', 'iter_leaf_names', 'iter_leaves', 'iter_prepostorder', 'iter_search_nodes', 'iter_species', 'ladderize', 'link_to_alignment', 'name', 'ncbi_compare', 'phonehome', 'phyloxml_clade', 'phyloxml_phylogeny', 'populate', 'prune', 'reconcile', 'remove_child', 'remove_sister', 'render', 'resolve_polytomy', 'robinson_foulds', 'search_nodes', 'set_outgroup', 'set_species_naming_function', 'set_style', 'show', 'sort_descendants', 'species', 'split_by_dups', 'standardize', 'support', 'swap_children', 'traverse', 'unroot', 'up', 'write']

Added 9-Mar-2022: Parsing an XML format phylogenetic tree using BioPython
I've found that I now seem to have some problems parsing an XML tree using ETE, it throws an error which I don't seem to be able to correct:
% export PYTHONPATH=/nfs/users/nfs_a/alc/Documents/PythonModules/ete3/ete-3.0/
% python2
Then in the Python command prompt:
>>> from ete3 import Phyloxml
>>> project = Phyloxml()
>>> project.build_from_file("example.xml")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nfs/users/nfs_a/alc/Documents/PythonModules/ete3/ete-3.0/ete3/phyloxml/", line 61, in build_from_file
  File "/nfs/users/nfs_a/alc/Documents/PythonModules/ete3/ete-3.0/ete3/phyloxml/", line 464, in build
    self.buildChildren(child, node, nodeName_)
  File "/nfs/users/nfs_a/alc/Documents/PythonModules/ete3/ete-3.0/ete3/phyloxml/", line 470, in buildChildren
  File "/nfs/users/nfs_a/alc/Documents/PythonModules/ete3/ete-3.0/ete3/phyloxml/", line 120, in build
    self.phyloxml_phylogeny.buildAttributes(node, node.attrib, [])
  File "/nfs/users/nfs_a/alc/Documents/PythonModules/ete3/ete-3.0/ete3/phyloxml/", line 715, in buildAttributes
    value = find_attr_value_('rerootable', node)
  File "/nfs/users/nfs_a/alc/Documents/PythonModules/ete3/ete-3.0/ete3/phyloxml/", line 281, in find_attr_value_
    namespaces = six.itervalues(node.nsmap)
AttributeError: nsmap

I'm not sure why I get this error, but I found that I can instead use BioPython to parse the PhyloXML trees:
% python2
Then in the Python command prompt, get the names of genes in the tree, their sequences, their species, and the identifier for the whole tree:
>>> from Bio import Phylo
>>> tree ="example.xml", "phyloxml")
>>> for clade in tree.get_terminals():
...         print
>>> for clade in tree.get_terminals():
...         sequences_list = clade.sequences
...         sequence = sequences_list[0]
...         molseq = sequence.mol_seq 
...         print(molseq)
>>>  for clade in tree.get_terminals():
...         taxonomy_list = clade.taxonomies
...         taxonomy = taxonomy_list[0]
...         species = taxonomy.scientific_name
...         print(species)
Brugia malayi strain FR3
Brugia pahangi strain Scotland/Glasgow
Litomosoides sigmodontis strain Bain lab strain
Onchocerca volvulus strain O. volvulus Cameroon isolate
Dirofilaria immitis

>>> tree_properties =
>>> print(tree_properties[0]).value

Wednesday 12 June 2019

Clustering compounds using DataWarrior

I wanted to cluster some chemical compounds (which I had stored in SMILES format), and had read about the free cheminformatics software DataWarrior, so decided to give it a try!

After downloading and installing DataWarrior on my Mac, I had a read through the DataWarrior user manual.

Reading in my data
I wanted to read in a tab-delimited text file in which the first column contains molecule name and the second column contains SMILES.
I made a little file with some test data, with two columns, the first with compound name, and the second with SMILES:
Nemadectin    C[C@@H]\1C/C(=C/C[C@@H]2C[C@@H](C[C@@]3(O2)C[C@@H]([C@@H]([C@H](O3)/C(=C/C(C)C)/C)C)O)OC(=O)[C@@H]4C=C([C@H]([C@@H]5[C@]4(/C(=C/C=C1)/CO5)O)O)C)/C
Hexachloroethane    C(C(Cl)(Cl)Cl)(Cl)(Cl)Cl
Tetrachloroethylene    C(=C(Cl)Cl)(Cl)Cl
N-Butyl chloride    CCCCCl
Fumazone    C(C(CBr)Br)Cl

I then went to the File->Open menu in DataWarrior, and voila, my data loaded into DataWarrior!

Clustering compounds
I next wanted to cluster my compounds by similarity. To do this, I selected all my compounds in the table at the 'Table' window (see top left) in DataWarrior, and then chose 'Cluster compounds/reactions' in the 'Chemistry' menu. 

This didn't produce any plot, but I noticed that two extra columns had been added to the 'Table' window, 'Cluster No' and 'Is Representative', saying what is the cluster number for each compound, and whether it is a good representative for the cluster.

Similarity analysis
Another type of analysis you can perform in DataWarrior is a 'similarity analysis'. To do this, I went to 'Chemistry'->'Analyse similarity/activity cliffs'. This brought up a new window called a 'Similarity chart', which shows my molecules, with similar compounds connected by a line. Also, when you click on the dot representing a compound, other molecules that are similar are coloured according to their level of similarity.

For example, if I click on 'ivermectin', I see the dot representing ivermectin becomes dark green, and other compounds that are similar are now coloured green:


In another nice example, I see a trio of compounds that are coloured green at the bottom (with lines between them), but another greenish compound somewhere on the left which is not joined by a line but is somewhat similar:

Note that if you move your mouse over a dot (compound) in the 'Similarity chart' window, the compound will appear in the window at the bottom left of the screen:

Note that it's also possible to hover over, or click on, a compound in the 'Table' of compounds at the top left of DataWarrior, and that compound will be highlighted in the 'Similarity chart' window, e.g. 
for euphorbol:

Note that if you want to export the connected components found by DataWarrior in this Similarity Analysis, you can go to File->Save special, and choose 'Textfile', and for each compound the text file has its number of neighbours, and the node numbers for the neighbours in the Similarity Analysis plot. You can then use that information to figure out the members of each connected component.