Wednesday, 11 December 2019

Statistical rethinking: grid approximation for estimating a posterior

My colleague James Cotton recommended that a good way to expand my stats knowledge is to read the book Statistcal Rethinking by Richard McElreath, which so far I'm enjoying a lot!

A nice aspect of the book is that he uses many examples in R. 

Grid approximation for estimating a posterior distribution
In chapter 2, he shows how to estimate a posterior distribution using a 'grid approximation' approach. I've wrapped up his lines of code into a little function that lets you try out different number of grid points.

In this case, we are interested in calculating the posterior distribution for p, the probability of success for a binomial variable.

We can specifiy different prior distributions for p.

Let's assume that we have carried out 9 Bernoulli trials and observed 6 successes so far. 

Let's try a uniform prior:

plot_posterior_using_uniform_prior <- function(num_points_in_grid)
    p_grid <- seq(from = 0, to=1, length.out=num_points_in_grid)
    prior <- rep(1,num_points_in_grid)
    likelihood <- dbinom(6, size=9, prob=p_grid)
    ustd.posterior <- likelihood * prior
    posterior <- ustd.posterior/sum(ustd.posterior)
    plot(p_grid, posterior, type="b", xlab="probability of success", ylab="posterior probability for p")

Do a plot with a grid with 30 values for p:

Let's try 10 values for p:
This doesn't give us such a good estimate of the posterior distribution for p:

What if we had a different prior, that was a step function, assigning 0 probability to all values of p less than 0.5:

plot_posterior_using_step_prior <- function(num_points_in_grid)
     p_grid <- seq(from = 0, to=1, length.out=num_points_in_grid) 
     prior <- ifelse ( p_grid < 0.5, 0, 1)
     likelihood <- dbinom(6, size=9, prob=p_grid) 
     ustd.posterior <- likelihood * prior 
     posterior <- ustd.posterior/sum(ustd.posterior) 
     plot(p_grid, posterior, type="b", xlab="probability of success", ylab="posterior probability for p")

Quadratic approximation for estimating a posterior distribution 
Chapter 2 of Statistical Rethinking also explains that the grid approximation approach doesn't scale well and is slow, and that a faster, more scaleable approach is quadratic approximation. McElreath shows how it can be used to estimate the posterior distribution for the same parameter p, as above: (see his book for how to install the 'rethinking' R package):

library(rethinking) <- quap(alist(W ~ dbinom(W+L,p), p ~ dunif(0,1)), data=list(W=6,L=3)
+ )
  mean   sd 5.5% 94.5%
p 0.67 0.16 0.42  0.92
This tells us that, using a uniform prior and assuming the posterior for p is Gaussian (the key assumption of the quadratic approximation approach), its maximum (peak) is at 0.67 and its standard deviation is 0.16.
This is similar to the peak of the posterior distribution for p that we estimated using the grid approximation approach above (when using a large number of grid points such as 30).

Friday, 15 November 2019

Drawing a molecule using CDK Depict

Before I wrote a blogpost on how to draw a molecule using OpenBabel.

Another lovely way to draw a molecule is using CDK Depict.

I can paste in a SMILES string, eg. for the drug tegaserod 'CCCCCNC(=N)N\N=C\c1c[nH]c2ccc(OC)cc12', and get a picture:

I prefer to have the picture in colour on a white background, and no abbreviations:


Monday, 7 October 2019

Installing OpenBabel

I previously wrote a little blogpost about the OpenBabel software for cheminformatics.

This week I decided to install the latest OpenBabel on my Mac laptop.

Here's what I did:

--- The latest instructions for installation from the OpenBabel website are here.

--- I downloaded the latest development version from github here. (Note: I first tried to install the latest release, but there was a problem, so instead I found I could install the development version, which is newer.)

--- I then did this on my Mac laptop, in an terminal window:
     % mkdir /Users/alc/Documents/Software/openbabel
     % cd /Users/alc/Documents/Software/openbabel
     % mv ~alc/Downloads/openbabel-openbabel-3-0-0a2.tar.gz .
     % gunzip openbabel-openbabel-3-0-0a2.tar.gz
     % tar xvf openbabel-openbabel-3-0-0a2.tar
     % mkdir ob-build
     % mv openbabel-openbabel-3-0-0a2 ob-src

--- I then installed CMake as it wasn't on my Mac laptop already (Note to self: I downloaded the Mac dmg from and asked Systems to install it).

--- In the CMake GUI on my Mac laptop, I then set:
     'where is the source code' to /Users/alc/Documents/Software/openbabel/ob-src
    'where to build the binaries' to /Users/alc/Documents/Software/openbabel/ob-build
    and CMAKE_INSTALL_PREFIX to /Users/alc/Documents/Software/openbabel-install
(Note: setting the CMAKE_INSTALL_PREFIX variable like this means that you can make a local installation of OpenBabel even if you are don't have the administrator password for the computer.)


--- I then clicked 'Configure' in Cmake and then clicked 'Generate' in Cmake. This makes a makefile for building OpenBabel.

---  Back in the terminal window on my Mac laptop, I typed:
% cd ob-build
% make
% make install

--- Then I tried running Openbabel:
% /Users/alc/Documents/Software/openbabel/ob-build/bin/obabel
No input file or format spec or possibly a misplaced option.
Most options must come after the input files. (-i -o -O -m can be anywhwere.)

Open Babel 3.0.0a2 -- Oct  1 2019 -- 10:22:49
obabel [-i<input-type>] <infilename> [-o<output-type>] -O<outfilename> [Options]
Try  -H option for more information.

Hurray!!! It worked : )

Now let's try some OpenBabel magic:
% /Users/alc/Documents/Software/openbabel/ob-build/bin/obabel -:"c1ccccc1C(=O)Cl" -oascii

  Cl                                   O                                      
         __                   _/                                              
           \_              __/   __                                           
             \__        __/    _/                                             
                \_    _/    __/                                               
                  \__/    _/                                                  
                    \_  _/                                                    
                   _/  \_                                                     
                __/      \__                                                  
              _/      __    \__                                               
           __/          \_     \_                                             
        __/               \__    \__                                          
      _/                     \__    \_                                        
   __/                          \_    \_                                      
   |                              \_   |                                      
   |                                   |                                      
   |  |                                |                                      
   |  |                                |                                      
   |  |                                |                                      
   |  |                                |                                      
   |  |                                |                                      
   |  |                                |                                      
   |  |                                |                                      
   |  |                                |                                      
   |  |                                |                                      
   |  |                                |                                      
   |                              __   |                                      
   _                            _/    __                                      
    \_                       __/    _/                                        
      \__                 __/    __/                                          
         \_             _/    __/                                             
           \__        _/    _/                                                
              \__        __/                                                  
                 \_    _/                                                     
1 molecule converted

Monday, 8 July 2019

Heatmaps and R

Ages ago I wrote a blogpost on heatmaps in R, but that was focussing mainly on clustering and dendrograms.

This week I wanted to draw a simple heatmap and found nice tutorial here.

My input data looks something like this (in file 'mydata'):

To make a heatmap I did this:
> data <- read.csv("mydata")
> rnames <- data[,1]
> mat_data <- data.matrix(data[,2:ncol(data)])
> rownames(mat_data) <- rnames
> my_palette <- colorRampPalette(c("red", "yellow", "green"))(n = 299)
> library(gplots)
> heatmap.2(mat_data,trace="none",col=my_palette, dendrogram="none", margins=c(12,9))

This gives a nice heatmap like this:

Friday, 28 June 2019

Retrieving data from WormBase using the REST API

I recently wrote a blog post on using the WormBase ParaSite REST API to retrieve data from their database. For C. elegans data, it's best to use the main WormBase website, and there is also a WormBase REST API.

My problem today was to retrieve phenotype information (from RNAi, mutants) for C. elegans genes. Ages ago, I wrote a blog post doing this by querying the WormBase website through their standard webpage. But, let's modernise and use the REST API instead!

I found that it's possible to do a REST API query to get the phenotypes for a gene, by typing for example:
for gene WBGene00000079.

Python scripts to retrieve phenotypes for a gene, or a list of genes
I also wrote a Python script to do the same query using Python, see here. When you run it you should see:
% python3
id= WBPhenotype:0001952 label= germline nuclear positioning variant
id= WBPhenotype:0001969 label= germ cell compartment morphology variant
id= WBPhenotype:0000186 label= oogenesis variant
id= WBPhenotype:0000640 label= egg laying variant
id= WBPhenotype:0001940 label= rachis morphology variant
id= WBPhenotype:0001980 label= germ cell compartment expansion variant
id= WBPhenotype:0000638 label= molt defect
id= WBPhenotype:0000154 label= reduced brood size
id= WBPhenotype:0001973 label= germ cell compartment size variant
id= WBPhenotype:0000059 label= larval arrest
id= WBPhenotype:0000697 label= protruding vulva

Then I wrote another Python script that retrieves all the phenotypes for an input list of genes: see here.
For example, for an input list of C.elegans genes:
1   WBGene00000079
2   WBGene00001484
3   WBGene00001948
you can run it by typing:
% python3 mytmp mytmp.out
and the output is:
WBGene00000079  WBPhenotype:0001952     germline nuclear positioning variant
WBGene00000079  WBPhenotype:0001969     germ cell compartment morphology variant
WBGene00000079  WBPhenotype:0000186     oogenesis variant
WBGene00000079  WBPhenotype:0000640     egg laying variant
WBGene00000079  WBPhenotype:0001940     rachis morphology variant
WBGene00000079  WBPhenotype:0001980     germ cell compartment expansion variant
WBGene00000079  WBPhenotype:0000638     molt defect
WBGene00000079  WBPhenotype:0000154     reduced brood size
WBGene00000079  WBPhenotype:0001973     germ cell compartment size variant
WBGene00000079  WBPhenotype:0000059     larval arrest
WBGene00000079  WBPhenotype:0000697     protruding vulva
WBGene00001484  WBPhenotype:0000640     egg laying variant
WBGene00001948  WBPhenotype:0000062     lethal
WBGene00001948  WBPhenotype:0000054     larval lethal
WBGene00001948  WBPhenotype:0000867     embryonic arrest
WBGene00001948  WBPhenotype:0000643     locomotion variant
WBGene00001948  WBPhenotype:0000050     embryonic lethal
WBGene00001948  WBPhenotype:0000053     paralyzed arrested elongation two fold
WBGene00001948  WBPhenotype:0000535     organism morphology variant
WBGene00001948  WBPhenotype:0000406     lumpy
WBGene00001948  WBPhenotype:0000583     dumpy
WBGene00001948  WBPhenotype:0000688     sterile
WBGene00001948  WBPhenotype:0000669     sex muscle development variant
WBGene00001948  WBPhenotype:0000154     reduced brood size
WBGene00001948  WBPhenotype:0000861     body wall muscle development variant
WBGene00001948  WBPhenotype:0000861     body wall muscle development variant
WBGene00001948  WBPhenotype:0000861     body wall muscle development variant
WBGene00001948  WBPhenotype:0001913     excess coelomocytes
WBGene00001948  WBPhenotype:0000095     M lineage variant
WBGene00001948  WBPhenotype:0000031     slow growth

Python script to retrieve references for a gene

Here is a little script to retrieve the references (e.g. papers) for a gene:
import requests, sys

server = ""
ext = "/rest/field/gene/WBGene00000079/references"

r = requests.get(server+ext, headers={ "Content-Type" : "application/json", "Accept" : ""})

if not r.ok:

decoded = r.json()
# print(decoded)

# based on looking at the example
refs = decoded["references"]
refs = refs["data"]
refs = refs["results"]
refcnt = 0
for ref in refs:
    refcnt += 1
    title = ref["title"]
    title = title[0]
    abstract = ref["abstract"]
    abstract = abstract[0]
    title_words = title.split()
    abstract_words = abstract.split()
    if 'development' in title_words:


Retrieving the gene name for a gene

You can retrieve the gene name for a gene by using the REST API e.g. .

Wednesday, 26 June 2019

Target prediction in ChEMBL

I recently wrote a blogpost on using the ChEMBL REST API to retrieve compounds with bioactivities for particular targets, as well as compound information (structure, logP etc.) for those compounds, from the ChEMBL database.

What about retrieving information on which (ChEMBL) targets are predicted for a particular ChEMBL compound? If you look at the ChEMBL page for a compound (e.g. imatinib),  at the bottom of the page you can see a 'Target predictions' section, containing predicted protein targets in ChEMBL. These are based on the prediction approach described here and here. These predictions give a good first approximation of whether there is a predicted binding between a compound and a ChEMBL target.

Simple REST API queries through the ChEMBL website
It turns out this is easy to retrieve this target prediction info. using the ChEMBL REST API (just for compounds and targets already in ChEMBL). For example, to get target predictions for imatinib, we can type into a web browser:
and we should see:

The fields shown here are:
in_training: was the predicted compound in the training set? yes: 1 / no: 0.
molecule_chembl_id: id of the predicted compound (presumably a parent chembl id)
pred_id: id of the prediction for the target - compound pair
probability: probability that the compound is active on the target. It ranges from 0 (predicted not active) to 1 (predicted active)
target_accession: uniprot_id of the target
target_chembl_id: the ChEMBL id for the target
target_organism: the organism of the target
target_tax_id: NCBI taxonomy id of target
value: 1 or 10, because 2 models were built. One with an activity threshold of 1 uM and the other at 10 uM

Simple REST API queries using Python
I wrote a Python script to retrieve all the predicted targets from ChEMBL, for an input list of ChEMBL compounds. You can see it here.

Thank you to Fiona Hunter and Nicolas Bosc from ChEMBL for advice about the ChEMBL target prediction REST API!

Tuesday, 25 June 2019


The TRANSFAC database contains experimentally determined transcription factor binding sites, and is a rich source of information for those interested in promoters and gene regulation.

You can create a free account on their website, and get transcription factor binding sites for a transcription factor by searching.

Finding binding sites of muscle-specific transcription factors in TRANSFAC
For example, I am interested in muscle-specific transcription factors such as MyoD. I searched for them in TRANSFAC like this: (at the search page)

1. Once you log in, you can then click on 'Search the TRANSFAC Public database', and on the search page you can choose to search in the 'Factor' table, and then you can choose search term 'muscle', and select 'Cell specificity (positive)' in the 'Table field to search in' menu:

2. and this will bring up a list of muscle-specific transcription factors. If you click on a particular transcription factor (e.g. MyoD, or hlh1 (the C.elegans ortholog of MyoD)), it will give you details such as 'BS' (binding site) which tells you the binding site sequence. )

(I can't show the MyoD entry here as it's copyright info, but make yourself a free account and you can view it too!)

Searching for binding sites in a sequence using TRANSFAC
You can also use the TRANSFAC website to search for binding sites in your sequence of interest. Go to MATCH webpage, and paste in your sequence there, and press 'Submit' to search.

C'est belle...

Wednesday, 19 June 2019

Parsing an XML-format phylogenetic tree file

In a previous post, I wrote about retrieving data from WormBase ParaSite using their REST API.

I found that when you ask WormBase ParaSite for a phylogenetic tree, you can get it in JSON format or XML format. XML format seems a good idea as it is easy to parse phylogenetic trees in XML format. Here's an example: 

Retrieving an XML format phylogenetic tree from WormBase
First I retrieved the gene tree containing the gene WBGene00221255 from WormBase ParaSite by using the following Python script:
% python3 > example.xml
where the script is:

# script to retrieve the gene tree containing a particular gene from WormBase ParaSite
# example script taken from

import requests, sys

server = ""
ext = "/rest-13/genetree/member/id/WBGene00221255?"

r = requests.get(server+ext, headers={ "Content-Type" : "text/x-phyloxml+xml", "Accept" : ""})

if not r.ok:


This gave me an output xml file example.xml, that contains an XML-format phylogenetic tree:
 <?xml version="1.0" encoding="UTF-8"?>

<phyloxml xsi:schemaLocation="" xmlns:xsi="" x
  <phylogeny rooted="true" type="gene tree">
    <clade branch_length="0">
      <clade branch_length="0">
        <confidence type="bootstrap">46</confidence>
        <clade branch_length="0.056919">
          <confidence type="bootstrap">59</confidence>
          <clade branch_length="0.140118">
            <confidence type="bootstrap">35</confidence>
            <clade branch_length="0.048399">
              <confidence type="bootstrap">35</confidence>
              <clade branch_length="0.005286">
                  <scientific_name>Brugia malayi strain FR3</scientific_name>
                  <common_name>Brugia malayi (PRJNA10729)</common_name>
                  <accession source="Ensembl">Bm994.1</accession>

                <property datatype="xsd:string" ref="Compara:genome_db_name" applies_to="clade">brugia_malayi_prjna10729</property>
              <clade branch_length="0.108384">
                  <scientific_name>Brugia pahangi strain Scotland/Glasgow</scientific_name>
                  <common_name>Brugia pahangi (PRJEB497)</common_name>
                  <accession source="Ensembl">BPAG_0000765501-mRNA-1</accession>
                <property datatype="xsd:string" ref="Compara:genome_db_name" applies_to="clade">brugia_pahangi_prjeb497</property>
            <clade branch_length="0.046778">
                <scientific_name>Wuchereria bancrofti W_bancrofti_Jakarta_v2_0_4</scientific_name>
                <common_name>Wuchereria bancrofti (PRJEB536)</common_name>
                <accession source="Ensembl">WBA_0000831701-mRNA-1</accession>
              <property datatype="xsd:string" ref="Compara:genome_db_name" applies_to="clade">wuchereria_bancrofti_prjeb536</property>
          <clade branch_length="0.158289">
              <scientific_name>Loa loa strain L. loa Cameroon isolate</scientific_name>
              <common_name>Loa loa (PRJNA37757)</common_name>
              <accession source="Ensembl">EFO17702.1</accession>
            <property datatype="xsd:string" ref="Compara:genome_db_name" applies_to="clade">loa_loa_prjna37757</property>
        <clade branch_length="0.514023">
            <scientific_name>Litomosoides sigmodontis strain Bain lab strain</scientific_name>
            <common_name>Litomosoides sigmodontis (PRJEB3075)</common_name>
            <accession source="Ensembl">nLs.2.1.2.t06925-RA</accession>
          <property datatype="xsd:string" ref="Compara:genome_db_name" applies_to="clade">litomosoides_sigmodontis_prjeb3075</property>
      <clade branch_length="0.109525">
        <confidence type="bootstrap">85</confidence>
        <clade branch_length="0.207304">
            <scientific_name>Onchocerca volvulus strain O. volvulus Cameroon isolate</scientific_name>
            <common_name>Onchocerca volvulus (PRJEB513)</common_name>
            <accession source="Ensembl">OVOC2199.2</accession>
          <property datatype="xsd:string" ref="Compara:genome_db_name" applies_to="clade">onchocerca_volvulus_prjeb513</property>
        <clade branch_length="0.254764">
            <scientific_name>Dirofilaria immitis</scientific_name>
            <common_name>Dirofilaria immitis (PRJEB1797)</common_name>
            <accession source="Ensembl">nDi.2.2.2.t04666</accession>
          <property datatype="xsd:string" ref="Compara:genome_db_name" applies_to="clade">dirofilaria_immitis_prjeb1797</property>
    <property datatype="xsd:string" ref="Compara:gene_tree_stable_id" applies_to="phylogeny">WBGT00000000028368</property>

When you look at the gene tree in WormBase ParaSite, it looks like this:

Parsing an XML format phylogenetic tree
The next step was to parse the phylogenetic tree. The tree is in the PhyloXML format. Luckily, it turns out that the ETE toolkit (a lovely toolkit for analysing and plotting phylogenetic trees) can parse PhyloXML format, and they have some examples here. 
I also found some useful info. in a blog post by Connor Skennerton here.
To parse my file 'example.xml', I opened up python by typing: (using python2)
(Note to self: ete3 does not seem to work with Python3 for me on the Sanger farm)
% python2
Then in the Python command prompt:
>>> from ete3 import Phyloxml
>>> project = Phyloxml()
>>> project.build_from_file("example.xml")
>>> trees = project.get_phylogeny()
>>> for tree in trees:
...     print tree
...     for node in tree:
...         print "Node name:",

...         print "Species for node:", node.phyloxml_clade.taxonomy[0].get_common_name()
...         for seq in node.phyloxml_clade.get_sequence():
...             mol_seq = seq.get_mol_seq()
...             print "Sequence:", mol_seq.valueOf_

This gave output:

         /-|   \-BPAG_0000765501
        |  |
      /-|   \-WBA_0000831701
     |  |
   /-|   \-LOAG_10794
  |  |
--|   \-nLs.2.1.2.g06925
  |   /-WBGene00239008
Node name: WBGene00221255
Species for node: ['Brugia malayi (PRJNA10729)']
Node name: BPAG_0000765501
Species for node: ['Brugia pahangi (PRJEB497)']
Node name: WBA_0000831701
Species for node: ['Wuchereria bancrofti (PRJEB536)']
Node name: LOAG_10794
Species for node: ['Loa loa (PRJNA37757)']
Node name: nLs.2.1.2.g06925
Species for node: ['Litomosoides sigmodontis (PRJEB3075)']
Node name: WBGene00239008
Species for node: ['Onchocerca volvulus (PRJEB513)']
Node name: nDi.2.2.2.g04666
Species for node: ['Dirofilaria immitis (PRJEB1797)']

Note, that as Conor Skennerton points out, the documentation for the PhyloXML in ete3 is a bit sketchy, but you can find out the methods available for an object using the Python 'dir' function, e.g. to find the methods for a node object:
>>> dir(node)
['_PhyloNode__get_speciation_trees_recursive', '__add__', '__and__', '__bool__', '__class__', '__contains__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__iter__', '__len__', '__module__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_asciiArt', '_children', '_diff', '_dist', '_get_children', '_get_dist', '_get_face_areas', '_get_farthest_and_closest_leaves', '_get_name', '_get_species', '_get_style', '_get_support', '_get_up', '_img_style', '_iter_descendants_levelorder', '_iter_descendants_postorder', '_iter_descendants_preorder', '_name', '_set_children', '_set_dist', '_set_face_areas', '_set_name', '_set_species', '_set_style', '_set_support', '_set_up', '_species', '_speciesFunction', '_support', '_up', 'add_child', 'add_face', 'add_feature', 'add_features', 'add_sister', 'annotate_ncbi_taxa', 'build', 'buildChildren', 'check_monophyly', 'children', 'collapse_lineage_specific_expansions', 'compare', 'convert_to_ultrametric', 'copy', 'del_feature', 'delete', 'describe', 'detach', 'dist', 'expand_polytomies', 'export', 'faces', 'features', 'from_parent_child_table', 'from_skbio', 'get_age', 'get_age_balanced_outgroup', 'get_ancestors', 'get_ascii', 'get_cached_content', 'get_children', 'get_closest_leaf', 'get_common_ancestor', 'get_descendant_evol_events', 'get_descendants', 'get_distance', 'get_edges', 'get_farthest_leaf', 'get_farthest_node', 'get_farthest_oldest_leaf', 'get_farthest_oldest_node', 'get_leaf_names', 'get_leaves', 'get_leaves_by_name', 'get_midpoint_outgroup', 'get_monophyletic', 'get_my_evol_events', 'get_sisters', 'get_speciation_trees', 'get_species', 'get_topology_id', 'get_tree_root', 'img_style', 'is_leaf', 'is_root', 'iter_ancestors', 'iter_descendants', 'iter_edges', 'iter_leaf_names', 'iter_leaves', 'iter_prepostorder', 'iter_search_nodes', 'iter_species', 'ladderize', 'link_to_alignment', 'name', 'ncbi_compare', 'phonehome', 'phyloxml_clade', 'phyloxml_phylogeny', 'populate', 'prune', 'reconcile', 'remove_child', 'remove_sister', 'render', 'resolve_polytomy', 'robinson_foulds', 'search_nodes', 'set_outgroup', 'set_species_naming_function', 'set_style', 'show', 'sort_descendants', 'species', 'split_by_dups', 'standardize', 'support', 'swap_children', 'traverse', 'unroot', 'up', 'write']