Wednesday 29 May 2019

Retrieving data from ChEMBL using their web interface

The ChEMBL database is a wonderfully useful database of bioactivity data for compounds and drugs, and their targets (e.g. protein targets, protein complex targets, whole organism targets, etc.).

Simple queries of ChEMBL via the web
The ChEMBL team have created a RESTful web interface, which means that you can query their data through the web.

For example, to retrieve all the compounds that have bioactivities against the ChEMBL targets CHEMBL1848 and CHEMBL3394, you can type the following into your web browser:
https://www.ebi.ac.uk/chembl/api/data/activity.json?target_chembl_id__in=CHEMBL1848,CHEMBL3394&assay_type=B&pchembl_value__gte=5&limit=100
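This should return the first 100 bioactivity records for these two ChEMBL targets, each of which names the compound/drug that was tested. The URL also restricts the results to binding assays (assay_type=B) with a pChEMBL value of at least 5 (pchembl_value__gte=5).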
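
The URL above can also be assembled from its parts, which makes it easier to swap in other targets or filters. This is only a minimal sketch using Python's standard library; the helper name build_query_url and the constant BASE_URL are made up for illustration and are not part of ChEMBL's own tools.

```python
from urllib.parse import urlencode

# Base address taken from the example URLs in this post.
BASE_URL = "https://www.ebi.ac.uk/chembl/api/data"


def build_query_url(resource, **filters):
    """Build a query URL such as .../activity.json?key=value&...

    'resource' is the query type (e.g. 'activity' or 'molecule') and
    'filters' are the key=value pairs that follow the '?' in the URL.
    """
    # safe="," keeps the commas in lists of ChEMBL ids readable.
    query = urlencode(filters, safe=",")
    return f"{BASE_URL}/{resource}.json?{query}"


activity_url = build_query_url(
    "activity",
    target_chembl_id__in="CHEMBL1848,CHEMBL3394",  # the two targets
    assay_type="B",                                # binding assays only
    pchembl_value__gte=5,                          # pChEMBL value >= 5
    limit=100,                                     # at most 100 records per page
)
print(activity_url)
```

The printed URL can be pasted straight into a web browser, and changing "activity" to "molecule" with different filters gives the other query types.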


The example above uses 'activity.json' to query activity data. The API documentation for their web interface here lists all the query types you can perform.

Another example is to use 'molecule.json' to query the properties of molecules. For instance, to get all the properties for molecules CHEMBL1627445 and CHEMBL43600:
https://www.ebi.ac.uk/chembl/api/data/molecule.json?molecule_chembl_id__in=CHEMBL1627445,CHEMBL43600&limit=100

Simple queries of ChEMBL using Python, in a Jupyter notebook
If you are familiar with Python, an even easier way to query ChEMBL via the web is to write the queries within Python.

Fiona Hunter from ChEMBL kindly provided me with an example Jupyter notebook to do this. It queries ChEMBL to find the compounds with bioactivities for certain ChEMBL targets, and then retrieves information on the properties of those compounds. You can see the Jupyter notebook here.

The Jupyter notebook allows you to type in commands and get an instant response, and looks like this:


Note: if you are new to Jupyter notebooks, you can look at my previous post.
Note 2: if you change a parameter in this Jupyter notebook and then get an error message when you run it, it might be because you have already run steps later than the one you are currently editing, so the variables are no longer in the state that step expects. For example, the variable 'mol_df' is filtered down to fewer columns at some stage, so if you go back and rerun an earlier step, it may fail because 'mol_df' no longer has all the columns that step needs. If this happens, don't worry: just go back to the start of the notebook and run each step again, one at a time (using SHIFT+RETURN).

Simple queries of ChEMBL using a traditional Python script
The Jupyter notebook above is quite nice because it is interactive, so you can fiddle with parameters and rerun it easily, but if you want to retrieve lots of data from ChEMBL, it's probably best to use a traditional Python script. I've rewritten the Jupyter notebook above as a traditional Python script, see here.
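
When retrieving lots of data, bear in mind that each query only returns one page of results (100 records with limit=100), so a script has to keep asking for the next page. Below is a rough sketch of that loop, not the script linked above. It assumes that each JSON response carries a 'page_meta' block whose 'next' entry gives the address of the following page (or is empty on the last page), and that the records sit under a key such as 'activities' or 'molecules'; it is worth checking one response in your browser to confirm this. The function names fetch_json and fetch_all are made up for illustration.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

# Site address taken from the example URLs in this post.
SITE = "https://www.ebi.ac.uk"


def fetch_json(url):
    """Download one page of results and decode it from JSON."""
    with urlopen(url) as response:
        return json.load(response)


def fetch_all(first_url, key, fetch=fetch_json, max_pages=None):
    """Collect the records stored under 'key' from every page.

    Assumption: each page looks like
    {key: [...records...], "page_meta": {"next": <address or None>}}.
    'max_pages' is an optional safety limit for test runs.
    """
    records = []
    url = first_url
    pages_read = 0
    while url and (max_pages is None or pages_read < max_pages):
        page = fetch(url)
        records.extend(page.get(key, []))
        next_address = page.get("page_meta", {}).get("next")
        # urljoin copes with both full URLs and site-relative paths.
        url = urljoin(SITE, next_address) if next_address else None
        pages_read += 1
    return records
```

For example, fetch_all(activity_url, "activities", max_pages=2) would read at most the first two pages of an activity query, where activity_url is a query URL like the one in the first section. Passing your own 'fetch' function makes it possible to try the loop out without touching the network.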

Some extra notes
- To extract additional columns on things such as oral/parenteral/topical delivery and black box warnings, we can type:
mol_df = mol_df[[ 'molecule_chembl_id','pref_name', 'molecule_hierarchy'
                 , 'molecule_properties', 'max_phase', 'availability_type', 'oral', 'topical', 'parenteral', 'black_box_warning']]


- To extract the QED score, we can type:
mol_df['qed_weighted'] = mol_df.loc[ mol_df['molecule_properties'].notnull(), 'molecule_properties'].apply(lambda x: x['qed_weighted'])

- To extract the SMILES, we can type:
mol_df = mol_df[[ 'molecule_chembl_id','pref_name', 'molecule_hierarchy'
                 , 'molecule_properties', 'max_phase', 'availability_type', 'oral', 'topical'
                 , 'parenteral', 'black_box_warning', 'molecule_structures']]

mol_df['canonical_smiles'] = mol_df.loc[ mol_df['molecule_structures'].notnull(), 'molecule_structures'].apply(lambda x: x['canonical_smiles'])

- There is a nice picture of the ChEMBL database schema available on the ChEMBL FTP site here.
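
The QED and SMILES notes above both pull one entry out of a nested dictionary ('molecule_properties' or 'molecule_structures') and skip molecules where that dictionary is missing. The same idea can be written as one small helper, sketched here on plain Python dictionaries. The helper name inner_value is made up for illustration, and the two records are invented examples, not real ChEMBL entries.

```python
def inner_value(cell, key):
    """Return cell[key] if 'cell' is a dictionary containing 'key', else None.

    Missing data (None, or NaN in a pandas column) is not a dictionary,
    so it simply gives None instead of raising an error.
    """
    if isinstance(cell, dict):
        return cell.get(key)
    return None


# Invented records for illustration only; not real ChEMBL data.
records = [
    {"molecule_properties": {"qed_weighted": "0.5"},
     "molecule_structures": {"canonical_smiles": "CCO"}},
    {"molecule_properties": None, "molecule_structures": None},
]

qed_scores = [inner_value(r["molecule_properties"], "qed_weighted") for r in records]
smiles = [inner_value(r["molecule_structures"], "canonical_smiles") for r in records]
print(qed_scores)
print(smiles)
```

In the notebook, the same helper could be used on a column, for example mol_df['molecule_properties'].apply(lambda x: inner_value(x, 'qed_weighted')), which removes the need to filter with notnull() first.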

Acknowledgements
Thank you to Fiona Hunter from ChEMBL for invaluable advice on getting started on querying ChEMBL via their web interface.

