Friday 31 May 2019

Retrieving data from WormBase ParaSite using their web interface

I've recently been learning how to query the ChEMBL database via the web using their REST API (see my blog post here), and to query the PDBe database via the web using their REST API (see my blog post here).

So challenge for today: learn how to query the WormBase ParaSite database via the web, using their REST API!

Simple queries of WormBase ParaSite via the web
There is a very nice description of the types of queries you can perform on WormBase ParaSite using the REST API here.

This has examples of code you can type directly into a browser, as well as the code you would need to perform the same queries in programming languages such as Perl or Python.

For example, here is their documentation of the query you can use to retrieve the gene tree that a particular gene of interest belongs to: genetree_member_id documentation.

For instance, if we want to find out which gene tree the Schistosoma mansoni gene SULT-OR belongs to (this gene has the identifier Smp_089320, and encodes the protein involved in resistance to the drug oxamniquine), we can type in the web browser: https://parasite.wormbase.org/rest-13/genetree/member/id/Smp_089320?content-type=text/x-phyloxml%2Bxml

We should get back a gene tree in XML (PhyloXML) format.



By default, this will return the original sequence for each gene in the gene tree, but you can tell it to instead return the alignment for each gene in the gene tree by using the aligned=1 option: https://parasite.wormbase.org/rest-13/genetree/member/id/Smp_089320?aligned=1&content-type=text/x-phyloxml%2Bxml

Then the output should have the aligned sequence for each gene (i.e. with gaps where there are indels).
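
The two browser queries above differ only in the aligned=1 option, so the URLs can be put together by a small helper. This is just a minimal sketch, not code from the WormBase ParaSite documentation: the function names genetree_url and fetch_genetree are made up for illustration, and it assumes the rest-13 path used in this post.

```python
from urllib.parse import urlencode

SERVER = "https://parasite.wormbase.org"
REST_PATH = "/rest-13"  # release-specific path used in this post


def genetree_url(gene_id, aligned=False):
    """Build the gene tree URL for a gene (hypothetical helper)."""
    params = []
    if aligned:
        # aligned=1 asks for the aligned sequence of each gene
        params.append(("aligned", 1))
    params.append(("content-type", "text/x-phyloxml+xml"))
    # safe="/" keeps the slash readable; the "+" is still escaped as %2B
    query = urlencode(params, safe="/")
    return f"{SERVER}{REST_PATH}/genetree/member/id/{gene_id}?{query}"


def fetch_genetree(gene_id, aligned=False):
    """Download the gene tree as PhyloXML text (needs network access)."""
    import requests  # imported here so genetree_url works without it

    r = requests.get(genetree_url(gene_id, aligned), timeout=60)
    r.raise_for_status()
    return r.text
```

Calling genetree_url("Smp_089320") and genetree_url("Smp_089320", aligned=True) should give back the two URLs typed into the browser above, and fetch_genetree would then download the corresponding XML.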




Simple queries of WormBase ParaSite using Python
A nice thing about the WormBase ParaSite REST API documentation (here) is that for each query, it gives example code in Python and Perl.

So for example, to get a list of the ncRNA genes in the Brugia malayi genome (brugia_malayi_prjna10729), in Python3 you would need:

import sys

import requests

server = "https://parasite.wormbase.org"
# look up the genes of biotype ncRNA in the Brugia malayi genome:
ext = "/rest-13/lookup/genome/brugia_malayi_prjna10729?biotypes=ncRNA"

# ask the server to send the result back as JSON:
r = requests.get(server+ext, headers={"Content-Type": "application/json", "Accept": "application/json"})

# stop if the request failed:
if not r.ok:
    r.raise_for_status()
    sys.exit()

decoded = r.json()
print("number of genes=", len(decoded))

# go through the genes and print them out one at a time:
for cnt, gene in enumerate(decoded, start=1):
    print(cnt, "name=", gene["name"])


Here are some Python scripts I wrote to query WormBase ParaSite:
- a script to get a list of all Schistosoma mansoni protein-coding genes: see here
- a script to get, and parse, all gene trees that contain an input list of S. mansoni genes: see here
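
To give a flavour of what parsing a gene tree involves, here is a minimal sketch (not the script linked above) that pulls the gene name, species and sequence out of each leaf of a PhyloXML tree using Python's standard xml.etree.ElementTree module. The SAMPLE text is hand-written and much simplified, not real WormBase ParaSite output: the second gene, its species and both sequences are invented, and the real output may lay out its elements differently. The function name leaf_genes is also made up for illustration.

```python
import xml.etree.ElementTree as ET

# namespace used by PhyloXML documents
PHYLOXML_NS = {"phy": "http://www.phyloxml.org"}

# Hand-written, simplified sample (NOT real WormBase ParaSite output);
# the second gene, its species and both sequences are invented.
SAMPLE = """<phyloxml xmlns="http://www.phyloxml.org">
  <phylogeny rooted="true">
    <clade>
      <clade branch_length="0.1">
        <name>Smp_089320</name>
        <taxonomy><scientific_name>Schistosoma mansoni</scientific_name></taxonomy>
        <sequence><mol_seq is_aligned="1">MK-LV</mol_seq></sequence>
      </clade>
      <clade branch_length="0.2">
        <name>hypothetical_gene_2</name>
        <taxonomy><scientific_name>Hypothetical species</scientific_name></taxonomy>
        <sequence><mol_seq is_aligned="1">MKALV</mol_seq></sequence>
      </clade>
    </clade>
  </phylogeny>
</phyloxml>"""


def leaf_genes(phyloxml_text):
    """Return one dict per leaf clade (a clade that carries a sequence)."""
    root = ET.fromstring(phyloxml_text)
    leaves = []
    for clade in root.iter("{http://www.phyloxml.org}clade"):
        seq = clade.find("phy:sequence", PHYLOXML_NS)
        if seq is None:
            continue  # internal node: no sequence attached directly
        leaves.append({
            "name": clade.findtext("phy:name", default="", namespaces=PHYLOXML_NS),
            "species": clade.findtext(
                "phy:taxonomy/phy:scientific_name", default="", namespaces=PHYLOXML_NS
            ),
            "sequence": seq.findtext("phy:mol_seq", default="", namespaces=PHYLOXML_NS),
        })
    return leaves
```

Running leaf_genes(SAMPLE) gives a list with one dictionary per gene in the sample tree. With a real tree you would pass in the XML text returned by the genetree query instead of SAMPLE.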

Other notes
- There's also a flat file of all orthologs for S. mansoni, but without alignments (only % identities) on the WormBase ParaSite FTP site:

ftp://ftp.ebi.ac.uk/pub/databases/wormbase/parasite/releases/WBPS13/species/schistosoma_mansoni/PRJEA36577/schistosoma_mansoni.PRJEA36577.WBPS13.orthologs.tsv.gz
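
Once a file like this has been downloaded, it can be read in Python without unzipping it first. This is a minimal sketch that assumes the first line of the file is a header row with tab-separated column names; I have not listed the real column names here, so check the file itself. The function name read_tsv_gz is made up for illustration.

```python
import csv
import gzip


def read_tsv_gz(source):
    """Read a gzipped, tab-separated file with a header row into a list of dicts.

    `source` can be a file path or an open binary file object.
    Assumption: the first line holds the column names.
    """
    with gzip.open(source, "rt", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```

For example, rows = read_tsv_gz("schistosoma_mansoni.PRJEA36577.WBPS13.orthologs.tsv.gz") would give one dictionary per line of the file, keyed by whatever column names appear in its header.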



Acknowledgements

Thank you so much to Wojtek Bazant for all his great help to me, showing me how to use the WormBase ParaSite REST API!