Wednesday 19 June 2019

Parsing an XML-format phylogenetic tree file



In a previous post, I wrote about retrieving data from WormBase ParaSite using their REST API.

I found that when you ask WormBase ParaSite for a phylogenetic tree, you can get it in JSON or XML format. XML seemed a good choice, as there are existing tools that can parse phylogenetic trees in XML format. Here's an example:


Retrieving an XML format phylogenetic tree from WormBase
First I retrieved the gene tree containing the gene WBGene00221255 from WormBase ParaSite by running a Python script:
% python3 retrieve_genetree_from_wormbase_parasite.py > example.xml
where the script retrieve_genetree_from_wormbase_parasite.py is:

# script to retrieve the gene tree containing a particular gene from WormBase ParaSite
# example script taken from https://parasite.wormbase.org/rest-13/documentation/info/genetree_member_id

import requests, sys

server = "https://parasite.wormbase.org"
ext = "/rest-13/genetree/member/id/WBGene00221255?"

r = requests.get(server+ext, headers={ "Content-Type" : "text/x-phyloxml+xml", "Accept" : ""})

if not r.ok:
    r.raise_for_status()
    sys.exit()

print(r.text)
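
To fetch the tree for a different gene, the same request can be wrapped in a function. This is a minimal sketch, not something taken from the WormBase ParaSite documentation: the helper names genetree_url and fetch_genetree are made up for illustration, it uses the standard library's urllib.request in place of requests, and it assumes the /rest-13 endpoint and headers from the script above. The network call has not been run against the live server here.

```python
# Sketch only: genetree_url and fetch_genetree are hypothetical helper names,
# not part of the WormBase ParaSite REST API. Python 3, standard library only.
import urllib.request

SERVER = "https://parasite.wormbase.org"


def genetree_url(gene_id, server=SERVER):
    """Build the gene-tree URL used by the script above for a given gene ID."""
    return "{}/rest-13/genetree/member/id/{}?".format(server, gene_id)


def fetch_genetree(gene_id):
    """Return the PhyloXML text of the gene tree that contains gene_id."""
    # Same headers as the script above (assumed to select PhyloXML output)
    request = urllib.request.Request(
        genetree_url(gene_id),
        headers={"Content-Type": "text/x-phyloxml+xml", "Accept": ""},
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Building the URL needs no network access
print(genetree_url("WBGene00221255"))
```

Calling fetch_genetree("WBGene00221255") should then return the same text that the original script prints, provided the server still answers at that address.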


Running retrieve_genetree_from_wormbase_parasite.py gave me an output file, example.xml, that contains an XML-format phylogenetic tree:
<?xml version="1.0" encoding="UTF-8"?>

<phyloxml xsi:schemaLocation="http://www.phyloxml.org http://www.phyloxml.org/1.10/phyloxml.xsd" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns="http://www.phyloxml.org">
  <phylogeny rooted="true" type="gene tree">
    <clade branch_length="0">
      <taxonomy>
        <id>6296</id>
        <scientific_name>Onchocercidae</scientific_name>
      </taxonomy>
      <clade branch_length="0">
        <confidence type="bootstrap">46</confidence>
        <taxonomy>
          <id>6296</id>
          <scientific_name>Onchocercidae</scientific_name>
        </taxonomy>
        <clade branch_length="0.056919">
          <confidence type="bootstrap">59</confidence>
          <taxonomy>
            <id>6296</id>
            <scientific_name>Onchocercidae</scientific_name>
          </taxonomy>
          <clade branch_length="0.140118">
            <confidence type="bootstrap">35</confidence>
            <taxonomy>
              <id>6296</id>
              <scientific_name>Onchocercidae</scientific_name>
            </taxonomy>
            <clade branch_length="0.048399">
              <confidence type="bootstrap">35</confidence>
              <taxonomy>
                <id>6278</id>
                <scientific_name>Brugia</scientific_name>
              </taxonomy>
              <clade branch_length="0.005286">
                <name>WBGene00221255</name>
                <taxonomy>
                  <id>6279</id>
                  <scientific_name>Brugia malayi strain FR3</scientific_name>
                  <common_name>Brugia malayi (PRJNA10729)</common_name>
                </taxonomy>
                <sequence>
                  <accession source="Ensembl">Bm994.1</accession>
                  <location>Bm_007:479867-484217</location>
                  <mol_seq is_aligned="0">MQKQVKGQQHVNELLPSVIQQEDPTNDSLSPLSSARCTDQYQSDKLLAEIRNGVQLRHVIPNDHKQCAKMLYDKACEKKEENNINEKETKLETEKAINIGNILLNAMQKRRLLMHFG</mol_seq>
                </sequence>
                <property datatype="xsd:string" ref="Compara:genome_db_name" applies_to="clade">brugia_malayi_prjna10729</property>
              </clade>
              <clade branch_length="0.108384">
                <name>BPAG_0000765501</name>
                <taxonomy>
                  <id>6280</id>
                  <scientific_name>Brugia pahangi strain Scotland/Glasgow</scientific_name>
                  <common_name>Brugia pahangi (PRJEB497)</common_name>
                </taxonomy>
                <sequence>
                  <accession source="Ensembl">BPAG_0000765501-mRNA-1</accession>
                  <location>BPAG_contig0008935:106-1027</location>
                  <mol_seq is_aligned="0">MRNFYIEIRNGVQLRHVIPNDHKQCAKMLYDKACEKKEENNINEKETKLETERAINIGNILLNAMHKRRLLMHSESSSSSDENINEQKLELWSDTDDNYDDNNDNDNNYNDINKSMNKTQM</mol_seq>
                </sequence>
                <property datatype="xsd:string" ref="Compara:genome_db_name" applies_to="clade">brugia_pahangi_prjeb497</property>
              </clade>
            </clade>
            <clade branch_length="0.046778">
              <name>WBA_0000831701</name>
              <taxonomy>
                <id>6293</id>
                <scientific_name>Wuchereria bancrofti W_bancrofti_Jakarta_v2_0_4</scientific_name>
                <common_name>Wuchereria bancrofti (PRJEB536)</common_name>
              </taxonomy>
              <sequence>
                <accession source="Ensembl">WBA_0000831701-mRNA-1</accession>
                <location>WBA_contig0002889:1472-2626</location>
                <mol_seq is_aligned="0">MQSQVKRQQHVDELLPPVLQQKDPTNDSLSPSSFARCTDQYQSDKLLAQIRSGVQLRHVIPNDHKQCTKMLYDKACEKVIISEENNINEKEVKLETEKTIDIGNILLNAMHKRRLLMRFG</mol_seq>
              </sequence>
              <property datatype="xsd:string" ref="Compara:genome_db_name" applies_to="clade">wuchereria_bancrofti_prjeb536</property>
            </clade>
          </clade>
          <clade branch_length="0.158289">
            <name>LOAG_10794</name>
            <taxonomy>
              <id>7209</id>
              <scientific_name>Loa loa strain L. loa Cameroon isolate</scientific_name>
              <common_name>Loa loa (PRJNA37757)</common_name>
            </taxonomy>
            <sequence>
              <accession source="Ensembl">EFO17702.1</accession>
              <location>JH712271:104276-105625</location>
              <mol_seq is_aligned="0">MKYAKGLAHKTALSGIRIQALLKKIEERQHIDELLPSNIPQNDQINETLSSSSSIHSIDQYQNNKLLAEIRSGVQLRHVVPNEHKQCTEMLYDKTSAKLFIP</mol_seq>
            </sequence>
            <property datatype="xsd:string" ref="Compara:genome_db_name" applies_to="clade">loa_loa_prjna37757</property>
          </clade>
        </clade>
        <clade branch_length="0.514023">
          <name>nLs.2.1.2.g06925</name>
          <taxonomy>
            <id>42156</id>
            <scientific_name>Litomosoides sigmodontis strain Bain lab strain</scientific_name>
            <common_name>Litomosoides sigmodontis (PRJEB3075)</common_name>
          </taxonomy>
          <sequence>
            <accession source="Ensembl">nLs.2.1.2.t06925-RA</accession>
            <location>nLs.2.1.scaf00656:4793-6244</location>
            <mol_seq is_aligned="0">MSEKMPLDRAERIEQHKHELISTDPQQNNLISTNYSASSSSSPPSTQQQRQNDKLLTEINKGIHLRHVIPNAHKQCISMLYDKKCEKSSTTEKDSVKEDETKLEMEKSTDIGNVLLNAMRKRRLLVEFESSSISNEDGNERRLEQWSDSDDTDSNAGKFRN</mol_seq>
          </sequence>
          <property datatype="xsd:string" ref="Compara:genome_db_name" applies_to="clade">litomosoides_sigmodontis_prjeb3075</property>
        </clade>
      </clade>
      <clade branch_length="0.109525">
        <confidence type="bootstrap">85</confidence>
        <taxonomy>
          <id>6296</id>
          <scientific_name>Onchocercidae</scientific_name>
        </taxonomy>
        <clade branch_length="0.207304">
          <name>WBGene00239008</name>
          <taxonomy>
            <id>6282</id>
            <scientific_name>Onchocerca volvulus strain O. volvulus Cameroon isolate</scientific_name>
            <common_name>Onchocerca volvulus (PRJEB513)</common_name>
          </taxonomy>
          <sequence>
            <accession source="Ensembl">OVOC2199.2</accession>
            <location>OVOC_OM1b:16714030-16717575</location>
            <mol_seq is_aligned="0">MNDLSSSSHTDRYVDQNQNNKLLTEIRSRIHLRHVIPNEHKQCTKMLYDKTFDKAIISNEDNVSKKETKLDEQKRIHAENLLINAMRKRRFVMRFESSSSDENETEQELEH</mol_seq>
          </sequence>
          <property datatype="xsd:string" ref="Compara:genome_db_name" applies_to="clade">onchocerca_volvulus_prjeb513</property>
        </clade>
        <clade branch_length="0.254764">
          <name>nDi.2.2.2.g04666</name>
          <taxonomy>
            <id>6287</id>
            <scientific_name>Dirofilaria immitis</scientific_name>
            <common_name>Dirofilaria immitis (PRJEB1797)</common_name>
          </taxonomy>
          <sequence>
            <accession source="Ensembl">nDi.2.2.2.t04666</accession>
            <location>nDi.2.2.scaf00098:142364-144204</location>
            <mol_seq is_aligned="0">MQKVNKAAISDKQIKALLKRISQKEQVREFHAINIQGENSTNDQSLSSSSNHCNDQYQYDKLLTEIRNGIELRHVIPNEHKRCAKMLYDKTIEELFISGEDKINDKQIKLCEQKSTDDIDIENILINAMCKRRFAMRFESSSSDENVSGRELEHWYDADDGNDDE</mol_seq>
          </sequence>
          <property datatype="xsd:string" ref="Compara:genome_db_name" applies_to="clade">dirofilaria_immitis_prjeb1797</property>
        </clade>
      </clade>
    </clade>
    <property datatype="xsd:string" ref="Compara:gene_tree_stable_id" applies_to="phylogeny">WBGT00000000028368</property>
  </phylogeny>
</phyloxml>


When you look at the gene tree in WormBase ParaSite, it looks like this: (tree image not included here)


Parsing an XML format phylogenetic tree using ETE
The next step was to parse the phylogenetic tree, which is in PhyloXML format. Luckily, it turns out that the ETE toolkit (a lovely toolkit for analysing and plotting phylogenetic trees) can parse PhyloXML, and its documentation has some examples.
I also found some useful information in a blog post by Connor Skennerton.
To parse my file 'example.xml', I started Python (using python2):
(Note to self: ete3 does not seem to work with Python 3 for me on the Sanger farm)
% python2
Then in the Python command prompt:
>>> from ete3 import Phyloxml
>>> project = Phyloxml()
>>> project.build_from_file("example.xml")
>>> trees = project.get_phylogeny()
>>> for tree in trees:
...     print tree
...     for node in tree:
...         print "Node name:", node.name
...         print "Species for node:", node.phyloxml_clade.taxonomy[0].get_common_name()
...         for seq in node.phyloxml_clade.get_sequence():
...             mol_seq = seq.get_mol_seq()
...             print "Sequence:", mol_seq.valueOf_
...


This gave output:

               /-WBGene00221255
            /-|
         /-|   \-BPAG_0000765501
        |  |
      /-|   \-WBA_0000831701
     |  |
   /-|   \-LOAG_10794
  |  |
--|   \-nLs.2.1.2.g06925
  |
  |   /-WBGene00239008
   \-|
      \-nDi.2.2.2.g04666
Node name: WBGene00221255
Species for node: ['Brugia malayi (PRJNA10729)']
Sequence: MQKQVKGQQHVNELLPSVIQQEDPTNDSLSPLSSARCTDQYQSDKLLAEIRNGVQLRHVIPNDHKQCAKMLYDKACEKKEENNINEKETKLETEKAINIGNILLNAMQKRRLLMHFG
Node name: BPAG_0000765501
Species for node: ['Brugia pahangi (PRJEB497)']
Sequence: MRNFYIEIRNGVQLRHVIPNDHKQCAKMLYDKACEKKEENNINEKETKLETERAINIGNILLNAMHKRRLLMHSESSSSSDENINEQKLELWSDTDDNYDDNNDNDNNYNDINKSMNKTQM
Node name: WBA_0000831701
Species for node: ['Wuchereria bancrofti (PRJEB536)']
Sequence: MQSQVKRQQHVDELLPPVLQQKDPTNDSLSPSSFARCTDQYQSDKLLAQIRSGVQLRHVIPNDHKQCTKMLYDKACEKVIISEENNINEKEVKLETEKTIDIGNILLNAMHKRRLLMRFG
Node name: LOAG_10794
Species for node: ['Loa loa (PRJNA37757)']
Sequence: MKYAKGLAHKTALSGIRIQALLKKIEERQHIDELLPSNIPQNDQINETLSSSSSIHSIDQYQNNKLLAEIRSGVQLRHVVPNEHKQCTEMLYDKTSAKLFIP
Node name: nLs.2.1.2.g06925
Species for node: ['Litomosoides sigmodontis (PRJEB3075)']
Sequence: MSEKMPLDRAERIEQHKHELISTDPQQNNLISTNYSASSSSSPPSTQQQRQNDKLLTEINKGIHLRHVIPNAHKQCISMLYDKKCEKSSTTEKDSVKEDETKLEMEKSTDIGNVLLNAMRKRRLLVEFESSSISNEDGNERRLEQWSDSDDTDSNAGKFRN
Node name: WBGene00239008
Species for node: ['Onchocerca volvulus (PRJEB513)']
Sequence: MNDLSSSSHTDRYVDQNQNNKLLTEIRSRIHLRHVIPNEHKQCTKMLYDKTFDKAIISNEDNVSKKETKLDEQKRIHAENLLINAMRKRRFVMRFESSSSDENETEQELEH
Node name: nDi.2.2.2.g04666
Species for node: ['Dirofilaria immitis (PRJEB1797)']
Sequence: MQKVNKAAISDKQIKALLKRISQKEQVREFHAINIQGENSTNDQSLSSSSNHCNDQYQYDKLLTEIRNGIELRHVIPNEHKRCAKMLYDKTIEELFISGEDKINDKQIKLCEQKSTDDIDIENILINAMCKRRFAMRFESSSSDENVSGRELEHWYDADDGNDDE

Note that, as Connor Skennerton points out, the documentation for PhyloXML in ete3 is a bit sketchy, but you can find the methods available for an object using the Python 'dir' function, e.g. to find the methods for a node object:
>>> dir(node)
['_PhyloNode__get_speciation_trees_recursive', '__add__', '__and__', '__bool__', '__class__', '__contains__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__iter__', '__len__', '__module__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_asciiArt', '_children', '_diff', '_dist', '_get_children', '_get_dist', '_get_face_areas', '_get_farthest_and_closest_leaves', '_get_name', '_get_species', '_get_style', '_get_support', '_get_up', '_img_style', '_iter_descendants_levelorder', '_iter_descendants_postorder', '_iter_descendants_preorder', '_name', '_set_children', '_set_dist', '_set_face_areas', '_set_name', '_set_species', '_set_style', '_set_support', '_set_up', '_species', '_speciesFunction', '_support', '_up', 'add_child', 'add_face', 'add_feature', 'add_features', 'add_sister', 'annotate_ncbi_taxa', 'build', 'buildChildren', 'check_monophyly', 'children', 'collapse_lineage_specific_expansions', 'compare', 'convert_to_ultrametric', 'copy', 'del_feature', 'delete', 'describe', 'detach', 'dist', 'expand_polytomies', 'export', 'faces', 'features', 'from_parent_child_table', 'from_skbio', 'get_age', 'get_age_balanced_outgroup', 'get_ancestors', 'get_ascii', 'get_cached_content', 'get_children', 'get_closest_leaf', 'get_common_ancestor', 'get_descendant_evol_events', 'get_descendants', 'get_distance', 'get_edges', 'get_farthest_leaf', 'get_farthest_node', 'get_farthest_oldest_leaf', 'get_farthest_oldest_node', 'get_leaf_names', 'get_leaves', 'get_leaves_by_name', 'get_midpoint_outgroup', 'get_monophyletic', 'get_my_evol_events', 'get_sisters', 'get_speciation_trees', 'get_species', 'get_topology_id', 'get_tree_root', 'img_style', 'is_leaf', 'is_root', 'iter_ancestors', 'iter_descendants', 'iter_edges', 'iter_leaf_names', 'iter_leaves', 'iter_prepostorder', 'iter_search_nodes', 'iter_species', 'ladderize', 
'link_to_alignment', 'name', 'ncbi_compare', 'phonehome', 'phyloxml_clade', 'phyloxml_phylogeny', 'populate', 'prune', 'reconcile', 'remove_child', 'remove_sister', 'render', 'resolve_polytomy', 'robinson_foulds', 'search_nodes', 'set_outgroup', 'set_species_naming_function', 'set_style', 'show', 'sort_descendants', 'species', 'split_by_dups', 'standardize', 'support', 'swap_children', 'traverse', 'unroot', 'up', 'write']
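
The full 'dir' listing is long because it includes private and special names as well as the useful ones. Here is a small sketch of one way to trim it down to the public names, split into methods and plain attributes. The helper public_names is made up for illustration, and the Node class is only a stand-in so that the example runs without ete3 installed; with ete3 you would pass a real node instead.

```python
def public_names(obj):
    """Split the public attribute names of obj into (methods, attributes)."""
    methods, attributes = [], []
    for name in dir(obj):
        if name.startswith("_"):
            continue  # skip private and special names
        target = methods if callable(getattr(obj, name)) else attributes
        target.append(name)
    return methods, attributes


class Node(object):
    """Stand-in for an ete3 node, used only to make the sketch runnable."""

    def __init__(self):
        self.name = "WBGene00221255"
        self.dist = 0.005286

    def is_leaf(self):
        return True


methods, attributes = public_names(Node())
print(methods)      # ['is_leaf']
print(attributes)   # ['dist', 'name']
```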

Added 9-Mar-2022: Parsing an XML format phylogenetic tree using BioPython
I've found that I now have problems parsing an XML tree using ETE. It throws an error that I haven't been able to fix:
% export PYTHONPATH=/nfs/users/nfs_a/alc/Documents/PythonModules/ete3/ete-3.0/
% python2
Then in the Python command prompt:
>>> from ete3 import Phyloxml
>>> project = Phyloxml()
>>> project.build_from_file("example.xml")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/nfs/users/nfs_a/alc/Documents/PythonModules/ete3/ete-3.0/ete3/phyloxml/__init__.py", line 61, in build_from_file
    self.build(rootNode)
  File "/nfs/users/nfs_a/alc/Documents/PythonModules/ete3/ete-3.0/ete3/phyloxml/_phyloxml.py", line 464, in build
    self.buildChildren(child, node, nodeName_)
  File "/nfs/users/nfs_a/alc/Documents/PythonModules/ete3/ete-3.0/ete3/phyloxml/_phyloxml.py", line 470, in buildChildren
    obj_.build(child_)
  File "/nfs/users/nfs_a/alc/Documents/PythonModules/ete3/ete-3.0/ete3/phyloxml/_phyloxml_tree.py", line 120, in build
    self.phyloxml_phylogeny.buildAttributes(node, node.attrib, [])
  File "/nfs/users/nfs_a/alc/Documents/PythonModules/ete3/ete-3.0/ete3/phyloxml/_phyloxml.py", line 715, in buildAttributes
    value = find_attr_value_('rerootable', node)
  File "/nfs/users/nfs_a/alc/Documents/PythonModules/ete3/ete-3.0/ete3/phyloxml/_phyloxml.py", line 281, in find_attr_value_
    namespaces = six.itervalues(node.nsmap)
AttributeError: nsmap
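
A quick check that may help narrow this down (a guess, not a confirmed diagnosis): 'nsmap' is an attribute of lxml elements, whereas elements created by the standard library's xml.etree.ElementTree do not have it. If this copy of ete3 uses the standard library parser when lxml cannot be imported, which is an assumption not verified here, then a missing lxml installation could produce exactly this AttributeError. The sketch below, run in the same Python as ete3, shows both facts:

```python
import xml.etree.ElementTree as ET

# Elements from the standard library parser carry no nsmap attribute;
# lxml elements do.
element = ET.fromstring('<phyloxml xmlns="http://www.phyloxml.org"/>')
stdlib_has_nsmap = hasattr(element, "nsmap")
print("stdlib element has nsmap: %s" % stdlib_has_nsmap)

# Check whether lxml can be imported in this Python environment.
try:
    import lxml  # noqa: F401
    lxml_available = True
except ImportError:
    lxml_available = False
print("lxml importable: %s" % lxml_available)
```

If the second line reports that lxml is not importable, installing lxml for that Python would be the first thing to try.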

 
I'm not sure why I get this error, but I found that I can instead use BioPython to parse the PhyloXML trees:
% python2
Then, at the Python command prompt, I got the names of the genes in the tree, their sequences, their species, and the identifier for the whole tree:
>>> from Bio import Phylo
>>> tree = Phylo.read("example.xml", "phyloxml")
>>> for clade in tree.get_terminals():
...         print(clade.name)
WBGene00221255
BPAG_0000765501
nLs.2.1.2.g06925
WBGene00239008
nDi.2.2.2.g04666
>>> for clade in tree.get_terminals():
...         sequences_list = clade.sequences
...         sequence = sequences_list[0]
...         molseq = sequence.mol_seq 
...         print(molseq)
MQKQVKGQQHVNELLPSVIQQEDPTNDSLSPLSSARCTDQYQSDKLLAEIRNGVQLRHVIPNDHKQCAKMLYDKACEKKEENNINEKETKLETEKAINIGNILLNAMQKRRLLMHFG
MRNFYIEIRNGVQLRHVIPNDHKQCAKMLYDKACEKKEENNINEKETKLETERAINIGNILLNAMHKRRLLMHSESSSSSDENINEQKLELWSDTDDNYDDNNDNDNNYNDINKSMNKTQM
MSEKMPLDRAERIEQHKHELISTDPQQNNLISTNYSASSSSSPPSTQQQRQNDKLLTEINKGIHLRHVIPNAHKQCISMLYDKKCEKSSTTEKDSVKEDETKLEMEKSTDIGNVLLNAMRKRRLLVEFESSSISNEDGNERRLEQWSDSDDTDSNAGKFRN
MNDLSSSSHTDRYVDQNQNNKLLTEIRSRIHLRHVIPNEHKQCTKMLYDKTFDKAIISNEDNVSKKETKLDEQKRIHAENLLINAMRKRRFVMRFESSSSDENETEQELEH
MQKVNKAAISDKQIKALLKRISQKEQVREFHAINIQGENSTNDQSLSSSSNHCNDQYQYDKLLTEIRNGIELRHVIPNEHKRCAKMLYDKTIEELFISGEDKINDKQIKLCEQKSTDDIDIENILINAMCKRRFAMRFESSSSDENVSGRELEHWYDADDGNDDE 
>>> for clade in tree.get_terminals():
...         taxonomy_list = clade.taxonomies
...         taxonomy = taxonomy_list[0]
...         species = taxonomy.scientific_name
...         print(species)
Brugia malayi strain FR3
Brugia pahangi strain Scotland/Glasgow
Litomosoides sigmodontis strain Bain lab strain
Onchocerca volvulus strain O. volvulus Cameroon isolate
Dirofilaria immitis

>>> tree_properties = tree.properties
>>> print(tree_properties[0].value)
WBGT00000000028368
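
If neither ETE nor BioPython is available, the same fields can be pulled out with the Python standard library. Every element in the file sits in the http://www.phyloxml.org namespace (see the xmlns attribute on the root element), so a plain find("clade") returns nothing and the namespace has to be given explicitly. This is a minimal sketch in Python 3: leaf_records and tree_id are made-up helper names, and EXAMPLE is a cut-down stand-in for example.xml with shortened sequences.

```python
import xml.etree.ElementTree as ET

# Namespace declared by the xmlns attribute on the <phyloxml> root element
NS = {"p": "http://www.phyloxml.org"}

# Cut-down stand-in for example.xml (two leaves, sequences shortened)
EXAMPLE = """<phyloxml xmlns="http://www.phyloxml.org">
  <phylogeny rooted="true" type="gene tree">
    <clade branch_length="0">
      <clade branch_length="0.005286">
        <name>WBGene00221255</name>
        <taxonomy><scientific_name>Brugia malayi strain FR3</scientific_name></taxonomy>
        <sequence><mol_seq is_aligned="0">MQKQVKGQQH</mol_seq></sequence>
      </clade>
      <clade branch_length="0.108384">
        <name>BPAG_0000765501</name>
        <taxonomy><scientific_name>Brugia pahangi strain Scotland/Glasgow</scientific_name></taxonomy>
        <sequence><mol_seq is_aligned="0">MRNFYIEIRN</mol_seq></sequence>
      </clade>
    </clade>
    <property datatype="xsd:string" ref="Compara:gene_tree_stable_id" applies_to="phylogeny">WBGT00000000028368</property>
  </phylogeny>
</phyloxml>"""


def leaf_records(root):
    """Yield (gene name, species, sequence) for each leaf clade."""
    for clade in root.iter("{http://www.phyloxml.org}clade"):
        if clade.find("p:clade", NS) is not None:
            continue  # internal node: it has child clades
        name = clade.findtext("p:name", namespaces=NS)
        species = clade.findtext("p:taxonomy/p:scientific_name", namespaces=NS)
        seq = clade.findtext("p:sequence/p:mol_seq", default="", namespaces=NS)
        # Remove any whitespace or line breaks inside the sequence text
        yield name, species, "".join(seq.split())


def tree_id(root):
    """Return the gene tree identifier stored as a phylogeny-level property."""
    for prop in root.findall("p:phylogeny/p:property", NS):
        if prop.get("ref") == "Compara:gene_tree_stable_id":
            return prop.text
    return None


root = ET.fromstring(EXAMPLE)
records = list(leaf_records(root))
for name, species, seq in records:
    print(name, species, seq, sep="\t")
print(tree_id(root))
```

To run this on the real file, replace ET.fromstring(EXAMPLE) with ET.parse("example.xml").getroot().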
