Friday 15 July 2016

Finding all descendants of a GO term

I wanted to find all descendants of a GO term in the GO hierarchy. First I downloaded the hierarchy from the GO website:
% wget http://purl.obolibrary.org/obo/go/go-basic.obo

Then I wrote a little Python script to find all the descendants of a GO term:

from collections import defaultdict
import sys
import os

#====================================================================#

# define a function to record the children of each GO term in the GO hierarchy:

def read_go_children(input_go_obo_file):
    """record the children of each GO term in the GO hierarchy"""

    # first read in the input GO file, and make a list of all the GO terms, and the
    # terms below them in the GO hierarchy:
    # eg.
    # [Term]
    # id: GO:0004835
    children = defaultdict(list) # children of a particular GO term, in the hierarchy
    take = 0

    fileObj = open(input_go_obo_file, "r")
    for line in fileObj:
        line = line.rstrip()
        temp = line.split()
        if len(temp) == 1:
           if temp[0] == '[Term]':
               take = 1
        elif len(temp) >= 2 and take == 1:
            if temp[0] == 'id:':
                go = temp[1]
            elif temp[0] == 'is_a:': # eg. is_a: GO:0048308 ! organelle inheritance
                parent = temp[1]
                children[parent].append(go) # record that a child of 'parent' is term 'go'
        elif len(temp) == 0:
            take = 0
    fileObj.close()

    return children

#====================================================================#

# define a function to find all descendants of a GO term in the GO hierarchy:

def find_all_descendants(input_go_term, children):

    """find all the descendants of a GO term in the GO hierarchy
    >> find_all_descendants('GO1', {'GO1': ['GO2', 'GO3'], 'GO2': ['GO4']})
    ['GO1', 'GO2', 'GO3', 'GO4']
    """

    descendants = set()
    queue = []
    queue.append(input_go_term)
    while queue:
        node = queue.pop(0)
        if node in children and node not in descendants: # don't visit a second time
            node_children = children[node]
            queue.extend(node_children)
        descendants.add(node)

    return descendants 

#====================================================================#

def main():
   
    # check the command-line arguments:
    if len(sys.argv) != 3 or os.path.exists(sys.argv[1]) == False:
        print("Usage: %s input_go_obo_file input_go_term" % sys.argv[0])
        sys.exit(1)
    input_go_obo_file = sys.argv[1] # the input gene ontology file eg. gene_ontology.WS238.obo
    input_go_term = sys.argv[2] # the input GO term of interest

    # read in the children of each GO term in the GO hierachy:
    children = read_go_children(input_go_obo_file)

    # find all the descendants of the term 'input_go_term':
    descendants = find_all_descendants(input_go_term, children)
    descendantlist = ",".join(descendants)
    print("DESCENDANTS:",descendantlist)

#====================================================================#

if __name__=="__main__":
    main()

#====================================================================#


To run it I type, for example, to find descendants of the term GO:0008291 (acetylcholine metabolic process):
% python3 find_descendants_of_GO_term.py go-basic.obo 'GO:0008291'
This gives me:
DESCENDANTS: GO:0001507,GO:0006581,GO:0008291,GO:0008292

The descendants are:
GO:0006581 acetylcholine catabolic process 
GO:0001507 acetylcholine catabolic process in synaptic cleft (descendant of acetylcholine catabolic process)
GO:0008292 acetylcholine biosynthetic process

Hurray for Python!

Notes
- I see a discussion on biostars about doing something similar in R. I didn't check if that gives the same result.
- It turns out that you can also do this using WormMine (I don't know if this is for just C. elegans-related GO terms though)

No comments: