Friday 31 January 2014

Using Python to find common ancestors (parents) of GO terms

I need a script to find the last common ancestor(s) of two Gene Ontology (GO) terms. Two GO terms may have more than one last common ancestor in the GO hierarchy, via different routes. For example, if the nodes N1...N15 in the diagram below represent GO terms in the GO hierarchy:
the last common ancestors of N1 and N2 are the set {N7, N5}, as they are the last common ancestors via different routes from N1 and N2. Likewise, the last common ancestor of N10 and N11 is N13.
The function find_lcas_of_pair_of_GO_terms in my Python script find_lca_of_go_terms.py finds the last common ancestor(s) of two GO terms in this way. (Thanks to my husband Noel for help with this!) For example, it finds the last common ancestor of the terms GO:0001578 (microtubule bundle formation) and GO:0030036 (actin cytoskeleton organization) to be GO:0007010 (cytoskeleton organization). The same example was given in a biostars discussion, which gave a solution to the problem in Java.
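To make the idea concrete, here is a minimal sketch of one way to find the last common ancestors of two terms. It is not the code in find_lca_of_go_terms.py: the names ancestors and lowest_common_ancestors, the child-to-parents dictionary, and the toy term labels are all made up for illustration. I also assume that a term counts as its own ancestor, so if one term is an ancestor of the other, it is returned as the LCA.

```python
def ancestors(term, parents):
    """Return every ancestor of `term` (the term itself is not included)."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def lowest_common_ancestors(term1, term2, parents):
    """Return the common ancestors that are not ancestors of another common ancestor."""
    # Assumption: each term is treated as an ancestor of itself.
    common = ({term1} | ancestors(term1, parents)) & ({term2} | ancestors(term2, parents))
    redundant = set()
    for term in common:
        redundant |= ancestors(term, parents)
    return common - redundant


# Hypothetical toy hierarchy (child -> list of parents), not real GO data.
parents = {
    "a": ["c", "d"],
    "b": ["c", "d"],
    "c": ["root"],
    "d": ["root"],
}

print(sorted(lowest_common_ancestors("a", "b", parents)))  # ['c', 'd']
print(sorted(lowest_common_ancestors("a", "c", parents)))  # ['c']
```

In the toy hierarchy, "a" and "b" reach "root" by two different routes, so both "c" and "d" are returned, and "root" is dropped because it is an ancestor of both.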

If we want to find the last common ancestors (LCAs) of the GO terms of two genes, where gene 1 has GO terms N1, N10 and gene 2 has GO terms N2, N11, then we can do the following:
(i) find the LCAs for N1 and N2
(ii) find the LCAs for N1 and N11
(iii) find the LCAs for N10 and N2
(iv) find the LCAs for N10 and N11
(v) find the union of the LCAs found in (i), (ii), (iii) and (iv)
(vi) remove any GO term from the set of LCAs found in step (v) if it is an ancestor (in the GO hierarchy) of another GO term in that set.
The function find_lcas_of_GO_terms_for_two_genes in my Python script does this. For this example, it finds the LCAs of the GO terms to be the set {N5, N7, N13}. Taking another example, if your first gene has GO terms GO:0001578, GO:0004104, and GO:0004835 and your second gene has GO terms GO:0030036 and GO:0003990, it finds the LCAs of the GO terms for the two genes to be the set {GO:0004104, GO:0007010}.
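Steps (i) to (vi) can be sketched roughly as follows. Again, this is only an illustration under my own assumptions, not the code in my script: the function names, the child-to-parents dictionary and the toy term labels are hypothetical, and a term is treated as its own ancestor.

```python
from itertools import product


def ancestors(term, parents):
    """Return every ancestor of `term` (the term itself is not included)."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def prune_ancestors(terms, parents):
    """Step (vi): drop any term that is an ancestor of another term in the set."""
    redundant = set()
    for term in terms:
        redundant |= ancestors(term, parents)
    return set(terms) - redundant


def lcas_of_pair(term1, term2, parents):
    """LCAs of two terms, treating each term as its own ancestor (an assumption)."""
    common = ({term1} | ancestors(term1, parents)) & ({term2} | ancestors(term2, parents))
    return prune_ancestors(common, parents)


def lcas_for_two_genes(terms1, terms2, parents):
    """Steps (i)-(v): union the LCAs of every cross pair, then prune (step vi)."""
    union = set()
    for term1, term2 in product(terms1, terms2):
        union |= lcas_of_pair(term1, term2, parents)
    return prune_ancestors(union, parents)


# Hypothetical toy hierarchy (child -> list of parents), not real GO data.
parents = {
    "a": ["c", "d"],
    "b": ["c", "d"],
    "c": ["root"],
    "d": ["root"],
    "e": ["f"],
    "g": ["f"],
    "f": ["root"],
}

print(sorted(lcas_for_two_genes({"a", "e"}, {"b", "g"}, parents)))  # ['c', 'd', 'f']
```

Here the cross pairs ("a", "g") and ("e", "b") only share "root", which is then removed in the final pruning step because it is an ancestor of "c", "d" and "f".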

What if you have more than two genes? The function find_lcas_of_GO_terms_for_many_genes deals with this case. It finds the set of LCAs for the first pair of genes (giving set 1). Then it finds the set of LCAs between the GO terms in set 1 and the GO terms for the third gene (giving set 2). Then it finds the set of LCAs for the GO terms in set 2 and the GO terms for the fourth gene (giving set 3), and so on, until you have gone through all the genes.
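This "fold" over the genes can be sketched with functools.reduce. As before, this is a minimal sketch under my own assumptions rather than the code in my script: all function names, the child-to-parents dictionary and the toy labels are hypothetical, and a term is treated as its own ancestor.

```python
from functools import reduce
from itertools import product


def ancestors(term, parents):
    """Return every ancestor of `term` (the term itself is not included)."""
    seen = set()
    stack = [term]
    while stack:
        node = stack.pop()
        for parent in parents.get(node, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def prune(terms, parents):
    """Drop any term that is an ancestor of another term in the set."""
    redundant = set()
    for term in terms:
        redundant |= ancestors(term, parents)
    return set(terms) - redundant


def lcas_for_two_genes(terms1, terms2, parents):
    """Union of the pairwise LCAs over all cross pairs, with ancestors pruned."""
    union = set()
    for t1, t2 in product(terms1, terms2):
        common = ({t1} | ancestors(t1, parents)) & ({t2} | ancestors(t2, parents))
        union |= prune(common, parents)
    return prune(union, parents)


def lcas_for_many_genes(gene_terms, parents):
    """Combine the running LCA set with each further gene's terms in turn.

    `gene_terms` must be a non-empty list of term sets; with a single gene,
    its terms are returned unchanged.
    """
    return reduce(lambda acc, terms: lcas_for_two_genes(acc, terms, parents), gene_terms)


# Hypothetical toy hierarchy (child -> list of parents), not real GO data.
parents = {
    "a": ["c", "d"],
    "b": ["c", "d"],
    "c": ["root"],
    "d": ["root"],
    "e": ["f"],
    "g": ["f"],
    "f": ["root"],
}

genes = [{"a", "e"}, {"b", "g"}, {"c"}]
print(sorted(lcas_for_many_genes(genes, parents)))  # ['c']
```

The first two genes give the set {"c", "d", "f"}. Combining that set with the third gene's single term "c" leaves only "c", since "d" and "f" share nothing with "c" below "root".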
