Friday 24 March 2017

Finding the list of bioactivities for a ChEMBL target

Here's how to find the list of bioactivities for a ChEMBL target:

From the ChEMBL webpage:
a) Paste the target accession (eg. 'O60760') in the search bar and click on 'Targets'.
b) On the page that you land on, click on the right-hand side where it says 'Please select'.
c) Download the bioactivities; you can then search for the ChEMBL_Id of the molecule you are interested in, and this will give you links to the publications where the bioactivities were reported.

Thanks to Prudence from ChEMBL for this!

Hierarchical clustering a big data set in R

A while ago I wrote some notes on hierarchical clustering using hclust in R. Now I needed to do hierarchical clustering on a big distance matrix, 26,104 x 26,104 in size. Would it run? It turns out that hierarchical clustering is O(n^3), and so is slow to run and needs lots of memory for big data sets; I found some discussions about this on the web at http://stackoverflow.com/. However, I did get it to run, by requesting 25,000 MB of RAM! Hurray! And it only took 15 minutes to run!
   Note, however, that I started with an even bigger data set and found it too big to run hierarchical clustering on. I managed to break my data down into smaller clusters first by finding communities in Python using a community detection algorithm (see finding-communities-in-graph-using.html), and was then able to run hierarchical clustering on each of the clusters found by Python (the biggest of which was the 26,104-element case above).
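As an aside, the same kind of clustering can also be done in Python with scipy; here is a minimal sketch on a tiny made-up distance matrix (note that scipy's linkage() wants the condensed upper-triangle form of the matrix, not the square form):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# a small symmetric distance matrix (hypothetical data)
dist = np.array([[0.0, 0.1, 0.9],
                 [0.1, 0.0, 0.8],
                 [0.9, 0.8, 0.0]])

# linkage() wants the condensed (upper-triangle) form, not the square matrix
condensed = squareform(dist)

# average-linkage hierarchical clustering
Z = linkage(condensed, method="average")

# cut the tree at distance 0.5 to get flat cluster labels
labels = fcluster(Z, t=0.5, criterion="distance")
```

Here the first two items merge at distance 0.1 and so end up in the same flat cluster, while the third stays on its own.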

Monday 20 March 2017

Finding communities in a graph using Python

Previously, I wrote some notes on finding communities (clusters) in a graph using R. I've now been finding out how to do this in Python.

Here are some options I explored:

- I found that the networkx Python module for graphs has a function to find k-clique communities in a graph. However, these communities seemed to be very small ones; it split up my network too much!

- I found a nice discussion of community detection in Python on stackoverflow.

- I read that another option would be igraph, but it turns out that installing it isn't very easy, so I decided to try something else first.

- I then found that there is a Python module called community that contains some nice functions for finding communities. It can find the best partition of a network into communities; here is my code for this (below). By the way, I found that the 'community' module seems to expect that the input to its 'best_partition' function will be a connected component (ie. each node can be reached from every other node), so I needed to first split my original graph into connected components, and then find communities in each connected component:


import sys
import networkx as nx # following example at https://networkx.github.io/examples.html
sys.path.append("/nfs/users/nfs_a/alc/Documents/git/Python/taynaud-python-louvain-6a3696fdce57/") # this is where I downloaded the 'community' module
import community


# first read the graph into networkx, it is in a file with pairs of chemical compounds that I want to be the nodes, and I want an edge between each pair:

G = nx.Graph() # create a graph in networkx
nodeset = set() # set of nodes in our graph already
# read in the input file:

fileObj = open(input_pairs_file, "r")
for line in fileObj:
        line = line.rstrip()
        temp = line.split("\t")
        compound1 = int(temp[0])
        compound2 = int(temp[1])

        similarity = float(temp[2])
        if compound1 not in nodeset:
            G.add_node(compound1)
            nodeset.add(compound1)
        if compound2 not in nodeset:
            G.add_node(compound2)
            nodeset.add(compound2)

        G.add_edge(compound1,compound2, weight=similarity)
fileObj.close()
 

# first split the graph into connected components:
subgraphs = list(nx.connected_component_subgraphs(G))
for subgraph in subgraphs:
        # find the best partition of the data:
        partition = community.best_partition(subgraph) # we can now print out this partition if we want
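The best_partition() function returns a dictionary mapping each node to a community id. Once you have that, it's easy to summarise the communities; a small sketch using a hypothetical partition dictionary (not output from my actual data):

```python
from collections import Counter, defaultdict

# hypothetical partition, in the shape returned by community.best_partition():
# keys are nodes, values are community ids
partition = {1: 0, 2: 0, 3: 0, 4: 1, 5: 1}

# count how many nodes are in each community
sizes = Counter(partition.values())

# or collect the actual members of each community
members = defaultdict(set)
for node, comm in partition.items():
    members[comm].add(node)
```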

Clustering and heatmaps in Python

Not long ago, I wrote a blog post about clustering and heatmaps in R. I'd now like to learn how to do this in Python. This is just a note to myself that I came across a nice blog post on clustering and dendrograms in Python.
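For my own future reference, a minimal sketch of hierarchical clustering and a dendrogram in Python using scipy, on some tiny made-up 2D points (passing no_plot=True here just computes the tree layout; drop it and use matplotlib to draw the actual figure):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

# hypothetical data: five points in 2D
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [10.0, 0.0]])

# Ward-linkage hierarchical clustering on the raw observations
Z = linkage(X, method="ward")

# dendrogram() computes the tree layout; no_plot=True skips drawing
tree = dendrogram(Z, labels=["a", "b", "c", "d", "e"], no_plot=True)
print(tree["ivl"])  # leaf labels in dendrogram order
```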

Saturday 11 March 2017

Fast way to find connected components

I have a big graph, where the nodes are chemicals, and there is an edge between certain pairs of chemicals. My question is: can I find the connected components quickly?

My data is stored in a file called "my_input" like this, with distances between chemicals in the 3rd column:
chemical1 chemical2 0.32
chemical1 chemical3 0.55
etc.
I have 116,506,602 pairs of compounds so it's a lot!

Attempt 1: using R to find connected components (slow!)
This turned out to be very slow. I tried the R script below:

# Function to make a graph
makegraph <- function(myfile)
{
   library("graph")
   mytable <- read.table(file(myfile),header=FALSE,sep="\t") # Store the data in a data frame
   proteins1 <- mytable$V1
   proteins2 <- mytable$V2
   protnames <- c(levels(proteins1),levels(proteins2))
   # Find out how many pairs of proteins there are
   numpairs <- length(proteins1)
   # Find the unique protein names:
   uniquenames <-  unique(protnames)
   # Make a graph for these proteins with no edges:
   mygraph <- new("graphNEL", nodes = uniquenames)
   # Add edges to the graph:
   # See http://rss.acs.unt.edu/Rdoc/library/graph/doc/graph.pdf for more examples
   weights <- rep(1,numpairs)
   mygraph2 <- addEdge(as.vector(proteins1),as.vector(proteins2),mygraph,weights)
                           
   return(mygraph2)    
}

library("RBGL")
mygraph <- makegraph("my_input") # takes in my input file
# First find the connected components in the graph:
myconnectedcomponents <- connectedComp(mygraph)
# Find number of connected components:
numcomponents <- length(myconnectedcomponents)
# write the members to an output file:
sink('connected_components.out')
for (i in 1:numcomponents) {
   component <- myconnectedcomponents[[i]] # Store the connected component in a vector "component"
   print(paste("Component",i,":",component))
}
sink()
print(paste("FINISHED"))


Attempt 2: using Python Networkx to find connected components
In the end I wrote my own Python script to find connected components (see Attempt 3 below), but it was pointed out to me that I should have investigated the Python NetworkX library.

I later did this, and found that it's pretty easy and fast to find connected components using NetworkX.
The code below reads your graph into 'G', finds the connected components, and returns them as subgraphs:

import networkx as nx # following example at https://networkx.github.io/examples.html
G = nx.Graph() # create a graph in networkx
nodeset = set() # set of nodes in our graph already 

# read in the input file with the pairs of nodes:
fileObj = open(input_pairs_file, "r")
for line in fileObj:
        line = line.rstrip()
        temp = line.split("\t")
        compound1 = int(temp[0])
        compound2 = int(temp[1])
        if compound1 not in nodeset:
            G.add_node(compound1)
            nodeset.add(compound1)
        if compound2 not in nodeset:
            G.add_node(compound2)
            nodeset.add(compound2)
        G.add_edge(compound1,compound2)

fileObj.close()
subgraphs = list(nx.connected_component_subgraphs(G))
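One caveat for anyone reading this later: connected_component_subgraphs() was removed in NetworkX 2.4, so with newer NetworkX versions the equivalent is to take subgraphs of the node sets yielded by connected_components(). A sketch on a tiny made-up graph:

```python
import networkx as nx

# tiny made-up graph with two connected components
G = nx.Graph()
G.add_edges_from([(1, 2), (2, 3), (4, 5)])

# connected_components() yields one set of nodes per component
components = list(nx.connected_components(G))

# the equivalent of the old connected_component_subgraphs():
subgraphs = [G.subgraph(c).copy() for c in nx.connected_components(G)]
```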


Attempt 3: using Python to find connected components (fast!)
This attempt was much faster than the first one. The key was to store as much as possible in lists and temporary files rather than dictionaries. It took 28 seconds to run on my big data set! : )
I guess the reason is that R creates an object for each node, which probably takes loads of memory. Here I store all the info in small lists, using the node number as the index, which is faster.
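For what it's worth, the cluster-merging logic in the script below is doing by hand what the classic union-find (disjoint-set) structure does. A minimal sketch of that approach (not my actual script), which also finds connected components in near-linear time:

```python
def find(parent, x):
    # follow parent pointers to the root, halving the path as we go
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def connected_components(num_nodes, edges):
    # nodes are numbered 1..num_nodes, as after relabelling
    parent = list(range(num_nodes + 1))
    for a, b in edges:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[rb] = ra  # union the two components
    # group nodes by their root to get the components
    comps = {}
    for node in range(1, num_nodes + 1):
        comps.setdefault(find(parent, node), set()).add(node)
    return list(comps.values())
```

For example, connected_components(5, [(1, 2), (2, 3), (4, 5)]) gives two components, {1, 2, 3} and {4, 5}.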

import sys
import os
from collections import defaultdict

#====================================================================#

# read in the original compound name for each compound:

def read_original_compound_names(output_file_node_labels):

    # define a list for storing the original compound names:
    original_compound_name = list()
    fileObj = open(output_file_node_labels,"r")
    for line in fileObj:
        line = line.rstrip()
        temp = line.split("\t")
        original_name = temp[0]
        original_compound_name.append(original_name)
    fileObj.close()

    return original_compound_name

#====================================================================#

# The input file has the connected nodes, in format node1\tnode2\tdist
# We want to relabel the nodes as 1,2,3...

def relabel_nodes(input_pairs_file, output_pairs_file, output_file_node_labels):

    # define a dictionary to store the node number for an original compound name:
    node_number = defaultdict()

    # open an output file for putting the nodes relabelled as 1,2,3...
    outputfileObj = open(output_pairs_file, "w")

    # open an output file for putting the orignal node names corresponding to the node labels 1,2,3...
    outputfileObj2 = open(output_file_node_labels, "w")

    # read in the input file:
    num_compounds = 0
    fileObj = open(input_pairs_file, "r")
    for line in fileObj:
        line = line.rstrip()
        temp = line.split("\t")

        compound1 = temp[0]
        compound2 = temp[1]
        for compound in [compound1, compound2]:
            if compound not in node_number:
                num_compounds += 1
                node_number[compound] = num_compounds
                line2 = "%s\t%s\n" % (compound, num_compounds)
                outputfileObj2.write(line2)
        line = "%s\t%s\n" % (node_number[compound1], node_number[compound2])
        outputfileObj.write(line)
    fileObj.close()
    outputfileObj.close()
    outputfileObj2.close()

    return num_compounds

#====================================================================#

# read in the new version of the input file (output_pairs_file), and store the cluster number
# for each compound in a list 'compound_cluster_number':

def store_compounds_cluster_number(output_pairs_file, num_compounds):

    # define a list for storing the cluster number of compounds: (a list rather than a dictionary to save space)
    # we can initialise the list to contain zeroes
    compound_cluster_number = [0] * num_compounds

    # define a list of sets, for storing the clusters that a particular cluster should be merged to:
    clusters_to_merge_with = list()

    # define a list of sets, for storing the members of each cluster:
    members_of_cluster = list()

    # read in the input file:
    num_clusters = 0
    fileObj = open(output_pairs_file, "r")
    for line in fileObj:
        line = line.rstrip()
        temp = line.split("\t")
        compound1 = int(temp[0])
        compound2 = int(temp[1])
        (compound_cluster_number,clusters_to_merge_with,members_of_cluster,num_clusters) = store_cluster_numbers_for_two_compounds(compound1,compound2,compound_cluster_number,clusters_to_merge_with,members_of_cluster,num_clusters)
    fileObj.close()
    return(compound_cluster_number, members_of_cluster, clusters_to_merge_with, num_clusters)

#====================================================================#

# take a pair of compounds, and store the cluster numbers for them:

def store_cluster_numbers_for_two_compounds(compound1,compound2,compound_cluster_number,clusters_to_merge_with,members_of_cluster,num_clusters):
    """store cluster numbers for two compounds
    >>> store_cluster_numbers_for_two_compounds(3,4,[1,1,0,0],[set()],[{1,2}], 1) # example where compound1 and compound2 (nodes 3,4) are not in any cluster yet, and we have one cluster so far with nodes 1,2   
    ([1, 1, 2, 2], [set(), set()], [{1, 2}, {3, 4}], 2)
    >>> store_cluster_numbers_for_two_compounds(2,3,[1,1,0],[set()],[{1,2}], 1) # example where compound1 (node 2) is in a cluster already, but compound2 (node 3) is not, and we have one cluster so far with nodes 1,2
    ([1, 1, 1], [set()], [{1, 2, 3}], 1)
    >>> store_cluster_numbers_for_two_compounds(2,3,[1,0,1],[set()],[{1,3}], 1) # example where compound2 (node 3) is in a cluster already, but compound1 (node 2) is not, and we have one cluster so far with nodes 1,3
    ([1, 1, 1], [set()], [{1, 2, 3}], 1)
    >>> store_cluster_numbers_for_two_compounds(1,3,[1,1,2,2],[set(),set()],[{1,2},{3,4}], 2) # example where we have two clusters already, with nodes 1,2 and 3,4, and compound1 (node 1) and compound2 (node 3) belong to different ones
    ([1, 1, 2, 2], [{2}, {1}], [{1, 2}, {3, 4}], 2)
    """
    # tested by typing: python3 -m doctest find_connected_components.py

    if compound_cluster_number[compound1-1] != 0 and compound_cluster_number[compound2-1] == 0: # compound1 is in a cluster but compound2 is not
        cluster_number = compound_cluster_number[compound1-1]
        compound_cluster_number[compound2-1] = cluster_number      
        members_of_cluster[cluster_number-1].add(compound2)
    elif compound_cluster_number[compound1-1] == 0 and compound_cluster_number[compound2-1] != 0: # compound1 is not in a cluster, but compound2 is
        cluster_number = compound_cluster_number[compound2-1]
        compound_cluster_number[compound1-1] = cluster_number      
        members_of_cluster[cluster_number-1].add(compound1)
    elif compound_cluster_number[compound1-1] == 0 and compound_cluster_number[compound2-1] == 0: # compound1 and compound2 are not in any cluster yet
        num_clusters += 1
        clusters_to_merge_with.append(set())
        members_of_cluster.append(set([compound1,compound2]))
        compound_cluster_number[compound1-1] = num_clusters
        compound_cluster_number[compound2-1] = num_clusters
    elif compound_cluster_number[compound1-1] != 0 and compound_cluster_number[compound2-1] != 0: # compound1 and compound2 are in clusters already
        if compound_cluster_number[compound1-1] != compound_cluster_number[compound2-1]: # they are not in the same cluster
            # we need to merge these clusters:
            cluster_number1 = compound_cluster_number[compound1-1]
            cluster_number2 = compound_cluster_number[compound2-1]
            clusters_to_merge_with[cluster_number1-1].add(cluster_number2)
            clusters_to_merge_with[cluster_number2-1].add(cluster_number1)
    else:
        print("ERROR: should not arrive here")
        sys.exit(1)
    return (compound_cluster_number,clusters_to_merge_with,members_of_cluster, num_clusters)

#====================================================================#

# merge all clusters that need merging:

def merge_clusters(compound_cluster_number, members_of_cluster, clusters_to_merge_with):
    """merge all clusters that need merging
    >>> merge_clusters([1, 1, 2, 2], [{1, 2}, {3, 4}], [{2}, {1}]) # there are two clusters, which should be merged into one
    [{1, 2, 3, 4}, set()]
    >>> merge_clusters([1, 1, 2, 2], [{1, 2}, {3, 4}], [set(), set()]) # there are two clusters, which should not be merged
    [{1, 2}, {3, 4}]
    >>> merge_clusters([1, 1, 2, 2, 3], [{1, 2}, {3, 4}, {5}], [{2}, {1, 3}, {2}]) # there are three clusters, which should be merged
    [{1, 2, 3, 4, 5}, set(), set()]
    >>> merge_clusters([1, 1, 2, 2, 3], [{1, 2}, {3, 4}, {5}], [set(), {3}, {2}]) # there are three clusters, the last 2 should be merged
    [{1, 2}, {3, 4, 5}, set()]
    >>> merge_clusters([1, 1, 2, 2, 3], [{1, 2}, {3, 4}, {5}], [{3}, set(), {1}]) # there are three clusters, the first and third should be merged
    [{1, 2, 5}, {3, 4}, set()]
    >>> merge_clusters([1, 1, 2, 2, 3, 4], [{1, 2}, {3, 4}, {5}, {6}], [{3}, set(), {1,4}, {3}]) # there are 4 clusters, the 1st, 3rd and 4th should be merged
    [{1, 2, 5, 6}, {3, 4}, set(), set()]
    """
    # tested using: python3 -m doctest find_connected_components.py

    for cluster_index, this_cluster_set_to_merge_with in enumerate(clusters_to_merge_with):
        if len(this_cluster_set_to_merge_with) > 0: # if this cluster should be merged with one or more other clusters
            # find the full set of clusters that this cluster should be merged with:
            full_set_to_merge_with = find_full_set_to_merge_with(this_cluster_set_to_merge_with, clusters_to_merge_with, cluster_index + 1)
            for cluster_number_to_merge_with in full_set_to_merge_with:
                assert(len(members_of_cluster[cluster_number_to_merge_with-1]) > 0)
                for compound in members_of_cluster[cluster_number_to_merge_with-1]:
                    # don't need this: compound_cluster_number[compound-1] = cluster_index
                    members_of_cluster[cluster_index].add(compound)
                clusters_to_merge_with[cluster_number_to_merge_with-1] = set()
                members_of_cluster[cluster_number_to_merge_with-1] = set()

    return members_of_cluster

#====================================================================#

# find the full set of clusters that this cluster should be merged with:

def find_full_set_to_merge_with(this_cluster_set_to_merge_with, clusters_to_merge_with, this_cluster_number):
    """find the full set of clusters that this cluster should be merged with
    >>> find_full_set_to_merge_with({2}, [{2}, {1}], 1) # this cluster is cluster 1, it needs to be merged with cluster 2
    {2}
    >>> find_full_set_to_merge_with({1}, [{2}, {1}], 2) # this cluster is cluster 1, it needs to be merged with cluster 1
    {1}
    >>> find_full_set_to_merge_with({2}, [{2}, {1, 3}, {2}], 1) # this cluster is cluster 1, it needs to be merged with clusters 2,3
    {2, 3}
    >>> find_full_set_to_merge_with({2}, [{2}, {1, 3}, {2, 4}, {3}], 1) # this cluster is cluster 1, it needs to be merged with clusters 2,3,4
    {2, 3, 4}
    >>> find_full_set_to_merge_with({2}, [{2}, {1, 3}, {2}, set()], 1) # this cluster is cluster 1, it needs to be merged with clusters 2,3
    {2, 3}
    """

    full_set_to_merge_with = set()
    found_some_new = True

    while found_some_new == True:
        new_to_merge_with = set()
        for cluster_number_to_merge_with in this_cluster_set_to_merge_with:
            if cluster_number_to_merge_with != this_cluster_number and cluster_number_to_merge_with not in full_set_to_merge_with:
                full_set_to_merge_with.add(cluster_number_to_merge_with)
                if len(clusters_to_merge_with[cluster_number_to_merge_with-1]) > 0:
                    new_to_merge_with.update(clusters_to_merge_with[cluster_number_to_merge_with-1])
        if len(new_to_merge_with) == 0:
            found_some_new = False
        else:
            this_cluster_set_to_merge_with.update(new_to_merge_with)

    return full_set_to_merge_with

#====================================================================#

# print out the compounds in each cluster: 

def print_out_clusters(members_of_cluster, original_compound_name, output_file):

    outputfileObj = open(output_file, "w")

    cluster_num = 0
    for cluster_index, this_cluster_members in enumerate(members_of_cluster):
        if len(this_cluster_members) > 0:
            cluster_num += 1
            for compound in this_cluster_members:
                compound_name = original_compound_name[compound-1]
                line = "%s\t%s\n" % (cluster_num, compound_name)
                outputfileObj.write(line)
    outputfileObj.close()

    return

#====================================================================#

def main():
   
    # check the command-line arguments:
    if len(sys.argv) != 5 or os.path.exists(sys.argv[1]) == False:
        print("Usage: %s input_pairs_file output_pairs_file output_file_node_labels output_file" % sys.argv[0])
        sys.exit(1)
    input_pairs_file = sys.argv[1] # input file of connected nodes. This has format node1\tnode2\tdist. eg. reformatted_data_anthelminticdists.filt2b_cutoff0.65
    output_pairs_file = sys.argv[2] # output file of connected nodes, with the nodes labelled as 1,2,3...
    output_file_node_labels = sys.argv[3] # output file with the original node names corresponding to node labels 1,2,3...
    output_file = sys.argv[4] # output file for writing out the members of each connected component

    # relabel the nodes as 1,2,3...
    num_compounds = relabel_nodes(input_pairs_file, output_pairs_file, output_file_node_labels)

    # read in the new version of the input file (output_pairs_file), and store the cluster number
    # for each compound in a list 'compound_cluster_number':
    (compound_cluster_number, members_of_cluster, clusters_to_merge_with, num_clusters) = store_compounds_cluster_number(output_pairs_file, num_compounds)

    # now merge all clusters that need merging:
    members_of_cluster = merge_clusters(compound_cluster_number, members_of_cluster, clusters_to_merge_with)

    # now read in the original compound name for each compound:
    original_compound_name = read_original_compound_names(output_file_node_labels)

    # now print out the compounds in each cluster: 
    print_out_clusters(members_of_cluster, original_compound_name, output_file)

#====================================================================#

if __name__=="__main__":
    main()

#====================================================================#