Saturday, 11 March 2017

Fast way to find connected components

I have a big graph, where the nodes are chemicals, and there is an edge between any two chemicals. My question is: can I find connected components?

My data is stored in a file called "my_input" like this, with distances between chemicals in the 3rd column:
chemical1 chemical2 0.32
chemical1 chemical3 0.55
etc.
I have 116,506,602 pairs of compounds so it's a lot!

Attempt 1: using R to find connected components (slow!)
This turned out to be very slow. I tried using an R script below:

# Function to make a graph
makegraph <- function(myfile)
{
   library("graph")
   mytable <- read.table(file(myfile),header=FALSE,sep="\t") # Store the data in a data frame
   proteins1 <- mytable$V1
   proteins2 <- mytable$V2
   protnames <- c(levels(proteins1),levels(proteins2))
   # Find out how many pairs of proteins there are
   numpairs <- length(proteins1)
   # Find the unique protein names:
   uniquenames <-  unique(protnames)
   # Make a graph for these proteins with no edges:
   mygraph <- new("graphNEL", nodes = uniquenames)
   # Add edges to the graph:
   # See http://rss.acs.unt.edu/Rdoc/library/graph/doc/graph.pdf for more examples
   weights <- rep(1,numpairs)
   mygraph2 <- addEdge(as.vector(proteins1),as.vector(proteins2),mygraph,weights)
                           
   return(mygraph2)    
}

library("RBGL")
mygraph <- makegraph("my_input") # takes in my input file
# First find the connected components in the graph:
myconnectedcomponents <- connectedComp(mygraph)
# Find number of connected components:
numcomponents <- length(myconnectedcomponents)
# write the members to an output file:
sink('connected_components.out')
for (i in 1:numcomponents) {
   component <- myconnectedcomponents[[i]] # Store the connected component in a vector "component"
   print(paste("Component",i,":",component))
}
sink()
print(paste("FINISHED"))


Attempt 2: using Python Networkx to find connected components
In the end I wrote my own Python script to find connected components (see Attempt 3 below), but it was pointed out to me that I should have investigated the Python NetworkX library.

I later did this, and found it's pretty easy and fast using NetworkX to find connected components.
This will let you read in your graph as 'G', then find connected components, and return them as subgraphs:

import networkx as nx # following example at https://networkx.github.io/examples.html
G = nx.Graph() # create a graph in networkx
nodeset = set() # set of nodes in our graph already 

# read in the input file with the pairs of nodes:
fileObj = open(output_pairs_file, "r")
for line in fileObj:
        line = line.rstrip()
        temp = line.split("\t")
        compound1 = int(temp[0])
        compound2 = int(temp[1])
        if compound1 not in nodeset:
            G.add_node(compound1)
        if compound2 not in nodeset:
            G.add_node(compound2)
        G.add_edge(compound1,compound2)

fileObj.close()
subgraphs = list(nx.connected_component_subgraphs(G))


Attempt 3: using Python to find connected components (fast!)
This attempt was much faster than the first one. The key was to store as much as possible in arrays and temporary files rather than hashes. It took 28 seconds to run on my big data set! : )
I guess the reason is that R is trying to create an object for each node, and probably takes loads of memory to do this. I try to store all the info in small arrays, using the node number as the index, so that's faster..

import sys
import os
from collections import defaultdict

#====================================================================#

# read in the original compound name for each compound:

def read_original_compound_names(output_file_node_labels):

    # define a list for storing the original compound names:
    original_compound_name = list()
    fileObj = open(output_file_node_labels,"r")
    for line in fileObj:
        line = line.rstrip()
        temp = line.split("\t")
        original_name = temp[0]
        original_compound_name.append(original_name)
    fileObj.close()

    return original_compound_name

#====================================================================#

# The input file has the connected nodes, in format node1\tnode2\tdist
# We want to relabel the nodes as 1,2,3...

def relabel_nodes(input_pairs_file, output_pairs_file, output_file_node_labels):

    # define a dictionary to store the node number for an original compound name:
    node_number = defaultdict()

    # open an output file for putting the nodes relabelled as 1,2,3...
    outputfileObj = open(output_pairs_file, "w")

    # open an output file for putting the orignal node names corresponding to the node labels 1,2,3...
    outputfileObj2 = open(output_file_node_labels, "w")

    # read in the input file:
    num_compounds = 0
    fileObj = open(input_pairs_file, "r")
    for line in fileObj:
        line = line.rstrip()
        temp = line.split("\t")

        compound1 = temp[0]
        compound2 = temp[1]
        for compound in [compound1, compound2]:
            if compound not in node_number:
                num_compounds += 1
                node_number[compound] = num_compounds
                line2 = "%s\t%s\n" % (compound, num_compounds)
                outputfileObj2.write(line2)
        line = "%s\t%s\n" % (node_number[compound1], node_number[compound2])
        outputfileObj.write(line)
    fileObj.close()
    outputfileObj.close()
    outputfileObj2.close()

    return num_compounds

#====================================================================#

# read in the new version of the input file (output_pairs_file), and store the cluster number
# for each compound in a list 'compound_cluster_number':

def store_compounds_cluster_number(output_pairs_file, num_compounds):

    # define a list for storing the cluster number of compounds: (a list rather than a dictionary to store space)
    # we can initialise the list to contain zeroes
    compound_cluster_number = [0] * num_compounds

    # define a list of sets, for storing the clusters that a particular cluster should be merged to:
    clusters_to_merge_with = list()

    # define a list of sets, for storing the members of each cluster:
    members_of_cluster = list()

    # read in the input file:
    num_clusters = 0
    fileObj = open(output_pairs_file, "r")
    for line in fileObj:
        line = line.rstrip()
        temp = line.split("\t")
        compound1 = int(temp[0])
        compound2 = int(temp[1])
        (compound_cluster_number,clusters_to_merge_with,members_of_cluster,num_clusters) = store_cluster_numbers_for_two_compounds(compound1,compound2,compound_cluster_number,clusters_to_merge_with,members_of_cluster,num_clusters)
    fileObj.close()
    return(compound_cluster_number, members_of_cluster, clusters_to_merge_with, num_clusters)

#====================================================================#

# take a pair of compounds, and store the cluster numbers for them:

def store_cluster_numbers_for_two_compounds(compound1,compound2,compound_cluster_number,clusters_to_merge_with,members_of_cluster,num_clusters):
    """store cluster numbers for two compounds
    >>> store_cluster_numbers_for_two_compounds(3,4,[1,1,0,0],[set()],[{1,2}], 1) # example where compound1 and compound2 (nodes 3,4) are not in any cluster yet, and we have one cluster so far with nodes 1,2   
    ([1, 1, 2, 2], [set(), set()], [{1, 2}, {3, 4}], 2)
    >>> store_cluster_numbers_for_two_compounds(2,3,[1,1,0],[set()],[{1,2}], 1) # example where compound1 (node 2) is in a cluster already, but compound2 (node 3) is not, and we have one cluster so far with nodes 1,2
    ([1, 1, 1], [set()], [{1, 2, 3}], 1)
    >>> store_cluster_numbers_for_two_compounds(2,3,[1,0,1],[set()],[{1,3}], 1) # example where compound2 (node 3) is in a cluster already, but compound1 (node 2) is not, and we have one cluster so far with nodes 1,3
    ([1, 1, 1], [set()], [{1, 2, 3}], 1)
    >>> store_cluster_numbers_for_two_compounds(1,3,[1,1,2,2],[set(),set()],[{1,2},{3,4}], 2) # example where we have two clusters already, with nodes 1,2 and 3,4, and compound1 (node 1) and compound2 (node 3) belong to different ones
    ([1, 1, 2, 2], [{2}, {1}], [{1, 2}, {3, 4}], 2)
    """
    # tested by typing: python3 -m doctest find_connected_components.py

    if compound_cluster_number[compound1-1] != 0 and compound_cluster_number[compound2-1] == 0: # compound1 is in a cluster but compound2 is not
        cluster_number = compound_cluster_number[compound1-1]
        compound_cluster_number[compound2-1] = cluster_number      
        members_of_cluster[cluster_number-1].add(compound2)
    elif compound_cluster_number[compound1-1] == 0 and compound_cluster_number[compound2-1] != 0: # compound1 is not in a cluster, but compound2 is
        cluster_number = compound_cluster_number[compound2-1]
        compound_cluster_number[compound1-1] = cluster_number      
        members_of_cluster[cluster_number-1].add(compound1)
    elif compound_cluster_number[compound1-1] == 0 and compound_cluster_number[compound2-1] == 0: # compound1 and compound2 are not in any cluster yet
        num_clusters += 1
        clusters_to_merge_with.append(set())
        members_of_cluster.append(set([compound1,compound2]))
        compound_cluster_number[compound1-1] = num_clusters
        compound_cluster_number[compound2-1] = num_clusters
    elif compound_cluster_number[compound1-1] != 0 and compound_cluster_number[compound2-1] != 0: # compound1 and compound2 are in clusters already
        if compound_cluster_number[compound1-1] != compound_cluster_number[compound2-1]: # they are not in the same cluster
            # we need to merge these clusters:
            cluster_number1 = compound_cluster_number[compound1-1]
            cluster_number2 = compound_cluster_number[compound2-1]
            clusters_to_merge_with[cluster_number1-1].add(cluster_number2)
            clusters_to_merge_with[cluster_number2-1].add(cluster_number1)
    else:
        print("ERROR: should not arrive here")
        sys.exit(1)
    return (compound_cluster_number,clusters_to_merge_with,members_of_cluster, num_clusters)

#====================================================================#

# merge all clusters that need merging:

def merge_clusters(compound_cluster_number, members_of_cluster, clusters_to_merge_with):
    """merge all clusters that need merging
    >>> merge_clusters([1, 1, 2, 2], [{1, 2}, {3, 4}], [{2}, {1}]) # there are two clusters, which should be merged into one
    [{1, 2, 3, 4}, set()]
    >>> merge_clusters([1, 1, 2, 2], [{1, 2}, {3, 4}], [set(), set()]) # there are two clusters, which should not be merged
    [{1, 2}, {3, 4}]
    >>> merge_clusters([1, 1, 2, 2, 3], [{1, 2}, {3, 4}, {5}], [{2}, {1, 3}, {2}]) # three are three clusters, which should be merged
    [{1, 2, 3, 4, 5}, set(), set()]
    >>> merge_clusters([1, 1, 2, 2, 3], [{1, 2}, {3, 4}, {5}], [set(), {3}, {2}]) # there are three clusters, the last 2 should be merged
    [{1, 2}, {3, 4, 5}, set()]
    >>> merge_clusters([1, 1, 2, 2, 3], [{1, 2}, {3, 4}, {5}], [{3}, set(), {1}]) # there are three clusters, the first and third should be merged
    [{1, 2, 5}, {3, 4}, set()]
    >>> merge_clusters([1, 1, 2, 2, 3, 4], [{1, 2}, {3, 4}, {5}, {6}], [{3}, set(), {1,4}, {3}]) # there are 4 clusters, the 1st, 3rd and 4th should be merged
    [{1, 2, 5, 6}, {3, 4}, set(), set()]
    """
    # tested using: python3 -m doctest find_connected_components.py

    for cluster_index, this_cluster_set_to_merge_with in enumerate(clusters_to_merge_with):
        if len(this_cluster_set_to_merge_with) > 0: # if this cluster should be merged with one or more other clusters
            # find the full set of clusters that this cluster should be merged with:
            full_set_to_merge_with = find_full_set_to_merge_with(this_cluster_set_to_merge_with, clusters_to_merge_with, cluster_index + 1)
            for cluster_number_to_merge_with in full_set_to_merge_with:
                assert(len(members_of_cluster[cluster_number_to_merge_with-1]) > 0)
                for compound in members_of_cluster[cluster_number_to_merge_with-1]:
                    # don't need this: compound_cluster_number[compound-1] = cluster_index
                    members_of_cluster[cluster_index].add(compound)
                clusters_to_merge_with[cluster_number_to_merge_with-1] = set()
                members_of_cluster[cluster_number_to_merge_with-1] = set()

    return members_of_cluster

#====================================================================#

# find the full set of clusters that this cluster should be merged with:

def find_full_set_to_merge_with(this_cluster_set_to_merge_with, clusters_to_merge_with, this_cluster_number):
    """find the full set of clusters that this cluster should be merged with
    >>> find_full_set_to_merge_with({2}, [{2}, {1}], 1) # this cluster is cluster 1, it needs to be merged with cluster 2
    {2}
    >>> find_full_set_to_merge_with({1}, [{2}, {1}], 2) # this cluster is cluster 1, it needs to be merged with cluster 1
    {1}
    >>> find_full_set_to_merge_with({2}, [{2}, {1, 3}, {2}], 1) # this cluster is cluster 1, it needs to be merged with clusters 2,3
    {2, 3}
    >>> find_full_set_to_merge_with({2}, [{2}, {1, 3}, {2, 4}, {3}], 1) # this cluster is cluster 1, it needs to be merged with clusters 2,3,4
    {2, 3, 4}
    >>> find_full_set_to_merge_with({2}, [{2}, {1, 3}, {2}, set()], 1) # this cluster is cluster 1, it needs to be merged with clusters 2,3
    {2, 3}
    """

    full_set_to_merge_with = set()
    found_some_new = True

    while found_some_new == True:
        new_to_merge_with = set()
        for cluster_number_to_merge_with in this_cluster_set_to_merge_with:
            if cluster_number_to_merge_with != this_cluster_number and cluster_number_to_merge_with not in full_set_to_merge_with:
                full_set_to_merge_with.add(cluster_number_to_merge_with)
                if len(clusters_to_merge_with[cluster_number_to_merge_with-1]) > 0:
                    new_to_merge_with.update(clusters_to_merge_with[cluster_number_to_merge_with-1])
        if len(new_to_merge_with) == 0:
            found_some_new = False
        else:
            this_cluster_set_to_merge_with.update(new_to_merge_with)

    return full_set_to_merge_with

#====================================================================#

# print out the compounds in each cluster: 

def print_out_clusters(members_of_cluster, original_compound_name, output_file):

    outputfileObj = open(output_file, "w")

    cluster_num = 0
    for cluster_index, this_cluster_members in enumerate(members_of_cluster):
        if len(this_cluster_members) > 0:
            cluster_num += 1
            for compound in this_cluster_members:
                compound_name = original_compound_name[compound-1]
                line = "%s\t%s\n" % (cluster_num, compound_name)
                outputfileObj.write(line)
                compound_name = original_compound_name[compound-1]
                line = "%s\t%s\n" % (cluster_num, compound_name)
                outputfileObj.write(line)
    outputfileObj.close()

    return

#====================================================================#

def main():
   
    # check the command-line arguments:
    if len(sys.argv) != 5 or os.path.exists(sys.argv[1]) == False:
        print("Usage: %s input_pairs_file output_pairs_file output_file_node_labels, output_file" % sys.argv[0])
        sys.exit(1)
    input_pairs_file = sys.argv[1] # input file of connected nodes. This has format node1\tnode2\tdist. eg. reformatted_data_anthelminticdists.filt2b_cutoff0.65
    output_pairs_file = sys.argv[2] # output file of connected nodes, with the nodes labelled as 1,2,3...
    output_file_node_labels = sys.argv[3] # output file with the original node names corresponding to node labels 1,2,3...
    output_file = sys.argv[4] # output file for writing out the members of each connected component

    # relabel the  nodes as 1,2,3...
    num_compounds = relabel_nodes(input_pairs_file, output_pairs_file, output_file_node_labels)

    # read in the new version of the input file (output_pairs_file), and store the cluster number
    # for each compound in a list 'compound_cluster_number':
    (compound_cluster_number, members_of_cluster, clusters_to_merge_with, num_clusters) = store_compounds_cluster_number(output_pairs_file, num_compounds)

    # now merge all clusters that need merging:
    members_of_cluster = merge_clusters(compound_cluster_number, members_of_cluster, clusters_to_merge_with)

    # now read in the original compound name for each compound:
    original_compound_name = read_original_compound_names(output_file_node_labels)

    # now print out the compounds in each cluster: 
    print_out_clusters(members_of_cluster, original_compound_name, output_file)

#====================================================================#

if __name__=="__main__":
    main()

#====================================================================#


No comments: