Monday 20 March 2017

Finding communities in a graph using Python

Before, I wrote some notes on finding communities (clusters) in a graph using R. I've now been finding out how to do this in Python.

Here's some options I explored

- I found that the networkx Python module for graphs has a function to find k-clique communities in a graph. However, these communities seem to be very small ones, it split up my network too much!

- I found a nice discussion of community detection in Python on stackoverflow.

- I read that another option would be igraph but it turns out that installing it isn't very easy, so I decided to try something else first..

- I then found that there is a Python module called community that contains some nice functions for finding communities. It can find the best partition of a network into communities, here is my code for this (below). By the way, I found that the 'community' module seems to expect that the input to its 'best_partition' function will be a connected component (ie. each node can be reached from every other node), so I needed to first split my original graph into connected components, and then find communities in each connected component:


import networkx as nx # following example at https://networkx.github.io/examples.html
sys.path.append("/nfs/users/nfs_a/alc/Documents/git/Python/taynaud-python-louvain-6a3696fdce57/") # this is where I downloaded the 'community' module
import community


# first read the graph into networkx, it is in a file with pairs of chemical compounds that I want to be the nodes, and I want an edge between each pair:

G = nx.Graph() # create a graph in networkxnodeset = set() # set of nodes in our graph already
# read in the input file:

fileObj = open(input_pairs_file, "r")
for line in fileObj:
        line = line.rstrip()
        temp = line.split("\t")
        compound1 = int(temp[0])
        compound2 = int(temp[1])

        similarity = float(temp[2]))
        if compound1 not in nodeset:
            G.add_node(compound1)
        if compound2 not in nodeset:
            G.add_node(compound2)

        G.add_edge(compound1,compound2, weight=similarity)
fileObj.close()
 

# first split the graph into connected components:
subgraphs = list(nx.connected_component_subgraphs(G))
for subgraph in subgraphs:
        # find the best partition of the data:
        partition = community.best_partition(subgraph) # we can now print out this partition if we want

No comments: