Before, I wrote some notes on finding communities (clusters) in a graph using R. I've now been finding out how to do this in Python.
Here's some options I explored
- I found that the networkx Python module for graphs has a function to find k-clique communities in a graph. However, these communities seem to be very small ones, it split up my network too much!
- I found a nice discussion of community detection in Python on stackoverflow.
- I read that another option would be igraph but it turns out that installing it isn't very easy, so I decided to try something else first..
- I then found that there is a Python module called community that contains some nice functions for finding communities. It can find the best partition of a network into communities, here is my code for this (below). By the way, I found that the 'community' module seems to expect that the input to its 'best_partition' function will be a connected component (ie. each node can be reached from every other node), so I needed to first split my original graph into connected components, and then find communities in each connected component:
import networkx as nx # following example at https://networkx.github.io/examples.html
sys.path.append("/nfs/users/nfs_a/alc/Documents/git/Python/taynaud-python-louvain-6a3696fdce57/") # this is where I downloaded the 'community' module
import community
# first read the graph into networkx, it is in a file with pairs of chemical compounds that I want to be the nodes, and I want an edge between each pair:
G = nx.Graph() # create a graph in networkxnodeset = set() # set of nodes in our graph already
# read in the input file:
fileObj = open(input_pairs_file, "r")
for line in fileObj:
line = line.rstrip()
temp = line.split("\t")
compound1 = int(temp[0])
compound2 = int(temp[1])
similarity = float(temp[2]))
if compound1 not in nodeset:
G.add_node(compound1)
if compound2 not in nodeset:
G.add_node(compound2)
G.add_edge(compound1,compound2, weight=similarity)
fileObj.close()
# first split the graph into connected components:
subgraphs = list(nx.connected_component_subgraphs(G))
for subgraph in subgraphs:
# find the best partition of the data:
partition = community.best_partition(subgraph) # we can now print out this partition if we want
No comments:
Post a Comment