Thursday 2 February 2017

Finding communities (clusters) in a graph

I wanted to find some clusters (communities) in a graph. A while ago I figured out how to do this in R, and it turns out my R script still works, hurray! Here goes...

What are communities?
A property common to many types of graphs, including protein-protein interaction graphs, is community structure. A community is often defined as a subset of the vertices in the graph such that connections btween the vertices are denser than connections with the rest of the graph. For example, the graph may consist of one connected component. However, within that connected component, we may have three densely connected subgraphs; these could be said to be three different communities within the graph.

In terms of protein-protein interaction networks, if there are several communities within a connected component (for example, three communities), these could represent three different groups of proteins, where the proteins within one community interact much more with each other than with proteins in the other communities.

By detecting communities within a protein-protein interaction graph, we can detect putative protein complexes, that is, groups of associated proteins that are probably fairly stable over time. In other words, protein complexes can be detected by looking for groups of proteins among which there are many interactions, and where the members of each group have few interactions with other proteins that do not belong to that group.

There are lots of different methods available for detecting communities in a graph, and each method will give slightly different results. That is, the particular method used for detecting communities will decide how you split a connected component into one or more communities.

Input data file
I have an input file 'testset' like this, that lists the node pairs with edges between them:
D915_00007 D915_00016
D915_00016 D915_00006
D915_00006 D915_00015
D915_00015 D915_00005
D915_00005 D915_00019
D915_00019 D915_00034
D915_00034 D915_00050
D915_00050 D915_00049
D915_00049 D915_00111
D915_00111 D915_00048


First install the RBGL library
Then in R I first need to install RBGL:
> source("http://bioconductor.org/biocLite.R") # needed for finding connected components
> biocLite("RBGL")

Then I can type:
> source("Rfunctions.R") # this file is included below
> library("RBGL")
> mygraph <- makeproteingraph("testset") # takes in my input file
> findcommunities(mygraph, 1)    # find communities of size at least 1

This gives me output:
[1] "Community 0 : D915_00049 D915_00050 D915_00111 D915_00048"
[1] "Community 1 : D915_00006 D915_00007 D915_00015 D915_00016"
[1] "There were 2 communities in the input graph"


Note that if you run findcommunities() again and again on the same input graph, it might find slightly different sets of communities each time. This is because it uses a random number generator in the method that it uses for identifying communities, and the random number used will be different each time you run the findcommunities() function, which means that you will get slightly different answers each time. The answers should be very similar, however, but you might see a small difference, for example, a large community found the first time you run it might be split into two smaller communities the second time you run it. 

My file Rfunctions.R with the functions used above

Below I give code for my file Rfunctions.R that contains one function findcommunities() that identifies communities within a graph (or subgraph of a graph). The function findcommunities() uses the function spinglass.community() from the "igraph" library to identify communities in a graph or subgraph. As its arguments (inputs), the findcommunities() function takes the graph/subgraph that we want to find communities in, and the minimum number of vertices that a community must have to be reported. 

# Function to make a graph based on protein-protein interaction data in an input file
makeproteingraph <- function(myfile)
{
   library("graph")
   mytable <- read.table(file(myfile)) # Store the data in a data frame
   proteins1 <- mytable$V1
   proteins2 <- mytable$V2
   protnames <- c(levels(proteins1),levels(proteins2))
   # Find out how many pairs of proteins there are
   numpairs <- length(proteins1)
   # Find the unique protein names:
   uniquenames <-  unique(protnames)
   # Make a graph for these proteins with no edges:
   mygraph <- new("graphNEL", nodes = uniquenames)
   # Add edges to the graph:
   # See http://rss.acs.unt.edu/Rdoc/library/graph/doc/graph.pdf for more examples
   weights <- rep(1,numpairs)
   mygraph2 <- addEdge(as.vector(proteins1),as.vector(proteins2),mygraph,weights)
              
   return(mygraph2)     
}


# Function to find network communities in a graph
findcommunities <- function(mygraph,minsize)
{
   # Load up the igraph library:
   library("igraph")
   # Set the counter for the number of communities:
   cnt <- 0
   # First find the connected components in the graph:
   myconnectedcomponents <- connectedComp(mygraph)
   # For each connected component, find the communities within that connected component:
   numconnectedcomponents <- length(myconnectedcomponents)
   for (i in 1:numconnectedcomponents)
   {
      component <- myconnectedcomponents[[i]]
      # Find the number of nodes in this connected component:
      numnodes <- length(component)
      if (numnodes > 1) # We can only find communities if there is more than one node
      {
         mysubgraph <- subGraph(component, mygraph)
         # Find the communities within this connected component:
         # print(component)
         myvector <- vector()
         mylist <- findcommunities2(mysubgraph,cnt,"FALSE",myvector,minsize)
         cnt <- mylist[[1]]
         myvector <- mylist[[2]]
      }
   }
   print(paste("There were",cnt,"communities in the input graph"))
}


# Function to find network communities in a connected component of a graph   
findcommunities2 <- function(mygraph,cnt,plot,myvector,minsize)
{
   # Find the number of nodes in the input graph
   nodes <- nodes(mygraph)
   numnodes <- length(nodes)

   # Record the vertex number for each vertex name
   myvector <- vector()
   for (i in 1:numnodes)

   {
      node <- nodes[i] # "node" is the vertex name, i is the vertex number
      myvector[`node`] <- i  # Add named element to myvector
   }

   # Create a graph in the "igraph" library format, with numnodes nodes:
   newgraph <- graph_from_graphnel(mygraph)

   # Now find communities in the graph:
   communities <- spinglass.community(newgraph)
   # Find how many communities there are:
   sizecommunities <- communities$csize
   numcommunities <- length(sizecommunities)
   # Find which vertices belong to which communities:
   membership <- communities$membership
   # Get the names of vertices in the graph "newgraph":
   vertexnames <- get.vertex.attribute(newgraph, "name")

# Print out the vertices belonging to each community:
   for (i in 1:numcommunities)
   {
      if (plot == TRUE) { cnt <- cnt + 1 }
      nummembers <- 0
      printout <- paste("Community",cnt,":")
      for (j in 1:length(membership))
      {
         community <- membership[j]
         community <- community + 1
         if (community == i) # If vertex j belongs to the ith community
         {
            vertexname <- vertexnames[j]
            if (plot == FALSE)
            {
               nummembers <- nummembers + 1
               # Print out the vertices belonging to the community
               printout <- paste(printout,vertexname)
            }
            else
            {
               # Colour in the vertices belonging to the community
               myvector[`vertexname`] <- cnt  
            }
         }
      }
      if (plot == FALSE && nummembers >= minsize)
      {
         cnt <- cnt + 1
         print(printout)
      }
   }

   return(list(cnt,myvector))
}