I wanted to find some clusters (communities) in a graph. A while ago I figured out how to do this in R, and it turns out my R script still works, hurray! Here goes...
What are communities?
A property common to many types of graphs, including protein-protein interaction graphs,
is community structure. A community is often defined as a subset of the
vertices in the graph such that connections btween the vertices are denser than connections
with the rest of the graph. For example, the graph may consist of one
connected component. However, within that connected component, we may have three densely
connected subgraphs; these could be said to be three different communities within the graph.
In terms of protein-protein interaction networks,
if there are several communities within a connected component (for
example, three communities), these could represent three different
groups of proteins, where the proteins within one community interact much more with each other
than with proteins in the other communities.
By detecting communities within a protein-protein
interaction graph, we can detect putative protein complexes, that is, groups of associated
proteins that are probably fairly stable over time. In other words, protein complexes can be
detected by looking for groups of proteins among which there are many interactions, and where
the members of each group have few interactions with other proteins that do not belong to that
group.
There are lots of different methods available for detecting communities in a graph, and each
method will give slightly different results. That is, the particular method used for detecting communities will
decide how you split a connected component into one or more communities.
Input data file
I have an input file 'testset' like this, that lists the node pairs with edges between them:
D915_00007 D915_00016
D915_00016 D915_00006
D915_00006 D915_00015
D915_00015 D915_00005
D915_00005 D915_00019
D915_00019 D915_00034
D915_00034 D915_00050
D915_00050 D915_00049
D915_00049 D915_00111
D915_00111 D915_00048
First install the RBGL library
Then in R I first need to install RBGL:
> source("http://bioconductor.org/biocLite.R") # needed for finding connected components
> biocLite("RBGL")
Then I can type:
> source("Rfunctions.R") # this file is included below
> library("RBGL")
> mygraph <- makeproteingraph("testset") # takes in my input file
> findcommunities(mygraph, 1) # find communities of size at least 1
This gives me output:
[1] "Community 0 : D915_00049 D915_00050 D915_00111 D915_00048"
[1] "Community 1 : D915_00006 D915_00007 D915_00015 D915_00016"
[1] "There were 2 communities in the input graph"
Note that if you run findcommunities() again and
again on the same input graph, it might find slightly different sets of communities each time. This is because
it uses a random number generator in the method that it uses for identifying communities, and the random
number used will be different each time you run the findcommunities() function, which means that
you will get slightly different answers each time. The answers should be very similar, however, but you
might see a small difference, for example, a large community found the first time you run it
might be split into two smaller communities the second time you run it.
My file Rfunctions.R with the functions used above
Below I give code for my file Rfunctions.R that contains one function findcommunities()
that identifies communities within a graph (or subgraph of a graph).
The function findcommunities() uses the function spinglass.community() from the "igraph"
library to identify communities in a graph or subgraph.
As its arguments (inputs), the findcommunities() function takes the graph/subgraph that we want to find
communities in, and the minimum number of vertices that a community must have to be reported.
# Function to make a graph based on protein-protein interaction data in an input file
makeproteingraph <- function(myfile)
{
library("graph")
mytable <- read.table(file(myfile)) # Store the data in a data frame
proteins1 <- mytable$V1
proteins2 <- mytable$V2
protnames <- c(levels(proteins1),levels(proteins2))
# Find out how many pairs of proteins there are
numpairs <- length(proteins1)
# Find the unique protein names:
uniquenames <- unique(protnames)
# Make a graph for these proteins with no edges:
mygraph <- new("graphNEL", nodes = uniquenames)
# Add edges to the graph:
# See http://rss.acs.unt.edu/Rdoc/library/graph/doc/graph.pdf for more examples
weights <- rep(1,numpairs)
mygraph2 <- addEdge(as.vector(proteins1),as.vector(proteins2),mygraph,weights)
return(mygraph2)
}
# Function to find network communities in a graph
findcommunities <- function(mygraph,minsize)
{
# Load up the igraph library:
library("igraph")
# Set the counter for the number of communities:
cnt <- 0
# First find the connected components in the graph:
myconnectedcomponents <- connectedComp(mygraph)
# For each connected component, find the communities within that connected component:
numconnectedcomponents <- length(myconnectedcomponents)
for (i in 1:numconnectedcomponents)
{
component <- myconnectedcomponents[[i]]
# Find the number of nodes in this connected component:
numnodes <- length(component)
if (numnodes > 1) # We can only find communities if there is more than one node
{
mysubgraph <- subGraph(component, mygraph)
# Find the communities within this connected component:
# print(component)
myvector <- vector()
mylist <- findcommunities2(mysubgraph,cnt,"FALSE",myvector,minsize)
cnt <- mylist[[1]]
myvector <- mylist[[2]]
}
}
print(paste("There were",cnt,"communities in the input graph"))
}
# Function to find network communities in a connected component of a graph
findcommunities2 <- function(mygraph,cnt,plot,myvector,minsize)
{
# Find the number of nodes in the input graph
nodes <- nodes(mygraph)
numnodes <- length(nodes)
# Record the vertex number for each vertex name
myvector <- vector()
for (i in 1:numnodes)
{
node <- nodes[i] # "node" is the vertex name, i is the vertex number
myvector[`node`] <- i # Add named element to myvector
}
# Create a graph in the "igraph" library format, with numnodes nodes:
newgraph <- graph_from_graphnel(mygraph)
# Now find communities in the graph:
communities <- spinglass.community(newgraph)
# Find how many communities there are:
sizecommunities <- communities$csize
numcommunities <- length(sizecommunities)
# Find which vertices belong to which communities:
membership <- communities$membership
# Get the names of vertices in the graph "newgraph":
vertexnames <- get.vertex.attribute(newgraph, "name")
# Print out the vertices belonging to each community:
for (i in 1:numcommunities)
{
if (plot == TRUE) { cnt <- cnt + 1 }
nummembers <- 0
printout <- paste("Community",cnt,":")
for (j in 1:length(membership))
{
community <- membership[j]
community <- community + 1
if (community == i) # If vertex j belongs to the ith community
{
vertexname <- vertexnames[j]
if (plot == FALSE)
{
nummembers <- nummembers + 1
# Print out the vertices belonging to the community
printout <- paste(printout,vertexname)
}
else
{
# Colour in the vertices belonging to the community
myvector[`vertexname`] <- cnt
}
}
}
if (plot == FALSE && nummembers >= minsize)
{
cnt <- cnt + 1
print(printout)
}
}
return(list(cnt,myvector))
}
No comments:
Post a Comment