Friday 24 March 2017

Hierarchical clustering of a big data set in R

A while ago I wrote some notes on hierarchical clustering using hclust in R. Now I needed to do hierarchical clustering on a big distance matrix, 26104 x 26104 in size. Will it run? It turns out that naive agglomerative hierarchical clustering takes O(n^3) time, and the distance matrix alone needs O(n^2) memory, so it is slow to run and needs lots of memory for big data sets. I found some discussions about this on the web at http://stackoverflow.com/. However, I did get it to run, by requesting 25000 Mbyte (about 25 GB) of RAM! Hurray! And it only took 15 minutes to run!
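To get a feel for why so much RAM is needed, here is a rough back-of-the-envelope sketch in R. It assumes distances are stored as 8-byte doubles, and it uses random toy data and "average" linkage purely for illustration (neither is taken from my real analysis). The memory figures cover only the distance data itself, not any working copies R makes along the way.

```r
# Rough memory estimate for clustering n items with hclust().
# Assumption: each distance is stored as an 8-byte double.
n <- 26104
full_matrix_gb <- n^2 * 8 / 1e9               # full n x n matrix
dist_object_gb <- n * (n - 1) / 2 * 8 / 1e9   # lower triangle only, as in a "dist" object

# The same workflow on toy data (100 random 2-D points).
set.seed(1)
toy <- matrix(rnorm(200), ncol = 2)
d <- dist(toy)                        # "dist" object: lower triangle only
hc <- hclust(d, method = "average")   # linkage method chosen arbitrarily here
groups <- cutree(hc, k = 4)           # cut the tree into 4 clusters
```

So the full 26104 x 26104 matrix is about 5.5 GB and the lower triangle about 2.7 GB before any clustering starts. If the distances are already in a full square matrix, as.dist() converts it to the "dist" form that hclust expects.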
Note, however, that I started with an even bigger data set, which was too big to run hierarchical clustering on directly. I managed to break it down into smaller pieces first, by finding clusters in Python using a community detection algorithm (see finding-communities-in-graph-using.html). Then I was able to run the hierarchical clustering separately on each of the clusters found by Python (the biggest of which was the 26104-element case above).
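The per-cluster step might look something like the sketch below. This is only an illustration under assumptions: `m` is a small stand-in for the big distance matrix, `membership` is a hypothetical vector of community labels (in practice it would be read in from the Python output), and "average" linkage is an arbitrary choice.

```r
# Toy stand-ins for the real inputs (hypothetical names and data).
set.seed(2)
pts <- matrix(rnorm(60), ncol = 2)     # 30 toy points
m <- as.matrix(dist(pts))              # stand-in for the big distance matrix
membership <- rep(1:3, each = 10)      # hypothetical community label per item

# Run hclust separately on the sub-matrix of each community.
trees <- lapply(split(seq_len(nrow(m)), membership), function(idx) {
  hclust(as.dist(m[idx, idx]), method = "average")
})
```

Each element of `trees` is an ordinary hclust result, so it can be passed to plot() or cutree() as usual. Note that hclust needs at least two items, so any single-member communities would have to be skipped before this step.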
