Solved – Random forest clustering

clustering, r, random forest

In my data, the classes were defined by binning a variable into 10 bins. After growing the random forest, I viewed its proximity matrix as the following MDSplot:

[Figure: RF MDSplot]

As can be seen from the plot, all classes overlap in all clusters. I wonder whether it is possible to use this proximity matrix, or any other data from the R randomForest object, to go back to the data and redefine the classes so that they overlap minimally in this MDSplot, i.e. each class would reside in its own cluster.

Best Answer

I know that you asked for R solutions, but in Python, specifically scikit-learn, there is an interesting class, RandomTreesEmbedding, that implements a random forest embedding. It constructs a random forest without class label information. As output you get a new dataset in which your objects are embedded in a binary feature space. This space has one feature for each leaf of each tree in the forest, so it is sparse and can be huge, depending on how many trees you use. A feature is 1 when the object falls in that specific leaf of that specific tree, and 0 otherwise.

The idea is that if two objects consistently fall in the same leaves across the forest, they are likely to be somewhat similar. You can then visualize this new dataset with an MDS plot, for example using the Hamming distance.
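As a rough illustration of the embedding and the MDS view described above, here is a minimal sketch. The data is a synthetic stand-in generated with make_blobs, and the forest settings (100 trees, depth 5) are arbitrary assumptions, not recommendations. The 2-D coordinates are computed with classical MDS written out in NumPy (the helper name classical_mds is made up for this example) rather than with a library MDS routine.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomTreesEmbedding

# Synthetic stand-in data (assumption); replace with your own feature matrix.
X, _ = make_blobs(n_samples=150, centers=3, n_features=5, random_state=0)

# Unsupervised forest: no class labels are passed to fit_transform.
embedder = RandomTreesEmbedding(n_estimators=100, max_depth=5, random_state=0)
leaves = embedder.fit_transform(X)  # sparse 0/1 matrix, one column per leaf

# Pairwise Hamming distances between the binary leaf-indicator rows.
leaf_matrix = leaves.toarray().astype(bool)
dist = squareform(pdist(leaf_matrix, metric="hamming"))


def classical_mds(dist, n_components=2):
    """Classical (Torgerson) MDS on a square distance matrix (hypothetical helper)."""
    n = dist.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (dist ** 2) @ centering  # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_components]  # indices of the largest eigenvalues
    top_vals = np.clip(eigvals[order], 0.0, None)  # guard against tiny negative values
    return eigvecs[:, order] * np.sqrt(top_vals)


coords = classical_mds(dist)  # one 2-D point per object, ready for a scatter plot
```

Each row of leaf_matrix contains exactly one 1 per tree, so two objects that share many leaves end up with a small Hamming distance and land close together in coords.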

This is a non-linear embedding, so the forest may well be able to disentangle your objects and yield a meaningful clustering.

Or, as I am not a fan of clustering in spaces developed for visualization, you can directly apply a clustering algorithm that works with non-Euclidean distances (do not use k-means!) and try to get a more interesting clustering.
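One possible way to do this, sketched under assumptions: average-linkage hierarchical clustering from SciPy, applied directly to the Hamming distances between the leaf-indicator rows. The synthetic data, the forest settings, and the choice of 3 clusters are all placeholders picked for illustration. Any other algorithm that accepts a precomputed distance matrix could be swapped in.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomTreesEmbedding

# Synthetic stand-in data (assumption); replace with your own feature matrix.
X, _ = make_blobs(n_samples=150, centers=3, n_features=5, random_state=0)

# Binary leaf-indicator embedding from an unsupervised forest.
embedder = RandomTreesEmbedding(n_estimators=100, max_depth=5, random_state=0)
leaf_matrix = embedder.fit_transform(X).toarray().astype(bool)

# Condensed vector of pairwise Hamming distances (the form linkage expects).
condensed = pdist(leaf_matrix, metric="hamming")

# Average linkage works with arbitrary dissimilarities, unlike k-means.
tree = linkage(condensed, method="average")

# Cut the dendrogram into at most 3 clusters (3 is an arbitrary choice here).
labels = fcluster(tree, t=3, criterion="maxclust")
```

The labels array assigns each object a cluster id starting at 1. These ids could then replace the original binned classes if the resulting groups look more meaningful.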
