Solved – Unsupervised Clustering using randomForest

clustering, r, random forest, unsupervised learning

Outline of clustering technique using Random Forest

A synthetic dataset is created by randomly sampling from the data of interest. It serves as a baseline against which the amount of structure, or clustering, in the real data is measured.
The real and synthetic data are combined and fed into randomForest() as a two-class classification problem. A distance matrix is then calculated from the proximity measure of the resulting forest, and the actual clustering is carried out on that distance matrix with standard clustering techniques.
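A minimal sketch of that pipeline in R, assuming numeric data (iris is used purely as a placeholder, and k = 3 clusters is an arbitrary choice):

    library(randomForest)

    ## Features only; iris stands in for "the data of interest".
    x <- iris[, 1:4]

    set.seed(1)
    ## With no response supplied, randomForest() runs in unsupervised mode:
    ## a synthetic class is generated internally and the forest classifies
    ## real vs. synthetic, returning proximities among the real observations.
    rf <- randomForest(x, proximity = TRUE, ntree = 2000)

    ## Turn proximities into a dissimilarity and cluster it with any
    ## standard method (hierarchical clustering here).
    d  <- as.dist(1 - rf$proximity)
    hc <- hclust(d, method = "average")
    clusters <- cutree(hc, k = 3)   # k = 3 is an illustrative choice
    table(clusters, iris$Species)   # sanity check against the known labels

(Shi et al. work with sqrt(1 - proximity) rather than 1 - proximity; either version can be passed to hclust(), pam(), and so on.)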

Issues:

1) Sampling technique

The paper by Shi et al. (http://labs.genetics.ucla.edu/horvath/RFclustering/RFclustering/RandomForestHorvath.pdf)
describes two sampling techniques: (1) random sampling from the product of the empirical marginal distributions of the variables, and (2) uniform random sampling from the hyperrectangle containing the data.
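For concreteness, the two schemes can be sketched roughly as follows. These helper functions are my own illustration written from the paper's description, not code taken from the randomForest package, and numeric columns are assumed:

    ## (1) Sample each variable independently from its empirical marginal
    ##     distribution: marginals are preserved, dependence between
    ##     variables is destroyed.
    synth_marginal <- function(x) {
      as.data.frame(lapply(x, function(col) sample(col, length(col), replace = TRUE)))
    }

    ## (2) Sample uniformly from the hyperrectangle spanned by the data:
    ##     each variable is drawn from Uniform(min, max) of that variable.
    synth_uniform <- function(x) {
      as.data.frame(lapply(x, function(col) runif(length(col), min(col), max(col))))
    }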

2) Number of forests

Shi et al. report that "RF dissimilarity can vary considerably as a function of the particular realization of the synthetic data". So a number of forests are grown and their results combined to obtain the final clustering.
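One way to implement that averaging (a sketch only; the numbers of forests and trees below are arbitrary choices, not the paper's recommendation):

    library(randomForest)

    x <- iris[, 1:4]          # placeholder data
    n_forests <- 5            # arbitrary; Shi et al. give a rule of thumb

    ## Each unsupervised forest sees a different synthetic realization;
    ## average the resulting dissimilarities before clustering.
    d_sum <- matrix(0, nrow(x), nrow(x))
    for (i in seq_len(n_forests)) {
      set.seed(i)
      rf <- randomForest(x, proximity = TRUE, ntree = 1000)
      d_sum <- d_sum + (1 - rf$proximity)
    }
    d_avg <- as.dist(d_sum / n_forests)

    hc <- hclust(d_avg, method = "average")
    clusters <- cutree(hc, k = 3)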

Question:

Which sampling technique does the randomForest() function from the randomForest package use? Also, how many forests are grown?

Best Answer

  • I've read that article before. It's a frustrating article that seemed to me to be misleading. Perhaps the R and Fortran code were different at the time of writing. The authors explicitly state that both methods are available in the R package randomForest, but I couldn't find that option. The website implies that the second sampling method is used. The only way to be certain may be to look at the underlying code.
  • The issue of the number of forests is also poorly dealt with. They use some form of cross-validation - if I remember right - to estimate the number of forests and then provide a rule of thumb based on it. But (1) how stable the synthetic dataset is clearly depends on the starting sample size, and (2) you don't - as they admit - have to build multiple forests; you can instead create a sufficiently large synthetic dataset. The question then is how large a synthetic dataset you need.
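If you want to avoid relying on the package's internal behaviour altogether, you can build the synthetic class explicitly and run an ordinary two-class classification. The sketch below uses the product-of-marginals scheme with more synthetic than real rows; the 5x factor is an assumption for illustration, not a recommendation:

    library(randomForest)

    x <- iris[, 1:4]                 # placeholder for the real data
    n <- nrow(x)
    oversample <- 5                  # assumed size of the synthetic class (5x the real data)

    ## Synthetic class: sample each column independently from its marginal.
    synth <- as.data.frame(lapply(x, function(col)
      sample(col, oversample * n, replace = TRUE)))

    dat <- rbind(x, synth)
    y   <- factor(c(rep("real", n), rep("synthetic", oversample * n)))

    set.seed(1)
    rf <- randomForest(dat, y, proximity = TRUE, ntree = 2000)

    ## Keep only the proximities among the real observations before clustering.
    d  <- as.dist(1 - rf$proximity[1:n, 1:n])
    hc <- hclust(d, method = "average")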