Outline of clustering technique using Random Forest
A synthetic data set is created by randomly sampling from the data of interest. It serves as the baseline against which the "structureness" or "clustering" in the data of interest is measured.
The real and synthetic data are combined and fed into randomForest() for classification. A distance matrix is calculated from the proximity measure of the resulting random forest, and clustering is then done on that distance matrix using other clustering techniques.
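To make the outline above concrete, here is a minimal sketch of the two bookkeeping steps around the forest: labelling the combined data and turning a proximity matrix into a distance matrix. It is written in plain Python rather than R so that it is self-contained, and it does not grow a forest. The function names are hypothetical, the proximity matrix is assumed to come from whatever classifier you use, and the sqrt(1 - proximity) conversion is an assumption on my part (it is a common choice, not something stated above).

```python
import math

def label_real_and_synthetic(real_rows, synthetic_rows):
    """Stack real and synthetic rows; label 1 = real, 2 = synthetic."""
    rows = list(real_rows) + list(synthetic_rows)
    labels = [1] * len(real_rows) + [2] * len(synthetic_rows)
    return rows, labels

def proximity_to_dissimilarity(proximity, n_real):
    """Keep the real-vs-real block of a proximity matrix and convert it.

    Assumes proximities lie in [0, 1] and that the first n_real rows
    are the real observations. Uses sqrt(1 - proximity) (an assumption).
    """
    return [
        [math.sqrt(max(0.0, 1.0 - proximity[i][j])) for j in range(n_real)]
        for i in range(n_real)
    ]
```

The resulting matrix for the real observations is what would then be handed to a separate clustering method.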
Issues:
1) Sampling technique
The paper by Shi et al. (http://labs.genetics.ucla.edu/horvath/RFclustering/RFclustering/RandomForestHorvath.pdf) describes two sampling techniques: (1) random sampling from the product of the empirical marginal distributions of the variables of the data and (2) random sampling (uniform distribution) from the hyperrectangle containing the data.
2) No. of forests
Shi et al. reported that "RF dissimilarity can vary considerably as a function of the particular realization of the synthetic data". So a number of forests are grown, each on a different realization of the synthetic data, and combined to get the final result.
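For reference, the two sampling schemes from issue 1 and the combining step from issue 2 can be sketched as follows. This is plain Python, not the randomForest package, and it says nothing about what that package actually does. The function names are hypothetical, and averaging the per-forest dissimilarity matrices element-wise is my assumption about what "combined" means.

```python
import random

def sample_marginals(data, n, rng):
    """Scheme (1): draw each variable independently from its observed values."""
    columns = list(zip(*data))
    return [[rng.choice(col) for col in columns] for _ in range(n)]

def sample_hyperrectangle(data, n, rng):
    """Scheme (2): draw each variable uniformly between its observed min and max."""
    columns = list(zip(*data))
    bounds = [(min(col), max(col)) for col in columns]
    return [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n)]

def average_matrices(matrices):
    """Element-wise mean of equally sized matrices (assumed combining rule)."""
    k = len(matrices)
    size = len(matrices[0])
    return [
        [sum(m[i][j] for m in matrices) / k for j in range(size)]
        for i in range(size)
    ]

# Toy data: each call gives a different realization of the synthetic data.
rng = random.Random(0)
real = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
synthetic_marginal = sample_marginals(real, len(real), rng)
synthetic_uniform = sample_hyperrectangle(real, len(real), rng)
```

Scheme (1) only ever reuses values that occur in the data, while scheme (2) can produce any value inside the bounding box. Both break the dependence between variables, which is what the forest is then asked to detect.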
Question:
Which sampling technique does the randomForest() function from the randomForest package use? Also, how many forests are grown?
Best Answer
R and Fortran codes were different at the time of writing. They explicitly state that both methods are available in the R package randomForest, but I couldn't find it. The website implies that the second sampling method is used. The only way to be certain may be to look at the underlying code.