How to perform unsupervised Random Forest classification using Breiman’s code

classification, machine learning, random forest

I am working with Breiman's random forest code (http://stat-www.berkeley.edu/users/breiman/RandomForests/cc_manual.htm#c2) for classification of satellite data (supervised learning). I am using a training dataset and a test dataset, each with 2000 samples and 10 variables. The data is classified into two classes, A and B. In supervised learning mode, the algorithm performs well, with a very low classification error (<2%). Now I want to try unsupervised classification, with no class labels in the test dataset, and see how well the algorithm predicts the classes. Is there a way to implement unsupervised classification using Breiman's code? Will the error from this method be higher than that of supervised classification?
The data description and run parameter settings used in the algorithm are given below:

DESCRIBE DATA
1 mdim=10,ntrain=2000,nclass=2,maxcat=1,
1 ntest=2000,labelts=1,labeltr=1,

SET RUN PARAMETERS
2 mtry0=3,ndsize=1,jbt=500,look=100,lookcls=1,
2 jclasswt=0,mdim2nd=0,mselect=0,

Best Answer

Given that your model exhibits good accuracy, you can simply use it to predict the class labels of the records in the unlabeled dataset. However, you cannot evaluate its performance on unlabeled data, because there are no true labels to compare the predictions against.

Note that you should assess the quality of your model on the labeled data by cross-validation. Checking the training error rate alone is not enough.
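As a rough illustration of that procedure, here is a minimal k-fold cross-validation sketch in plain Python. The nearest-centroid classifier and the synthetic two-class data are stand-ins chosen only to keep the example self-contained; they are assumptions, not Breiman's random forest or your satellite data. The helper names (`k_fold_indices`, `cross_val_error`, `fit_centroids`, `predict_centroid`) are hypothetical.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and split them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_val_error(X, y, fit, predict, k=5, seed=0):
    """Mean misclassification rate on the held-out fold, averaged over k folds."""
    errors = []
    for held_out in k_fold_indices(len(X), k, seed):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        # Train only on the k-1 remaining folds.
        model = fit([X[i] for i in train], [y[i] for i in train])
        # Score only on cases the model has never seen.
        wrong = sum(predict(model, X[i]) != y[i] for i in held_out)
        errors.append(wrong / len(held_out))
    return sum(errors) / k


# Stand-in classifier (nearest class centroid). This is NOT a random forest;
# replace fit/predict with wrappers around the forest you actually use.
def fit_centroids(X, y):
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def predict_centroid(model, row):
    # Return the label whose centroid is closest in squared Euclidean distance.
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(row, model[c])))


# Synthetic labeled data (assumption): two classes, 10 variables each.
rng = random.Random(1)
X = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(100)]
X += [[rng.gauss(3, 1) for _ in range(10)] for _ in range(100)]
y = ["A"] * 100 + ["B"] * 100

cv_error = cross_val_error(X, y, fit_centroids, predict_centroid, k=5)
print(f"5-fold cross-validated error: {cv_error:.3f}")
```

The point of the structure is that every labeled case is scored exactly once, by a model that was not trained on it, so the averaged error is an estimate of how the classifier behaves on new data.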

If your model is not accurate enough, you might consider semi-supervised learning, in which the unlabeled data is used to improve the quality of the model via inductive learning. The accuracy should still be computed by cross-validation on your labeled data.

Have a look at [Criminisi et al., Decision Forests: A Unified Framework for Classification, Regression, Density Estimation, Manifold Learning and Semi-Supervised Learning], Chapter 7 on semi-supervised learning and Section 7.4 on induction with semi-supervised learning.
