Solved – Understanding random forest, Gini, and KS

Tags: gini, random forest, roc

I'm a machine learning beginner building my first predictive model using random forest.

I have some questions about how to measure how good a model is (the Gini coefficient from the ROC curve, and the KS statistic), and some about the random forest algorithm. Thanks a lot for your help.

1) Which metric is better for comparing models, Gini or KS? For example, I have two versions of my model; one has a higher Gini than the other but a lower KS.

2) Once I run my model, I can see how important each variable I created is using varImpPlot in R. All of them "explain" something, but I struggle to know when my model is optimal. For example, sometimes I remove one of the variables and the model improves, sometimes I remove a combination of two and it improves, but sometimes it gets worse. Is there any way to know when I have reached an "optimal combination" given my set of variables?

Thanks a lot

Best Answer

Measuring the quality of a predictive model depends heavily on the problem you are working on, so it is hard to say in general that one measure of goodness is better than another. However, I think both of your questions would benefit from some cross-validation.

Regarding your first question, you can use the fact that random forests automatically give you an estimate of their out-of-bag (OOB) error. This is essentially an unbiased estimate of how the random forest will perform on unseen data. Thus, instead of trying to pick between your two models on Gini or KS alone, you can pick the model with the lowest out-of-bag prediction error. If you are not satisfied with the bootstrapped out-of-sample data that a random forest uses, you can hold out a portion of totally unseen data and run an additional validation test to choose between the two models. As a side note, once you have chosen the "best" of the two models, you still need a totally fresh testing set so you can estimate the out-of-sample prediction error of the chosen model.
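As a minimal sketch (assuming the randomForest package and a data frame df whose outcome column y is a factor; the names are illustrative), the OOB estimate is available directly from the fitted object:

```r
library(randomForest)

# Assumed setup: df is a data frame and df$y is a factor outcome.
set.seed(42)
fit <- randomForest(y ~ ., data = df, ntree = 500)

# print(fit) reports the OOB error rate; it is also stored in err.rate,
# where the last row is the estimate once all trees have been grown.
print(fit)
fit$err.rate[nrow(fit$err.rate), "OOB"]
```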

For your second question, identifying which variable combination is optimal, again I would recommend cross-validation. Simply train all of your candidate models (different combinations of variables) on the training data set. Then pick the model that performs best on the out-of-bag error estimate or on your validation data set; that is your best guess of which variable set is optimal. Finally, test the optimal combination on an untouched testing set to get an unbiased estimate of its out-of-sample error.
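Here is a hypothetical sketch of that workflow (the column names x1, x2, x3, the outcome y, and the 80/20 split are assumptions, not part of the question):

```r
library(randomForest)

# Assumed setup: df has a factor outcome y and predictors x1, x2, x3.
set.seed(42)
idx   <- sample(nrow(df), 0.8 * nrow(df))
train <- df[idx, ]
test  <- df[-idx, ]

# Candidate variable combinations to compare.
candidate_sets <- list(
  c("x1", "x2", "x3"),
  c("x1", "x3"),
  c("x2", "x3")
)

# Fit one forest per combination and record its OOB error.
oob <- sapply(candidate_sets, function(vars) {
  f <- randomForest(x = train[, vars, drop = FALSE], y = train$y,
                    ntree = 500)
  f$err.rate[nrow(f$err.rate), "OOB"]
})

best_vars <- candidate_sets[[which.min(oob)]]

# Refit the winning combination, then estimate out-of-sample error
# on data the selection step never saw.
best_fit <- randomForest(x = train[, best_vars, drop = FALSE],
                         y = train$y, ntree = 500)
mean(predict(best_fit, test[, best_vars, drop = FALSE]) != test$y)
```

Because the test set played no part in choosing best_vars, that last error rate is the honest number to report.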