Machine Learning – How to Assess Overfitting in Machine Learning Models

machine learning, random forest

This is a follow-up to the question I posted earlier.

I am assessing two RF models that were generated using two different sets of features:

NF – test accuracy > training accuracy (500 features)

HF – test accuracy < training accuracy (125 features)

Testing and training are done on independent data sets, and the accuracy is the average over 5-fold cross-validation. The only difference between the models is the number of features. I am afraid one of the models may be overfitting (it is not clear to me which one, because I used an independent dataset and k-fold cross-validation). I would like to know what standard methods (tools/libraries) can be used to assess overfitting.
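For concreteness, this is roughly the kind of comparison I am making: a minimal sketch (assuming scikit-learn and a synthetic stand-in for the 500-feature "NF" set) that records per-fold training and test accuracy for an RF model.

```python
# Sketch: compare per-fold training vs test accuracy for an RF model.
# The dataset here is synthetic and only illustrates the procedure.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Hypothetical stand-in for one feature set (e.g. the 500-feature "NF" set).
X, y = make_classification(n_samples=1000, n_features=500,
                           n_informative=50, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
cv_results = cross_validate(rf, X, y, cv=5, scoring="accuracy",
                            return_train_score=True)

print("Train accuracy per fold:", cv_results["train_score"])
print("Test  accuracy per fold:", cv_results["test_score"])
print("Mean train/test gap:",
      cv_results["train_score"].mean() - cv_results["test_score"].mean())
```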

Best Answer

This result does not mean that you have overfitting.

First of all, CV is more reliable than a single test set -- you can have bad (or good) luck in selecting the test set, which results in a pessimistic (or optimistic) bias with respect to the true accuracy. CV smooths this problem by repeating the test-set selection. What makes a single test set even less reliable is that RF is a stochastic algorithm, so two runs with different seeds will give you different test accuracies, and that difference may be even bigger than the difference between the CV and test estimates.
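To see how much a single held-out estimate can move just from RF's randomness, here is a small sketch (same synthetic data as above, only the RF seed changes between runs):

```python
# Sketch: spread of test accuracy across RF seeds on a fixed train/test split.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=500,
                           n_informative=50, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

scores = []
for seed in range(10):  # identical data, only the RF seed varies
    rf = RandomForestClassifier(n_estimators=200, random_state=seed)
    rf.fit(X_tr, y_tr)
    scores.append(rf.score(X_te, y_te))

print("Test accuracy across seeds: mean=%.3f, std=%.3f"
      % (np.mean(scores), np.std(scores)))
```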

Second, you can use the standard deviation of accuracy across the CV folds to test whether (see the sketch after this list):

  • your CV accuracy really differs from the test-set accuracy, and
  • one of the feature sets you used is really better than the other.
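One simple way to do the second check is to score both feature sets on identical folds and compare the per-fold accuracies; the paired t-test below is my own assumption about how to formalize "really better", not something prescribed above.

```python
# Sketch: compare two feature sets using per-fold CV accuracies on identical folds.
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

X, y = make_classification(n_samples=1000, n_features=500,
                           n_informative=50, random_state=0)
X_hf = X[:, :125]  # hypothetical 125-feature "HF" subset of the same data

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # same folds for both sets
rf = RandomForestClassifier(n_estimators=200, random_state=0)

scores_nf = cross_val_score(rf, X, y, cv=cv, scoring="accuracy")
scores_hf = cross_val_score(rf, X_hf, y, cv=cv, scoring="accuracy")

print("NF: %.3f +/- %.3f" % (scores_nf.mean(), scores_nf.std()))
print("HF: %.3f +/- %.3f" % (scores_hf.mean(), scores_hf.std()))
print("Paired t-test p-value:", ttest_rel(scores_nf, scores_hf).pvalue)
```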