Solved – Should I use training or testing AUC for selecting the best classifier?

classification, cross-validation, modeling, predictive-models, regression

I am using 10-fold cross-validation to build a classifier (logistic regression). For a single data set (~2000 rows), I randomly hold out 10%, then run 10-fold CV on the remaining 90% over a range of $\lambda$ (ridge) and $\alpha$ (elastic net) values. I repeat this model-building procedure several times (say 5 or 10), each time randomly selecting a different holdout set for testing. Here is a typical run for one of my models, with training and testing AUC (a sketch of this setup is shown after the table):

trainAUC            testAUC
0.7789858700489541  0.614762386248736
0.7762811027773526  0.6525764895330113
0.7744834303471625  0.6282312925170068
0.7710854322029923  0.6379084967320261
0.7703260594826858  0.7139756944444444
0.7678740678991903  0.650191570881226
0.7590972626674432  0.7620200622621931
0.7571686726448225  0.750197628458498
0.7492527543821031  0.58
0.7335912555731339  0.7116920842411039
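For context, here is a minimal sketch of the setup described above using scikit-learn; the synthetic data from `make_classification`, the `Cs`/`l1_ratios` grids, and the number of repeats are placeholder assumptions, not the actual data or settings:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ~2000-row data set described above.
X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

for seed in range(10):  # 10 repeats, each with a different random holdout
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.10, stratify=y, random_state=seed)

    # 10-fold CV over C (inverse of lambda) and l1_ratio (elastic-net alpha).
    model = LogisticRegressionCV(
        Cs=10, l1_ratios=[0.0, 0.25, 0.5, 0.75, 1.0],
        penalty="elasticnet", solver="saga",
        cv=10, scoring="roc_auc", max_iter=5000)
    model.fit(X_tr, y_tr)

    train_auc = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
    test_auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    print(f"{train_auc:.4f}  {test_auc:.4f}")
```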

You can see that the training AUC is very consistent, but the testing AUC varies widely from a low of 0.58 to a high of 0.76. This raises a few questions in my mind:

1) Is the high variance in the test AUC simply due to the randomness of which holdout data are selected?

2) If I were forced to select a single model, should I select the model with the highest training AUC or test AUC?

3) Would it make sense to create an ensemble classifier which uses each model to make predictions and then averages the predictions?
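On question 3, here is a hedged sketch of what averaging the predictions could look like; `models` and `X_new` are hypothetical stand-ins for the fitted models from each repeat and a fresh data set:

```python
import numpy as np

def ensemble_predict_proba(models, X_new):
    """Average the positive-class probability over a list of fitted models."""
    probs = np.column_stack([m.predict_proba(X_new)[:, 1] for m in models])
    return probs.mean(axis=1)
```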

Note that I am not simply asking for the definition of cross-validation. I know what it is, and I am using it correctly. This is more about model comparison, not parameter selection.

Best Answer

2) If I were forced to select a single model, should I select the model with the highest training AUC or test AUC?

How can you select among these models when it is the same parameter set and only the iteration differs? The spread you see is just the (high) variance of the estimate.

If you run repeated 10-fold CV ("5 or 10 times") and get different AUC values for the same parameter set, then a fair estimate is the worst outcome. Cross-validation is in itself already a pessimistic estimate of the model trained on the entire data set that was used for CV. Report 0.58 (the lowest test AUC); selecting the best training or test AUC is probably over-optimistic.
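A minimal sketch of this repeated-CV view, assuming scikit-learn and one fixed parameter setting (the estimator and synthetic data below are placeholders); it reports the spread of test AUCs so the pessimistic, worst-case value is visible:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Placeholder data and a single fixed parameter setting.
X, y = make_classification(n_samples=2000, n_features=30, random_state=0)
estimator = LogisticRegression(C=1.0, max_iter=5000)

# 10 repeats of 10-fold CV gives 100 test-fold AUCs for the same model.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
aucs = cross_val_score(estimator, X, y, scoring="roc_auc", cv=cv)
print(f"worst {aucs.min():.3f}  mean {aucs.mean():.3f}  best {aucs.max():.3f}")
```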

If 0.58 is not good enough, then the model must be made more robust to noise, trading some extra bias for lower variance (for example, stronger regularization); that also makes model selection easier. At some point you need a fresh hold-out set for the final test, since repeatedly re-running the procedure means you are implicitly optimizing your test AUC.
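As a sketch of that last point, one could reserve a final hold-out slice before any tuning and touch it exactly once at the end, while restricting the search to stronger regularization; the split size and the $C$ grid below are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=30, random_state=0)

# Reserve a final hold-out before any tuning; it is used exactly once below.
X_dev, X_final, y_dev, y_final = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=42)

# Restrict the search to fairly strong L2 penalties (small C = large lambda),
# accepting a little bias in exchange for less variance across repeats.
model = LogisticRegressionCV(Cs=np.logspace(-3, 0, 10), cv=10,
                             scoring="roc_auc", max_iter=5000)
model.fit(X_dev, y_dev)

print("final hold-out AUC:",
      roc_auc_score(y_final, model.predict_proba(X_final)[:, 1]))
```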
