Solved – k-fold cross validation AUC score vs test AUC score

cross-validation, machine-learning, roc, scikit-learn, validation

I've split my data into a training set and a test set (75% and 25% respectively), and then performed 5-fold cross-validation on the training set using GridSearchCV to find the optimal hyperparameters for four classification models (LR, RF, SVM and k-NN). I then obtained the training, validation and test AUC scores for each model.
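For concreteness, my setup is roughly along the lines of the following simplified sketch for one of the models (the parameter grid, random state and variable names here are only illustrative, and `X`, `y` are assumed to be already loaded):

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# 75/25 split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

# 5-fold cross-validated grid search on the training set only
grid = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid={"n_estimators": [100, 300], "max_depth": [3, 5, None]},
    scoring="roc_auc",
    cv=5,
)
grid.fit(X_train, y_train)

best_model = grid.best_estimator_  # refit on the full training set

train_auc = roc_auc_score(y_train, best_model.predict_proba(X_train)[:, 1])
val_auc = grid.best_score_         # mean AUC over the 5 validation folds
test_auc = roc_auc_score(y_test, best_model.predict_proba(X_test)[:, 1])
```

These three numbers correspond to the training, validation and test AUC scores I report below.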

However, I am confused about which AUC score to use when describing the performance of the classifiers and deciding which one is best: should I look at the validation AUC or the test AUC? If it is the former, what is the point of splitting the data into training and test sets in the first place?

There are countless examples online where the data are split, the mean validation AUC and test AUC are reported, and the ROC curve is drawn for the test set, but nobody seems to explain which AUC score one should focus on when choosing the best classification model, and, more importantly, why. Does the test set act as some form of reassurance?

In my situation I obtain the following AUC scores:

SVM training AUC: 0.727
SVM validation AUC: 0.703
SVM test AUC: 0.762

RF training AUC: 1.000
RF validation AUC: 0.791
RF test AUC: 0.625

LR training AUC: 0.776
LR validation AUC: 0.689
LR test AUC: 0.737

k-NN training AUC: 0.895
k-NN validation AUC: 0.792
k-NN test AUC: 0.646

Best Answer

You should pay attention to measures of your models' performance on the test data set.

I think that your question stems from confusion over the purpose of the training, validation and test sets. Two related questions on this site you might look at are "What is the difference between test set and validation set?" and "Difference between training, test and holdout set data mining model building".

For the specifics of your question, it suffices to run through the purpose of each data set.

  1. Training Data Set: this is the data set that you use to build your model – in your case the SVM, RF, LR or k-NN model. We don't simply accept this model, however, because it may be underfitted or overfitted to the training data set.

  2. Validation Data Set: this is the data set that you use to select model parameters. Different models have different hyperparameters that need to be tuned (such as the 'k' in k-NN or the number of trees in RF), and the validation data set is used to select the best values for these parameters (see the sketch after this list). Unfortunately, this means that this data is also used to train/select the model, so performance on the validation data set can't be used to judge the quality of the model, for the same reasons that performance on the training data set can't.

  3. Test Data Set: this is the data set that you use to judge the quality of your model. It has not been seen at any step of the training or model selection process and should provide a good test of your model's ability to capture the relevant patterns in the data while also generalizing to unseen examples.
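To make the three roles concrete, here is a minimal sketch with an explicit three-way split, tuning the 'k' of k-NN on a separate validation set and touching the test set only once at the end. The split proportions, candidate values of k and variable names are assumptions for illustration, and `X`, `y` are assumed to be loaded (in your GridSearchCV workflow, the validation role is played by the held-out fold in each round of cross-validation):

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score

# Carve out a test set first, then split the rest into train/validation.
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=0, stratify=y_tmp)

# 1. Training set: fit each candidate model.
# 2. Validation set: pick the hyperparameter ('k') with the best validation AUC.
best_k, best_val_auc = None, -1.0
for k in [3, 5, 11, 21]:
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    if val_auc > best_val_auc:
        best_k, best_val_auc = k, val_auc

# 3. Test set: evaluate the chosen model once, on data it has never seen.
final_model = KNeighborsClassifier(n_neighbors=best_k).fit(X_train, y_train)
test_auc = roc_auc_score(y_test, final_model.predict_proba(X_test)[:, 1])
```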

To summarize: the test data set does not act as reassurance, it is your evaluation. It is the first time your model is faced with data that was not used to train or select it, and it is how you judge the model's performance.