Solved – Why the validation accuracy and AUC are higher than the training accuracy and AUC

Tags: boosting, cross-validation, train, validation

I have a binary classification problem and I use a LightGBM classifier to build my model based on 5 features. I divided my dataset (94 observations) into two parts:

  1. Training dataset: 60 observations
  2. Validation dataset: 34 observations

I use only the training dataset to tune the hyperparameters of the LightGBM classifier, using GridSearchCV with 5-fold cross-validation. Then I evaluate the model's accuracy and AUC on the validation dataset; these are the results:

Accuracy score (train): 0.633

Accuracy score (validation): 0.706

ROC AUC (train): 0.791

ROC AUC (validation): 0.869

As you can see, the AUC on the validation dataset is higher than on the training dataset! This surprises me and I think something is wrong here. I never exposed the validation dataset to GridSearchCV or to any other step of building the classifier. My question: is it plausible to get a higher ROC AUC on the validation dataset than on the training dataset, or is there something obviously wrong that I cannot see? Any idea or suggestion is appreciated.
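For reference, this is roughly the workflow I describe above; the data and the parameter grid below are placeholders (a minimal sketch, not my actual setup):

```python
# Minimal sketch of the described workflow; the data and parameter grid are
# placeholders, not the actual data set or search space.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score, roc_auc_score

X, y = np.random.rand(94, 5), np.random.randint(0, 2, 94)  # placeholder data

# 60 / 34 split, stratified so both parts keep the class proportions
X_train, X_val, y_train, y_val = train_test_split(
    X, y, train_size=60, stratify=y, random_state=0)

# hyperparameter tuning with 5-fold CV on the training part only
param_grid = {"num_leaves": [7, 15, 31], "n_estimators": [50, 100]}
search = GridSearchCV(LGBMClassifier(), param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)

# evaluate the refitted best model on both parts
for name, X_, y_ in [("train", X_train, y_train), ("validation", X_val, y_val)]:
    proba = search.predict_proba(X_)[:, 1]
    print(name, "accuracy:", accuracy_score(y_, search.predict(X_)),
          "ROC AUC:", roc_auc_score(y_, proba))
```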

Update

I used a different random split and these are my results:

Accuracy score (train): 0.767

Accuracy score (validation): 0.706

ROC AUC (train): 0.838

ROC AUC (validation): 0.735

Now it is the opposite: my training ROC AUC and accuracy are higher than on the validation dataset. Any idea or suggestion? Do these results indicate a huge instability in my model?

Best Answer

There is no discrepancy here: the uncertainty due to the small number of tested cases is more than sufficient to explain the situation.

As a quick check of the situation, we can calculate binomial confidence intervals for accuracy based on the number of tested cases (ignoring for the moment that there may be additional random uncertainty due to model instability):

  - training: $63.3 \% = \frac{38}{60}$, 95% confidence interval 50 - 74 %
  - validation: $70.6 \% = \frac{24}{34}$, 95% confidence interval 54 - 84 %

With that overlap in the confidence intervals, there is no way to argue that the validation set accuracy is actually higher than the training set accuracy.
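These intervals can be reproduced approximately with a few lines of code; the sketch below assumes statsmodels is available and uses the Wilson interval (the exact flavour of interval hardly matters at this sample size):

```python
# Back-of-the-envelope check of the binomial confidence intervals quoted above.
from statsmodels.stats.proportion import proportion_confint

for name, correct, n in [("training", 38, 60), ("validation", 24, 34)]:
    lo, hi = proportion_confint(correct, n, alpha=0.05, method="wilson")
    print(f"{name}: {correct}/{n} = {correct/n:.1%}, 95% CI {lo:.0%} - {hi:.0%}")
```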

Note that this also means that your sample size is too small to make meaningful comparisons of the predictive performance in order to select the best model, unless the difference between the models is huge. To get an idea of the sample sizes needed, have a look at our paper:
Beleites, C.; Neugebauer, U.; Bocklitz, T.; Krafft, C. and Popp, J.: Sample size planning for classification models. Anal Chim Acta, 2013, 760, 25-33. DOI: 10.1016/j.aca.2012.11.007 (accepted manuscript on arXiv: 1211.1323)

Nevertheless, selecting the apparently best model will tend to produce overfitting, as your decision is heavily influenced by the random uncertainty of your test results: if you take many different models that truly have the same predictive performance and measure their performance only roughly, you will still observe better and worse performance estimates. But picking among them cannot produce any truly better model. Almost the same happens if your models differ so little in their predictive performance that you cannot reliably detect the difference given your sample size.
This is overfitting in the very sense of the word - in fact, it is what I'd consider one of the textbook situations of overfitting.

What you can and should do, though, is to measure model stability. That can be done even if the random uncertainty of the overall accuracy etc. is large due to your small sample size:
Beleites, C. & Salzer, R.: Assessing and improving the stability of chemometric models in small sample size situations. Anal Bioanal Chem, 2008, 390, 1261-1271. DOI: 10.1007/s00216-007-1818-6

This would require changing the outer train/test split into a repeated cross validation. As you did the train/test split yourself (as opposed to, say, a clinical study where patients were assigned to training and test arms beforehand), repeated cross validation is much better here anyway, since it leads to each and every sample being tested - and thus helps to narrow the random uncertainty: the same 70.6 % accuracy observed in a cross validation comprising all 94 cases would have a 95% confidence interval of 61 - 79 %.
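A minimal sketch of such a repeated cross validation, assuming `X` and `y` hold all 94 cases (these names, the number of repetitions and the voting threshold are illustrative choices; in practice you would pass the whole GridSearchCV object as the model, so that the tuning is repeated inside every surrogate training set, i.e. nested cross validation):

```python
# Repeated 5-fold cross validation over the full data set; the spread of the
# per-repetition scores and the disagreement between repetitions for the same
# case indicate model (in)stability.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.metrics import roc_auc_score

model = LGBMClassifier()   # or the GridSearchCV object, for a nested cross validation

n_repeats = 20
aucs = []
predicted_class = np.empty((len(y), n_repeats))
for r in range(n_repeats):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=r)
    proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    aucs.append(roc_auc_score(y, proba))
    predicted_class[:, r] = proba >= 0.5

print("ROC AUC over %d repetitions: %.3f +/- %.3f"
      % (n_repeats, np.mean(aucs), np.std(aucs)))

# Stability check: a case whose predicted class changes between repetitions is
# predicted unstably - only the training split changed, not the case itself.
unstable = (predicted_class.std(axis=1) > 0).mean()
print("fraction of cases with unstable class prediction: %.2f" % unstable)
```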

Once you know whether your models are stable or not, you can choose a strategy to improve them based on this information. E.g. if they turn out to be unstable, you may consider moving from boosting to more restrictive models, or even to bagging.
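For illustration only (this is a generic sketch, not a recommendation specific to your data), bagging a deliberately restricted base learner could be evaluated with the same repeated cross validation idea:

```python
# Bagging shallow trees as a variance-reducing alternative to boosting;
# X, y as above.
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

bagged = BaggingClassifier(DecisionTreeClassifier(max_depth=2),
                           n_estimators=100, random_state=0)
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
scores = cross_val_score(bagged, X, y, cv=cv, scoring="roc_auc")
print("bagged shallow trees ROC AUC: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```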


There are approaches to improve the grid search selection procedure, e.g. a better behaved figure of merit (a proper scoring rule) and choosing the least complex model whose performance is still compatible with the apparently best performance once the random uncertainty of that estimate is taken into account (a possible implementation is sketched below) - but they cannot work miracles. Still, a back-of-the-envelope calculation with your accuracy suggests that such a least complex model would perform only slightly better than guessing, which is most probably not an acceptable result for your application.
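For concreteness, one way to combine these two ideas is log loss as the proper scoring rule together with a one-standard-error style rule that picks the least complex candidate still within the uncertainty of the best score. The grid, the data names and the one-standard-error heuristic itself are illustrative choices, not the procedure from the papers above:

```python
# Grid search scored with a proper scoring rule, then selection of the least
# complex candidate whose mean CV score reaches "best score minus its SEM".
# Assumes X_train, y_train as in the question's setup.
import numpy as np
from lightgbm import LGBMClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"num_leaves": [3, 7, 15, 31]}   # ordered from least to most complex
search = GridSearchCV(LGBMClassifier(), param_grid, cv=5, scoring="neg_log_loss")
search.fit(X_train, y_train)

means = search.cv_results_["mean_test_score"]            # higher = better
sems = search.cv_results_["std_test_score"] / np.sqrt(search.n_splits_)
threshold = means.max() - sems[means.argmax()]
least_complex = int(np.argmax(means >= threshold))       # first candidate reaching it
print("chosen num_leaves:", param_grid["num_leaves"][least_complex])
```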

I think it is very likely that even with a more stable training algorithm than boosting and a proper scoring rule as your figure of merit, you will find that your data set just does not have the information content required to allow data-driven tuning of model parameters. Thus, you may be better off overall using either a modeling algorithm that does not need tuning or one that can be tuned via your expert knowledge about your data and the algorithm.
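For instance, a model without hyperparameters (LDA is just one illustrative choice here; whether it suits your data is a separate question) can be fed all 94 cases directly through the repeated cross validation from above:

```python
# "No data-driven tuning" route: LDA has no hyperparameters to search,
# so the full data set (X, y as above) goes straight into repeated CV.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=20, random_state=0)
scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=cv, scoring="roc_auc")
print("LDA ROC AUC: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```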