Solved – What to do when validation accuracy is high but test accuracy is low in research

cross-validation, machine-learning, reproducible-research

I have a specific question about validation in machine learning research.

As we know, the standard machine learning regime asks researchers to train their models on the training data, choose among candidate models using the validation set, and report accuracy on the test set. In a very rigorous study, the test set would be used only once. However, this is rarely the reality in research, because we have to keep improving performance until the test accuracy beats the state-of-the-art results before we can publish (or even submit) a paper.
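For concreteness, here is a minimal sketch of that regime, assuming scikit-learn, a synthetic dataset, and SVMs with different regularization strengths as the candidate models (none of this is from my actual experiments): candidates are compared on the validation set, and the test set is evaluated exactly once at the end.

```python
# Sketch of the train / validation / test regime described above.
# The dataset and the candidate models are placeholders for illustration.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# 60/20/20 train / validation / test split.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Candidate models differ only in the regularization parameter C here.
candidates = {C: SVC(C=C).fit(X_train, y_train) for C in (0.1, 1.0, 10.0)}

# Choose by validation accuracy...
best_C, best_model = max(candidates.items(), key=lambda kv: kv[1].score(X_val, y_val))

# ...and report test accuracy exactly once.
print(f"chosen C={best_C}, test accuracy={best_model.score(X_test, y_test):.3f}")
```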

Now comes the problem. Let's say 50% is the current state-of-the-art result, and my model can generally achieve 50–51% test accuracy, which is better on average.

However, the model with my best validation accuracy (52%) yields a very low test accuracy, e.g., 49%. If I can't further improve the validation accuracy, which seems hopeless, I have to report 49% as my overall performance. This really discourages me from studying the problem, yet it doesn't affect my peers, because they never see the 52% validation accuracy, which I believe is an outlier.

So, what do people usually do in their research?

P.S. k-fold cross-validation is of no help, because the same situation may still happen.

Best Answer

By definition, when training accuracy (or whatever metric you are using) is higher than your testing accuracy, you have an overfit model. In essence, your model has learned particulars that help it perform better on your training data but that do not apply to the larger data population, and therefore it performs worse on held-out data.

I'm not sure why you say k-fold cross-validation wouldn't be helpful. Its purpose is to help you avoid overfitting your models. Perhaps you don't have enough data? Addressing this point is important, especially if you are going to have to defend your research, since such cross-validation methods are highly recommended.

You say you aren't able to use the test set just once (again, I assume because of a small sample size?). In my experience, the most common path is k-fold cross-validation of your model. Let's take an example with 10-fold CV for a sample size of 100, and assume your classification problem is binary to keep the calculations simple. I therefore split my data into 10 different folds. I then fit my model on 9/10 of the folds and predict the 1/10 I left out. For this first run, the resulting confusion matrix is:

             Predicted 0   Predicted 1
Actual 0          4             1
Actual 1          2             3

I then repeat this analysis with the next 1/10 fold left out and train on the other 9/10, which gives my next confusion matrix. Once completed, I have 10 confusion matrices. I then sum these matrices (so that all 100 samples have been predicted) and report my statistics (accuracy, PPV, F1-score, kappa, etc.); a code sketch of this pooled procedure follows the list below. If your accuracy is not where you want it to be, there are several possibilities:

  1. Your model needs to be improved (change its parameters).
  2. You may need to try a different machine learning algorithm (not all algorithms are created equal).
  3. You need more data (a subtle relationship is difficult to find).
  4. You may need to transform your data (depending on the algorithm used).
  5. There may be no relationship between your dependent and independent variables.
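To make the pooled-confusion-matrix procedure above concrete, here is a minimal sketch in Python. The synthetic dataset, the choice of logistic regression as the model, and the use of scikit-learn are assumptions for illustration, not part of the procedure itself.

```python
# Sketch: 10-fold CV where the fold-level confusion matrices are summed
# and statistics are reported from the pooled matrix.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import confusion_matrix

# Hypothetical binary data with n = 100 samples, matching the worked example.
X, y = make_classification(n_samples=100, n_features=10, random_state=0)

pooled = np.zeros((2, 2), dtype=int)   # running sum of the 10 fold-level confusion matrices
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

for train_idx, test_idx in cv.split(X, y):
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])        # fit on 9/10 of the data
    y_pred = model.predict(X[test_idx])          # predict the held-out 1/10
    pooled += confusion_matrix(y[test_idx], y_pred, labels=[0, 1])

# All 100 samples have now been predicted exactly once; report statistics
# from the pooled matrix (rows = actual, columns = predicted).
tn, fp, fn, tp = pooled.ravel()
accuracy = (tp + tn) / pooled.sum()
ppv = tp / (tp + fp)                             # positive predictive value
f1 = 2 * tp / (2 * tp + fp + fn)
print(pooled)
print(f"accuracy={accuracy:.3f}  PPV={ppv:.3f}  F1={f1:.3f}")
```

Because every sample is predicted exactly once across the folds, the pooled matrix summarizes the model's out-of-sample behaviour on all 100 observations.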

The fact of the matter is that a testing metric (e.g., accuracy) lower than your training metric indicates you are overfitting your model, which is not something you want when trying to build a new predictive model.
