Solved – Training accuracy equal to test accuracy, how to interpret

cross-validation, machine learning, maximum likelihood, overfitting

As part of my Master's thesis, I'm designing a relatively simple biome distribution model. The aim of the model is to predict the spatial distribution of biomes (macro vegetation types) on Earth, using some (mainly climatic) data as input.
I have 24 different variables I can use as input for the model. First I ran a principal component analysis (http://desktop.arcgis.com/en/arcmap/latest/tools/spatial-analyst-toolbox/how-principal-components-works.htm) to sort the predictor variables in order of importance. Then I performed a forward selection, starting with the simplest model having just one predictor, up to the most complex model having all 24 input variables. The model is based on Maximum Likelihood Classification:

http://desktop.arcgis.com/en/arcmap/latest/tools/spatial-analyst-toolbox/how-maximum-likelihood-classification-works.htm

(sorry, don't have enough reputation to post another link…).
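To make the setup concrete, here is a minimal sketch of the ranking + forward-selection loop in Python. It assumes the 24 predictors are the columns of a NumPy array `X` (pixels × 24) and the biome labels are in `y`; ranking the variables by their absolute loading on the first principal component is one plausible reading of "sort predictor variables in order of importance with PCA", and `QuadraticDiscriminantAnalysis` stands in for ArcGIS's maximum likelihood classifier (both fit one Gaussian per class). This is an illustration under those assumptions, not the exact ArcGIS workflow.

```python
# Hedged sketch: rank the 24 predictors via PCA, then forward-select them,
# fitting a Gaussian maximum-likelihood classifier on the top-k variables.
# X (n_pixels x 24) and y (biome labels) are assumed to exist.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

pca = PCA().fit(X)
# Variables ordered by the magnitude of their loading on the first PC
ranking = np.argsort(-np.abs(pca.components_[0]))

models = {}
for k in range(1, X.shape[1] + 1):
    cols = ranking[:k]                                   # top-k variables so far
    clf = QuadraticDiscriminantAnalysis().fit(X[:, cols], y)
    models[k] = (cols, clf)                              # keep for later scoring
```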

I used ground-truth data (which I have available for the whole Earth) as training data. For every model, from the simplest to the most complex, I ran the classification and calculated Cohen's kappa statistic as a measure of goodness of fit. I got this training accuracy curve: [figure: training accuracy]
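For reference, Cohen's kappa on the training data can be computed directly from the predictions with scikit-learn; this sketch reuses the hypothetical `models`, `X` and `y` objects from the sketch above.

```python
# Cohen's kappa (agreement beyond chance) for one fitted model.
from sklearn.metrics import cohen_kappa_score

cols, clf = models[24]                        # e.g. the full 24-variable model
y_pred = clf.predict(X[:, cols])
train_kappa = cohen_kappa_score(y, y_pred)    # training-set kappa
```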

As I suspected that the model using all 24 variables was overfitting, I wrote code to perform a 3-fold cross-validation (as explained in Wikipedia's cross-validation article) and re-ran the calculations to get the test (or validation) accuracy curve. I was expecting this new curve to decrease after a global maximum, due to the effects of overfitting. Instead, the test accuracy curve came out (almost) identical to the training curve. How should I interpret this result? Does it mean that there was no overfitting in the first place?
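A sketch of the 3-fold cross-validation described above, again assuming `X`, `y` and the PCA-based `ranking` from the earlier sketch. For each model size k it records both the training and the held-out (test) kappa, so the two curves can be compared.

```python
# 3-fold CV: training and test kappa for each number of predictors.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import cohen_kappa_score

train_curve, test_curve = [], []
kf = KFold(n_splits=3, shuffle=True, random_state=0)
for k in range(1, X.shape[1] + 1):
    cols = ranking[:k]
    tr_scores, te_scores = [], []
    for tr_idx, te_idx in kf.split(X):
        clf = QuadraticDiscriminantAnalysis().fit(X[tr_idx][:, cols], y[tr_idx])
        tr_scores.append(cohen_kappa_score(y[tr_idx], clf.predict(X[tr_idx][:, cols])))
        te_scores.append(cohen_kappa_score(y[te_idx], clf.predict(X[te_idx][:, cols])))
    train_curve.append(np.mean(tr_scores))   # training kappa, averaged over folds
    test_curve.append(np.mean(te_scores))    # held-out kappa, averaged over folds
```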

Best Answer

Yes, if the test set performance does not dip below the training set performance under any circumstances, then you don't have over-fitting.

But it was still the right move to measure performance on unseen data; that should always be done. Training set performance is much more optional: most of the time you wouldn't even look at it.

I'm not entirely sure what you mean by ground-truth data for the whole Earth composing your training set, which it seems you later divided into a training set and a test set. It matters how you do this division: you cannot expect to train only on data covering Europe and test on Asia, for example, because your training data would not be representative of the test cases. (But as your test performance is good, you seem to be fine here; see the sketch below.)
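A sketch of a split that keeps the training data representative of the test cases, assuming the same `X` and `y` as above and that rows happen to be stored in geographic order. Setting `shuffle=True` mixes locations across folds instead of cutting the globe into contiguous blocks, and stratification preserves the biome proportions in every fold; this is one simple way to avoid the Europe-vs-Asia situation described above.

```python
# Shuffled, stratified 3-fold split: each fold mixes locations and keeps
# the class (biome) proportions, so training data stays representative.
from sklearn.model_selection import StratifiedKFold
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.metrics import cohen_kappa_score

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
fold_kappas = []
for tr_idx, te_idx in skf.split(X, y):
    clf = QuadraticDiscriminantAnalysis().fit(X[tr_idx], y[tr_idx])
    fold_kappas.append(cohen_kappa_score(y[te_idx], clf.predict(X[te_idx])))
```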

PS: PCA wouldn't sort your variables in order of importance. Importance would be defined in terms of how indicative a variable is of the class label, and PCA does not look at the class label.
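For contrast, here is a sketch of a label-aware "importance" ranking: mutual information between each predictor and the biome label does look at the class, so its ordering can differ from a PCA-based one. `X` and `y` are the same assumed arrays as in the sketches above.

```python
# Supervised ranking: mutual information between each predictor and the label.
import numpy as np
from sklearn.feature_selection import mutual_info_classif

mi = mutual_info_classif(X, y, random_state=0)
supervised_ranking = np.argsort(-mi)    # most informative predictor first
```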