Solved – Train accuracy < Test accuracy with regularization

logistic, overfitting, regularization

A friend and I were experimenting with the notMNIST data, logistic regression, and regularization.

Without regularization, we achieve a training accuracy (10k samples) of 78% and a test accuracy (15k samples) of 82%.

With regularization, we achieve a training accuracy of 84% and a test accuracy of 88%.

I cannot make sense of these results: training accuracy is not higher than test accuracy, so there seems to be no overfitting, and regularization therefore shouldn't help much. Yet in our case it gives a significant improvement.

Can you help me understand what is happening here?
Thanks in advance

Best Answer

Did you look at the distribution of the classes? This is most likely due to imbalanced class distributions. For example, suppose your data contains two class labels, 'A' and 'B', and 'A' occurs 80% of the time in the training set. Assume your classifier almost always predicts class 'A'. Then your training accuracy will be around 0.80. However, since the test samples are drawn at random, if class 'A' happens to be over-represented in the test set, say in a 90/10 ratio, your test accuracy will be about 0.90, i.e. test accuracy > training accuracy.
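
To make this concrete, here is a minimal sketch (hypothetical code, not the asker's; the 80/20 and 90/10 label mixes just mirror the example above) showing how a skewed split alone lets test accuracy exceed training accuracy:

    import numpy as np

    def class_proportions(y):
        """Return each label's fraction of the label array y."""
        labels, counts = np.unique(y, return_counts=True)
        return dict(zip(labels.tolist(), (counts / counts.sum()).tolist()))

    # Toy splits mirroring the 'A'/'B' example: 80/20 in training, 90/10 in test.
    y_train = np.array(['A'] * 80 + ['B'] * 20)
    y_test = np.array(['A'] * 90 + ['B'] * 10)

    print("train:", class_proportions(y_train))  # {'A': 0.8, 'B': 0.2}
    print("test: ", class_proportions(y_test))   # {'A': 0.9, 'B': 0.1}

    # A classifier that always predicts 'A' scores 0.80 on train and 0.90 on test,
    # so test accuracy beats training accuracy with no real generalization gain.
    always_a = np.full_like(y_test, 'A')
    print("test accuracy of 'always A':", (always_a == y_test).mean())  # 0.9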

Typically you would also see a low cross-validation score, and if you are using scikit-learn's StratifiedKFold, some values of K will produce warning messages (when a class has fewer samples than the number of folds).
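
As a sketch of that check (assumed setup with synthetic imbalanced data standing in for notMNIST; the parameters are illustrative), stratified K-fold CV keeps the class proportions equal in every fold, so the fold scores show whether a single lucky split is behind the gap:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Synthetic, imbalanced stand-in for the real features (class 0 gets ~80% of samples).
    X, y = make_classification(n_samples=2000, n_features=20,
                               weights=[0.8, 0.2], random_state=0)

    clf = LogisticRegression(C=1.0, max_iter=1000)  # C is the inverse regularization strength
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

    scores = cross_val_score(clf, X, y, cv=cv)
    print("fold accuracies:", np.round(scores, 3))
    print("mean +/- std: %.3f +/- %.3f" % (scores.mean(), scores.std()))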
