Solved – Learning curves – Why does the training accuracy start so high, then suddenly drop?

classification, cross-validation, logistic, scikit-learn

I implemented a model that uses Logistic Regression as the classifier, and I wanted to plot the learning curves for both the training and test sets to decide what to do next in order to improve my model.

Just to give you some context: to plot the learning curves I defined a function that takes as input a model, a pre-split dataset (train/test X and Y arrays, NB: split using the train_test_split function), and a scoring function, then trains the model on n exponentially spaced subsets of the training set and returns the learning curves.
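For reference, a minimal sketch of what such a function might look like (this is my reconstruction, not the asker's actual code; the name learning_curves and the parameter n_points are placeholders):

```python
import numpy as np

def learning_curves(model, X_train, y_train, X_test, y_test, scorer, n_points=10):
    # Exponentially spaced training-set sizes, from ~10 samples up to the full set
    sizes = np.unique(np.geomspace(10, len(X_train), n_points).astype(int))
    train_scores, test_scores = [], []
    for n in sizes:
        # Fit on the first n training samples only
        model.fit(X_train[:n], y_train[:n])
        train_scores.append(scorer(y_train[:n], model.predict(X_train[:n])))
        test_scores.append(scorer(y_test, model.predict(X_test)))
    return sizes, train_scores, test_scores
```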

My results are shown in the image below:

[Image: learning curves plotting training and test accuracy against training set size]

I wonder why the training accuracy starts so high, then suddenly drops, then starts to rise again as the training set size increases, and conversely for the test accuracy. I thought the extremely good accuracy and the subsequent fall were due to noise from the small subsets at the beginning, and that the curve started to rise once the subsets became large enough to be representative, but I am not sure. Can someone explain this?

And finally, can we conclude that these results indicate low variance and moderate bias (70% accuracy is not that bad in my context), so that to improve my model I should resort to ensemble methods or heavy feature engineering?

Best Answer

It is normal for your training accuracy to go down as the dataset size grows. Think of it this way: when you have few samples (imagine, at the extreme, that you have just one), it is easy to fit a model that achieves good accuracy on the training data, but that fitted model will not generalize well to test data. As you increase the dataset size, it generally becomes harder to fit the training data, but hopefully your results generalize better to the test data. So the shapes of your curves look fine.

Yes, it is true that your training accuracy increases a bit when your dataset gets really big, but I would say this is happening by chance, because of the particular data you are adding in that particular split. In practice, learning curves are never as perfect as one would expect in theory, and the plot you show actually looks very good. To convince yourself, just change the seed used to split the data. I'd bet you'll see a curve with roughly the same shape, but perhaps with the training accuracy increasing a bit only at the end of the curve, or in some other unexpected place.
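For example, assuming the split is done with scikit-learn's train_test_split, varying the seed is just a matter of looping over random_state values. A sketch, assuming X and y are your data and learning_curves is the helper described in the question:

```python
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Recompute the learning curves for several random splits; the overall shape
# should stay similar while the small bumps move around.
for seed in (0, 1, 42):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    sizes, tr, te = learning_curves(LogisticRegression(max_iter=1000),
                                    X_tr, y_tr, X_te, y_te, accuracy_score)
```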

Actually, your curve shows that above roughly 500 samples you are basically not improving your accuracy anymore. This indeed suggests that your problem is bias rather than variance, and that you could consider increasing the complexity of your model.
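For a logistic regression specifically, two easy ways to add capacity are weakening the L2 regularization (a larger C) and enriching the feature space with polynomial/interaction terms. A sketch, assuming X and y are your data:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import LogisticRegression

# Weaker regularization: C is the inverse of the regularization strength
clf_weak_reg = LogisticRegression(C=10.0, max_iter=1000)

# Or a richer feature space: degree-2 polynomial and interaction terms
clf_poly = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LogisticRegression(max_iter=1000),
)
clf_poly.fit(X, y)
```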

In this tutorial you will find some more explanations.

Hope it helps.