Training Data – Impact of Increasing Training Data on System Accuracy in Machine Learning

classification, dataset, machine-learning, precision-recall

Can someone summarize for me, with possible examples, in which situations increasing the training data improves the overall system?
And how do we detect that adding more training data could over-fit the data and not give good accuracy on the test data?

This is a very non-specific question, but if you want to answer it for a particular situation, please do so.

Best Answer

In most situations, more data is better. Overfitting is essentially learning spurious correlations that occur in your training data, but not in the real world. For example, if you considered only my colleagues, you might learn to associate "named Matt" with "has a beard." It's 100% valid ($n=4$, even!) when considering only the small group of people working on my floor, but it's obviously not true in general. Increasing the size of your data set (e.g., to the entire building or city) should reduce these spurious correlations and improve the performance of your learner.
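
Here's a rough sketch of what that looks like in practice; the synthetic dataset, the decision-tree model, and the training-set sizes are all arbitrary choices, just meant to illustrate the shrinking train/test gap as data grows:

```python
# Minimal sketch: watch the train/test accuracy gap (overfitting) shrink as the
# training set grows. The synthetic dataset, the decision tree, and the sizes
# below are arbitrary illustrative choices.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=20000, n_features=20, n_informative=5,
                           random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=5000,
                                                  random_state=0)

for n in [50, 200, 1000, 5000, 15000]:
    clf = DecisionTreeClassifier(random_state=0).fit(X_pool[:n], y_pool[:n])
    train_acc = accuracy_score(y_pool[:n], clf.predict(X_pool[:n]))
    test_acc = accuracy_score(y_test, clf.predict(X_test))
    # Small n: near-perfect training accuracy but mediocre test accuracy
    # (spurious splits get memorized). Larger n: the gap narrows.
    print(f"n={n:>6}  train={train_acc:.3f}  test={test_acc:.3f}")
```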

That said, one situation where more data does not help (and may even hurt) is if your additional training data is noisy or doesn't match whatever you are trying to predict. I once did an experiment where I plugged different language models[*] into a voice-activated restaurant reservation system. I varied the amount of training data as well as its relevance: at one extreme, I had a small, carefully curated collection of people booking tables, a perfect match for my application. At the other, I had a model estimated from a huge collection of classic literature, a more accurate language model overall, but a much worse match to the application. To my surprise, the small-but-relevant model vastly outperformed the big-but-less-relevant model.
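
If you want to see the same effect outside of speech, here's a toy sketch (synthetic data, nothing to do with the reservation system) where a small training set from the right distribution beats a much larger one from a shifted distribution; the distributions, sizes, and shift are all made-up assumptions:

```python
# Toy sketch of the "relevance beats size" effect: a small in-domain training
# set vs. a much larger out-of-domain one, both evaluated on in-domain test
# data. The distributions, sizes, and shift are all made-up assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def sample(n, shift=0.0):
    """Binary classification data; `shift` moves both class means (domain shift)."""
    y = rng.integers(0, 2, size=n)
    X = rng.normal(loc=y[:, None] * 2.0 + shift, scale=1.0, size=(n, 5))
    return X, y

X_small, y_small = sample(200)            # small, matches the test domain
X_big, y_big = sample(20000, shift=3.0)   # big, but from a shifted domain
X_test, y_test = sample(5000)             # what we actually care about

for name, (X, y) in [("small in-domain", (X_small, y_small)),
                     ("big out-of-domain", (X_big, y_big))]:
    acc = LogisticRegression(max_iter=1000).fit(X, y).score(X_test, y_test)
    # The small, relevant training set should beat the big, mismatched one here.
    print(f"{name:>18}: test accuracy = {acc:.3f}")
```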


A surprising situation, called **double descent**, also occurs when the size of the training set is close to the number of model parameters. In these cases, the test risk first decreases as the size of the training set increases, transiently *increases* when a bit more training data is added, and finally begins decreasing again as the training set continues to grow. This phenomenon was reported 25 years ago in the neural network literature (see Opper, 1995), but it occurs in modern networks too ([Advani and Saxe, 2017][1]). Interestingly, this happens even for linear regression, albeit one fit by SGD ([Nakkiran, 2019][2]). This phenomenon is not yet totally understood and is largely of theoretical interest: I certainly wouldn't use it as a reason not to collect more data (though I might fiddle with the training set size if $n \approx p$ and the performance were unexpectedly bad).
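
If you want to poke at this yourself, here's a quick numerical sketch of sample-wise double descent for unregularized linear regression. It uses the minimum-norm least-squares solution (what gradient descent from a zero initialization converges to for this problem) rather than literally running SGD; the dimension, noise level, and training-set sizes are arbitrary choices:

```python
# Quick numerical sketch of sample-wise double descent for unregularized linear
# regression. Uses the minimum-norm least-squares solution via np.linalg.pinv
# (what gradient descent from a zero initialization converges to here) rather
# than literal SGD; p, the noise level, and the grid of n are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
p = 100                                   # number of parameters / features
w_true = rng.normal(size=p) / np.sqrt(p)
X_test = rng.normal(size=(2000, p))
y_test = X_test @ w_true + rng.normal(scale=0.5, size=2000)

for n in [20, 50, 80, 95, 100, 105, 120, 200, 500, 1000]:
    errs = []
    for _ in range(20):                   # average over draws of the training set
        X = rng.normal(size=(n, p))
        y = X @ w_true + rng.normal(scale=0.5, size=n)
        w_hat = np.linalg.pinv(X) @ y     # minimum-norm least-squares fit
        errs.append(np.mean((X_test @ w_hat - y_test) ** 2))
    # Test MSE typically spikes near n == p, then falls again as n grows.
    print(f"n={n:>5}  mean test MSE = {np.mean(errs):.3f}")
```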

[*] A language model is just the probability of seeing a given sequence of words, e.g., $P(w_n = \textrm{quick},\, w_{n+1} = \textrm{brown},\, w_{n+2} = \textrm{fox})$. They're vital to building halfway decent speech/character recognizers.
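
For concreteness, such models just approximate the chain-rule factorization of that probability; here is a minimal worked version of the example above, where the bigram (first-order Markov) step in the second line is an added simplifying assumption:

$$
\begin{aligned}
P(\textrm{quick},\, \textrm{brown},\, \textrm{fox})
  &= P(\textrm{quick})\, P(\textrm{brown} \mid \textrm{quick})\, P(\textrm{fox} \mid \textrm{quick},\, \textrm{brown}) \\
  &\approx P(\textrm{quick})\, P(\textrm{brown} \mid \textrm{quick})\, P(\textrm{fox} \mid \textrm{brown}) \qquad \textrm{(bigram approximation)}
\end{aligned}
$$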