I'm a newbie learning ML and I have a doubt. We are normally told to increase the size of the training dataset, i.e. add more data, to reduce variance (I fairly understand why). Now, variance has an inverse relationship with bias, so when we add more data we reduce variance, or equivalently increase bias. Then why is it not possible to reduce bias by reducing the number of training samples? Could someone please explain this to me?
Bias – Does Reducing Training Dataset Size Decrease Bias?
bias, bias-variance tradeoff, machine learning, variance
Related Solutions
In most situations, more data is usually better. Overfitting is essentially learning spurious correlations that occur in your training data, but not the real world. For example, if you considered only my colleagues, you might learn to associate "named Matt" with "has a beard." It's 100% valid ($n=4$, even!) when considering only the small group of people working on my floor, but it's obviously not true in general. Increasing the size of your data set (e.g., to the entire building or city) should reduce these spurious correlations and improve the performance of your learner.
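To make the variance-reduction point concrete, here is a small simulation sketch (everything in it is made up for illustration: a noisy sine target and a deliberately flexible degree-9 polynomial). The flexible fit overfits badly on small samples, and its average test error falls as the training set grows:

```python
# Minimal sketch (illustrative assumptions: noisy sine target, degree-9 polynomial):
# a flexible model's test error shrinks as the training set grows,
# mostly because its variance shrinks.
import numpy as np

rng = np.random.default_rng(0)

def mean_test_mse(n_train, degree=9, n_test=1000, n_repeats=100):
    """Average test MSE of a polynomial fit, over many resampled training sets."""
    errs = []
    for _ in range(n_repeats):
        x_tr = rng.uniform(0, 1, n_train)
        y_tr = np.sin(2 * np.pi * x_tr) + rng.normal(0, 0.3, n_train)
        coeffs = np.polyfit(x_tr, y_tr, degree)          # flexible, high-variance fit
        x_te = rng.uniform(0, 1, n_test)
        y_te = np.sin(2 * np.pi * x_te) + rng.normal(0, 0.3, n_test)
        errs.append(np.mean((np.polyval(coeffs, x_te) - y_te) ** 2))
    return np.mean(errs)

for n in (15, 50, 200, 1000):
    print(f"n_train = {n:5d}   mean test MSE = {mean_test_mse(n):.3f}")
```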
That said, one situation where more data does not help (and may even hurt) is if your additional training data is noisy or doesn't match whatever you are trying to predict. I once did an experiment where I plugged different language models[*] into a voice-activated restaurant reservation system. I varied the amount of training data as well as its relevance: at one extreme, I had a small, carefully curated collection of people booking tables, a perfect match for my application. At the other, I had a model estimated from a huge collection of classic literature, a more accurate language model, but a much worse match to the application. To my surprise, the small-but-relevant model vastly outperformed the big-but-less-relevant model.
A surprising situation, called **double-descent**, also occurs when the size of the training set is close to the number of model parameters. In these cases, the test risk first decreases as the size of the training set increases, transiently *increases* when a bit more training data is added, and finally begins decreasing again as the training set continues to grow. This phenomenon was reported 25 years ago in the neural network literature (see Opper, 1995), but it occurs in modern networks too ([Advani and Saxe, 2017][1]). Interestingly, this happens even for linear regression, albeit one fit by SGD ([Nakkiran, 2019][2]). This phenomenon is not yet totally understood and is largely of theoretical interest: I certainly wouldn't use it as a reason not to collect more data (though I might fiddle with the training set size if $n = p$ and the performance were unexpectedly bad).
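If you want to poke at double descent yourself, here is a rough sketch of the linear-regression version (the setup is mine, not from the papers: Gaussian features, a fixed true linear signal, and the minimum-norm least-squares solution via the pseudoinverse standing in for the solution SGD converges to). The test risk typically spikes when $n$ is near $p$ and then falls again as $n$ keeps growing:

```python
# Double-descent sketch under assumed conditions: Gaussian features, a fixed
# linear signal, and the minimum-norm least-squares fit (pseudoinverse).
import numpy as np

rng = np.random.default_rng(1)
p = 50                                    # number of parameters (features)
beta = rng.normal(size=p) / np.sqrt(p)    # true coefficients
noise = 0.5

def test_risk(n_train, n_test=2000, n_repeats=50):
    """Average test MSE of the min-norm least-squares fit at a given training size."""
    risks = []
    for _ in range(n_repeats):
        X = rng.normal(size=(n_train, p))
        y = X @ beta + noise * rng.normal(size=n_train)
        beta_hat = np.linalg.pinv(X) @ y              # min-norm least squares
        X_te = rng.normal(size=(n_test, p))
        y_te = X_te @ beta + noise * rng.normal(size=n_test)
        risks.append(np.mean((X_te @ beta_hat - y_te) ** 2))
    return np.mean(risks)

for n in (10, 25, 45, 50, 55, 75, 150, 500):
    print(f"n = {n:4d} (p = {p})   test risk = {test_risk(n):.2f}")
```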
[*] A language model is just the probability of seeing a given sequence of words, e.g. $P(w_n = \text{'quick'}, w_{n+1} = \text{'brown'}, w_{n+2} = \text{'fox'})$. They're vital to building halfway decent speech/character recognizers.
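As a toy illustration of that footnote only (a made-up fourteen-word corpus and an unsmoothed bigram model, nothing like a real language model), the chain rule turns that joint probability into a product of conditional probabilities estimated from counts:

```python
# Toy bigram language model (illustrative assumptions: tiny made-up corpus,
# no smoothing): P(sequence) = P(w1) * prod of P(w_i | w_{i-1}) from counts.
from collections import Counter

corpus = "the quick brown fox jumps over the lazy dog the quick brown fox sleeps".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def sequence_prob(words):
    """Chain-rule probability of a word sequence under the bigram model."""
    prob = unigrams[words[0]] / len(corpus)
    for prev, cur in zip(words, words[1:]):
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob

print(sequence_prob(["quick", "brown", "fox"]))   # ~0.14 with this toy corpus
```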
You are referring to what is known as Sample Complexity in the PAC learning framework. There has been a significant amount of research in this area. In summary, in most real-world cases you never know the true sample complexity for a given dataset; however, you can bound it. The bounds are typically very loose and usually convey nothing more than the order of the number of examples required to reach a particular error with a particular probability.
For instance, to reach a prediction error within $\epsilon$ with high probability $(1 - \delta)$, you may need a number of samples proportional to some function of $\epsilon$ and $\delta$. For example, if your sample complexity is $O(1/\epsilon)$, you are better off than if it were $O(1/\epsilon^2)$: to reach a 1% error rate, in the former case you need $O(100)$ examples and in the latter $O(10000)$. But remember, these are still $O(\cdot)$ bounds and not exact numbers.
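To make those two rates concrete, a quick back-of-the-envelope sketch (pretending the hidden constants inside the $O(\cdot)$ are exactly 1, which the bounds never actually tell you):

```python
# Back-of-the-envelope comparison of the two sample-complexity rates above,
# assuming (unrealistically) a constant of 1 inside each O(.).
for eps in (0.1, 0.01, 0.001):
    n_linear = 1 / eps            # O(1/epsilon) regime
    n_quadratic = 1 / eps ** 2    # O(1/epsilon^2) regime
    print(f"epsilon = {eps:6.3f}   ~{n_linear:10.0f} vs ~{n_quadratic:12.0f} samples")
```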
If you look up sample complexity bounds of particular classes of algorithms, you'd get some idea. Some lecture notes here.
Best Answer
Not necessarily. A picture is worth a thousand words, so let me use the image below. (Check also the Intuitive explanation of the bias-variance tradeoff? thread.)
Imagine your model is an oracle that perfectly predicts the target: it will have no bias and no variance.
Imagine a model that always predicts the same constant (say, $42$): it will be biased regardless of how much data you use, because the result is independent of the data. The example is abstract, but not as abstract as you may think; for example, this would be the case for a Bayesian model with a very strong prior, or for using the wrong model for the job (e.g. image classification using a model that was designed for natural language processing). Such models are doomed to make bad predictions regardless of the data.
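Here is a quick sketch of that constant-$42$ example (the data-generating process is invented for illustration: a noisy linear target). The constant model's error is essentially pure bias and does not budge as $n$ grows, while an ordinary fitted model keeps its error low:

```python
# Sketch of the constant-predictor example (illustrative assumption: a noisy
# linear target). Always predicting 42 ignores the data, so its error (pure
# bias) stays put no matter how many samples you add.
import numpy as np

rng = np.random.default_rng(2)

def errors(n_train, n_test=5000):
    x_tr = rng.uniform(0, 10, n_train)
    y_tr = 3 * x_tr + 5 + rng.normal(0, 1, n_train)
    x_te = rng.uniform(0, 10, n_test)
    y_te = 3 * x_te + 5 + rng.normal(0, 1, n_test)
    slope, intercept = np.polyfit(x_tr, y_tr, 1)           # fitted linear model
    fitted_mse = np.mean((slope * x_te + intercept - y_te) ** 2)
    constant_mse = np.mean((42.0 - y_te) ** 2)              # model that ignores the data
    return fitted_mse, constant_mse

for n in (10, 100, 10000):
    fit, const = errors(n)
    print(f"n = {n:6d}   fitted MSE = {fit:6.2f}   constant-42 MSE = {const:8.2f}")
```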