Solved – In Machine Learning, how does getting more training examples fix high variance $(Var(\hat f(x_{0})))$

machine-learning, statistical-learning, variance

I don't believe that the linked question (Why does increasing the sample size lower the variance?) appropriately addresses my question.

The linked question explains why averaging random variables (all iid) produces a new RV whose variance is less than the variance of any one of them. In that setting, the effect of adding a "training example" is to add one more random variable to the average: $$ Y = \frac{\sum_{i = 1}^{n} X_{i}}{n}$$
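As a quick sanity check of that formula (not part of the original question; the Gaussian noise, the true variance $\sigma^{2} = 4$, and the sample sizes are arbitrary choices for the demo), a minimal simulation shows the empirical variance of $Y$ tracking $\sigma^{2}/n$:

```python
# Minimal sketch: the variance of the sample mean of n iid variables behaves like sigma^2 / n.
import numpy as np

rng = np.random.default_rng(0)
sigma2 = 4.0          # assumed true variance of each X_i
n_trials = 10_000     # Monte Carlo repetitions per sample size

for n in (1, 10, 100, 1000):
    # Draw n_trials independent samples of size n and average each one.
    samples = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(n_trials, n))
    means = samples.mean(axis=1)
    print(f"n={n:4d}  empirical Var(Y)={means.var():.4f}  sigma^2/n={sigma2 / n:.4f}")
```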

Adding a training example in an ML context has a very different effect (see: What is meant by the variance of *functions* in *Introduction to Statistical Learning*?):

$$ \mathcal{A} : \mathcal{T} \rightarrow \{f \mid f: X \rightarrow \mathbb{R} \} $$

A learning algorithm $\mathcal{A}$ is a mapping from any subset $t$ of the set of all training examples to some predictor $f_{t}$ (the one trained on $t$ using $\mathcal{A}$). Each subset of the training set has a certain probability of being selected, so adding another training example can fundamentally change the probability with which particular subsets are selected. This, in turn, alters $E[A_{x_{0}}]$, the expectation of the random variable $A_{x_{0}}: T \mapsto f_{t}(x_{0})$, where $f_{t}$ is the predictor given by a particular choice of training set $t \in \mathcal{T}$. As a result, it also changes the variance of $A_{x_{0}}$.
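To make $A_{x_{0}}$ concrete, here is a hedged Monte Carlo sketch. The sinusoidal data-generating process, the noise level, the subset size $n$, and the cubic-polynomial learner are assumptions introduced purely for illustration; repeatedly drawing training subsets $t$ and evaluating $f_{t}(x_{0})$ gives empirical estimates of $E[A_{x_{0}}]$ and $Var(A_{x_{0}})$:

```python
# Sketch of the random variable A_{x0}: sample training subsets t, learn f_t, evaluate f_t(x0).
import numpy as np

rng = np.random.default_rng(1)

# Assumed "set of all training examples": a finite population of noisy (x, y) pairs.
X_all = np.linspace(0.0, 2.0 * np.pi, 500)
y_all = np.sin(X_all) + rng.normal(scale=0.3, size=X_all.shape)

x0, n, n_subsets = np.pi / 4, 30, 2000
preds = np.empty(n_subsets)

for i in range(n_subsets):
    # Draw one training subset t of size n and learn f_t (here: a cubic polynomial fit).
    idx = rng.choice(X_all.size, size=n, replace=False)
    coef = np.polyfit(X_all[idx], y_all[idx], deg=3)
    preds[i] = np.polyval(coef, x0)   # f_t(x0)

print("estimate of E[A_{x0}]  :", preds.mean())
print("estimate of Var(A_{x0}):", preds.var())
```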

Andrew Ng's ML course claims that increasing the number of training examples in a training set (some subset of the set of all training examples) decreases the variance of the learning algorithm. Can this be shown rigorously using the above definition of variance, or at least be given some intuitive grounding? I can't find any explanation in the course as to why this might be true, only a learning curve showing the phenomenon.
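Not a general proof, but here is one special case (my own toy assumption, not an argument from the course) where the claim follows exactly: take the learner that ignores $x$ and always predicts the training-set average of the responses, trained on an iid sample of size $n$ with $Var(y_{i}) = v$. The averaging argument from the first formula then applies directly:

```latex
% Special case: the "predict the mean" learner, f_t(x_0) = (1/n) \sum_i y_i,
% trained on an iid sample t = {(x_1, y_1), ..., (x_n, y_n)} with Var(y_i) = v.
\begin{aligned}
Var(A_{x_0}) \;=\; Var\!\left(\frac{1}{n}\sum_{i=1}^{n} y_i\right)
             \;=\; \frac{1}{n^{2}}\sum_{i=1}^{n} Var(y_i)
             \;=\; \frac{v}{n} \;\xrightarrow[\;n \to \infty\;]{}\; 0 .
\end{aligned}
```

For a learner that actually uses $x$, no such closed form is obvious to me, hence the question.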

Best Answer

I will try to give an intuitive explanation by assuming an extreme case, although it is not a general proof.

Let us use the definitions in the question that you mentioned: What is meant by the variance of *functions* in *Introduction to Statistical Learning*?

We know the random variable $T$ takes a value from the space of possible training sets. Now assume this ideal condition: the training set is so large that it contains all of the possible training examples. More formally, the training set $T$ satisfies:

$$ |T| = \text{number of all possible training examples} $$

It is easy to show that in this situation the probability distribution of $T$ takes the value one at a single point and zero at all other points (i.e. it is an impulse, a point mass), because there is exactly one training set that satisfies the above equation.

Clearly, in this case, $T$ has zero variance, which in turn results in zero variance for the learned model $A_{x_{0}}$. Therefore, a very large training set drives the variance of the learning algorithm to zero.
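Spelling out that last step with the definitions above (writing $t^{*}$ for the single training set that carries all of the probability mass):

```latex
% Degenerate case: P(T = t^*) = 1 for a single training set t^*.
\begin{aligned}
E[A_{x_0}]   &= \sum_{t} P(T = t)\, f_t(x_0) \;=\; f_{t^*}(x_0),\\[4pt]
Var(A_{x_0}) &= \sum_{t} P(T = t)\,\bigl(f_t(x_0) - E[A_{x_0}]\bigr)^{2}
             \;=\; \bigl(f_{t^*}(x_0) - f_{t^*}(x_0)\bigr)^{2} \;=\; 0 .
\end{aligned}
```

The intuition, then, is that as the training set grows toward this extreme, the distribution of $T$ concentrates and the spread of $f_{T}(x_{0})$ shrinks accordingly.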
