Solved – Bias and Variance, Overfitting and Underfitting

Tags: bias-variance tradeoff, overfitting

I am trying to understand the concepts of bias and variance and their relationship with overfitting and underfitting. My current understanding of bias and variance is as follows. (The argument below is not rigorous, so I apologize for that.)

Suppose there is a function $f:\mathcal{X}\to\mathbb{R}$, and we are given a training set $\mathcal{D}=\{(x_i,y_i):1\le i\le m\}$, i.e., we know $y_i=f(x_i)$ for $1\le i\le m$, and that $(x_i,y_i)$ in $\mathcal{D}$ are sampled from the same distribution $P(x,y)$. We use the training set $\mathcal{D}$ to obtain a model $\hat{f_{\mathcal{D}}}$.

(a) The bias and variance of the learning algorithm are then the bias and variance, respectively, of the estimator $\hat{f_{\mathcal{D}}}(\hat{x})$ at an unseen example $\hat{x}$. Is this correct?

(b) The bias of $\hat{f_{\mathcal{D}}}(\hat{x})$ at an unseen example $\hat{x}$ is defined to be $\mathbb{E}[\hat{f_{\mathcal{D}}}(\hat{x})-f(\hat{x})]$, where the expectation is taken over all possible training sets $\mathcal{D}$. In other words, we repeat the procedure of sampling a training set $\mathcal{D}$ from $P(x,y)$ and obtaining a function $\hat{f_{\mathcal{D}}}$ for each such $\mathcal{D}$, and then compute the average of $\hat{f_{\mathcal{D}}}(\hat{x})-f(\hat{x})$ over these training sets. Is this a correct interpretation of bias? Different $\mathcal{D}$'s can give different values of $\hat{f_{\mathcal{D}}}(\hat{x})$, and the variance of $\hat{f_{\mathcal{D}}}(\hat{x})$ measures the expected squared deviation of $\hat{f_{\mathcal{D}}}(\hat{x})$ from its expected value. Am I correct?
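To make my reading of (b) concrete, here is a minimal Monte Carlo sketch of the procedure I have in mind (the sine true function, the uniform sampling of the $x_i$, and the straight-line fit are all arbitrary choices of mine, just to instantiate the setup): it resamples training sets, refits, and averages the predictions at a fixed $\hat{x}$.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # hypothetical "true" function (my arbitrary choice for illustration)
    return np.sin(x)

def learn(xs, ys):
    # learning algorithm: least-squares straight line (also an arbitrary choice)
    return np.poly1d(np.polyfit(xs, ys, deg=1))

x_hat = 0.5      # a fixed "unseen" example
m = 20           # training-set size
preds = []
for _ in range(10_000):
    # sample a training set D and fit f_D
    xs = rng.uniform(-np.pi, np.pi, size=m)
    ys = f(xs)   # y_i = f(x_i), exactly as in the setup above
    preds.append(learn(xs, ys)(x_hat))

preds = np.array(preds)
print("bias    :", preds.mean() - f(x_hat))   # E[f_D(x_hat)] - f(x_hat)
print("variance:", preds.var())               # Var[f_D(x_hat)]
```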

(c) I read that as the model capacity increases, the bias decreases and the variance increases. Although I can understand this somewhat intuitively, I don't quite understand it when I think in terms of the definition of bias. If the model capacity increases, then the model is prone to overfitting, so the learned model does worse on an unseen example $\hat{x}$; that is, $\hat{f_{\mathcal{D}}}(\hat{x})$ is not very close to $f(\hat{x})$. This will remain true if we repeat the procedure for different samples of the training set $\mathcal{D}$. Then why is the average error, which is the bias $\mathbb{E}[\hat{f_{\mathcal{D}}}(\hat{x})-f(\hat{x})]$, low? Is it because the positive and negative deviations balance each other out, giving a low bias?

(d) (very important) Can you explain why high capacity means low bias and high variance (and low capacity means high bias and low variance) using part (b)?

Best Answer

As you correctly stated in (a) and (b), bias and variance are expectations taken over repeated sampling of the training set. Even if the bias of a model were 0, any specific fitted model based on a particular training set will over- and underestimate the true function at certain values of the domain due to sampling error; however, over the distribution of possible training sets the model is expected to make no systematic errors at any value of the domain (i.e., it has no bias).

To address (c) and (d), it helps to think of a very simple model:

Let $\mathcal{D} = \{(x_i,y_i)\}$, where $(x_i,y_i)\sim \text{MultivariateNormal}(\mathbf{0},\mathbf{I})$.

This model basically says that the $y_i$ are just iid standard normal random variables, regardless of the value of $x_i$ -- so the true function is just the constant $f(x)=0$.

Now, let's pretend we don't know this. We could model this process using two extreme approaches:

  1. Fit a constant model: $E[y_i] = \mu$
  2. Fit an interpolating polynomial model $E[y_i] = \sum_{j=1}^{n} a_jx_i^{j-1}$

Using your terminology, the first approach is "low capacity" since it has only one free parameter, while the second approach is "high capacity" since it has $n$ parameters and fits every data point.

The first approach is correct, and so will have zero bias. It will also have low variance, since we are estimating a single parameter from $n$ data points.

Contrast this with the second approach: the fitted polynomial passes exactly through every training point. Averaged over all possible datasets of size $n$, the fit at an arbitrary $x$ not in our dataset will not be systematically high or low (even beyond the range of the data, the fitted polynomials will tend to swing in one direction or the other with equal frequency), so the bias is again essentially zero. However, because the polynomial fits all the noise (each $y_i$ is an iid standard Gaussian), the estimate at an arbitrary $x$ changes far more from one training set to the next than the constant model's estimate does; hence it has much higher variance when averaged over all possible training sets.
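A quick simulation makes this concrete. The sketch below (my own illustration; the training-set size and the test point are arbitrary choices) resamples the training set many times and estimates the bias and variance of both estimators at a fixed $x$; the interpolating fit should come out roughly unbiased but with far more variance than the constant fit.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6            # training-set size, kept small so the degree-(n-1) fit stays numerically stable
x_test = 0.3     # arbitrary point at which both estimators are compared
n_sims = 20_000

const_preds, interp_preds = [], []
for _ in range(n_sims):
    # (x_i, y_i) ~ MultivariateNormal(0, I): y is pure noise, independent of x
    xs = rng.standard_normal(n)
    ys = rng.standard_normal(n)
    const_preds.append(ys.mean())                                           # approach 1: fitted mu
    interp_preds.append(np.polyval(np.polyfit(xs, ys, deg=n - 1), x_test))  # approach 2: interpolant

for name, p in (("constant", const_preds), ("interpolating", interp_preds)):
    p = np.array(p)
    # the true f(x) is 0 everywhere, so the bias is just the mean prediction
    print(f"{name:13s} bias ~ {p.mean():+.3f}, variance ~ {p.var():.3f}")
```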

So, you see how having more degrees of freedom means you are fitting more of the random component in your data (assuming the true process is actually not that complex).

Let's change the true model: if we now assume that the true model is actually some linear function, then the constant model will systematically underestimate in some parts of the domain and overestimate in others, no matter how much data we collect. Thus, it will be biased at most values of $x$. However, since we are still using $n$ data points to fit a single parameter, the fitted value at $x$ will be less sensitive to the particular data points in the training set and hence will still have relatively low variance.

Compare this to fitting, say, a third-degree polynomial to the data. Here you will have little bias (since a linear model is nested within a third-degree model), but the extra parameters mean that the fitted polynomial is affected by the particular data far more than a constant model would be.

The key here is that in this second case neither model is ideal: the first trades some bias for less variance, while the second has less bias but more variance. Which one is "better" (in an MSE sense) must be determined using a test dataset or cross-validation.
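Here is a companion sketch for this second scenario (again my own illustration, with an arbitrary linear truth, noise level, and evaluation point): over repeated training sets, the constant fit should show a large systematic bias with small variance, while the cubic fit should be roughly unbiased with noticeably more variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_sims = 30, 20_000
x_test = 1.5                        # arbitrary evaluation point

def f_true(x):
    # assumed true linear model (my illustrative choice)
    return 2.0 * x + 1.0

const_preds, cubic_preds = [], []
for _ in range(n_sims):
    xs = rng.uniform(-2, 2, size=n)
    ys = f_true(xs) + rng.standard_normal(n)   # linear signal plus unit noise
    const_preds.append(ys.mean())                                      # constant fit
    cubic_preds.append(np.polyval(np.polyfit(xs, ys, deg=3), x_test))  # third-degree fit

for name, p in (("constant", const_preds), ("cubic", cubic_preds)):
    p = np.array(p)
    print(f"{name:8s} bias ~ {p.mean() - f_true(x_test):+.3f}, variance ~ {p.var():.4f}")
```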
