As you correctly stated in (a) and (b), bias and variance are expected values evaluated over repeated sampling of a training set. Even if a model's bias were 0, any specific fitted model based on a particular training set will overestimate the true function at some values of the domain and underestimate it at others, due to sampling error; however, over the distribution of possible training sets the model is expected to have no systematic error at any value of the domain (i.e., no bias).
To address (c) and (d), it helps to think of a very simple model:
Let $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$, where $(x_i, y_i) \sim \text{MultivariateNormal}(\mathbf{0}, \mathbf{I})$.
This model says that the $y_i$ are iid standard normal random variables regardless of the value of $x_i$, so the true regression function is just the constant $f(x) = 0$.
Now, let's pretend we don't know this. We could model this process using two extreme approaches:
- Fit a constant model: $E[y_i] = \mu$
- Fit an interpolating polynomial model $E[y_i] = \sum_{j=1}^{n} a_jx_i^{j-1}$
Using your terminology, the first approach is "low capacity" since it has only one free parameter, while the second approach is "high capacity" since it has $n$ parameters and fits every data point.
The first approach is correct, so it will have zero bias. It will also have low variance, since we are using all $n$ data points to estimate a single parameter.
Contrast this with the second approach: at every training point, $f(x_i)$ reproduces the observed $y_i$ exactly. Averaged over all possible datasets of size $n$, the fit at an arbitrary $x$ not in our dataset will not be systematically high or low (even beyond the range of the data, the fitted polynomials swing up or down with roughly equal frequency), so this model is also unbiased. However, because it fits all the noise (each $y_i$ is an iid standard Gaussian), its prediction at an arbitrary $x$ varies far more from one training set to the next than the constant model's does, hence it has much higher variance when averaged over all possible training sets.
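If it helps to see this concretely, here is a minimal simulation sketch of the two extreme fits (plain numpy; the sample size, number of replications, and test point are my own arbitrary choices, not part of the argument above). Both estimators come out roughly unbiased at the test point, but the interpolating polynomial's predictions vary far more across training sets:

```python
import numpy as np

# Assumed setup (illustrative): n = 5 points per training set,
# 10,000 simulated training sets, a single test point x0 = 0.5.
rng = np.random.default_rng(0)
n, n_sims, x0 = 5, 10_000, 0.5

const_preds, interp_preds = [], []
for _ in range(n_sims):
    # True process: x and y are independent standard normals, so f(x) = 0.
    x = rng.standard_normal(n)
    y = rng.standard_normal(n)

    # "Low capacity": constant model, fitted by the sample mean.
    const_preds.append(y.mean())

    # "High capacity": degree-(n-1) polynomial that interpolates every point.
    coefs = np.polyfit(x, y, deg=n - 1)
    interp_preds.append(np.polyval(coefs, x0))

for name, preds in [("constant", const_preds), ("interpolating", interp_preds)]:
    preds = np.asarray(preds)
    # True value is f(x0) = 0, so the mean prediction estimates the bias
    # and the spread of predictions estimates the variance.
    print(f"{name:>13}: bias ~ {preds.mean():+.3f}, variance ~ {preds.var():.3f}")
```

The interpolating polynomial's variance can come out enormous, because a few training sets with closely spaced $x_i$ produce wild oscillations; that is exactly the point.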
So, you see how having more degrees of freedom means you are fitting more of the random component in the data (assuming the true process is not actually that complex).
Let's change the true model: if we now assume that our true model is actually some linear model, then what we will see is that the constant model will systematically under-estimate in some parts of the domain and over-estimate in others, no matter how much data we collect. Thus, it will have bias for most values of $x$. However, since we are still using $n$ data points to fit a single parameter, the value of $f(x)$ will be less sensitive to particular data points in the training set and hence will still have relatively low variance.
Compare this to fitting, say, a third-degree polynomial model to the data. Here you will not have much bias (since a linear model is contained within a third-degree model), but the extra parameters mean that the fitted polynomial will be affected by the particular training data far more than a constant model would be.
The key here is that in this second case, both models are wrong -- but, the first one trades some bias for less variance while the second has increased variance but less bias. Which one is "better" (in an MSE sense) must be determined using a test dataset or cross-validation.
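Here is a small sketch of that second scenario (the slope, noise level, sample size, and test grid are illustrative choices of mine). In the simulation we can decompose the error directly because we know the true function; in practice you would have to estimate the test MSE with held-out data or cross-validation:

```python
import numpy as np

# Assumed true model for illustration: y = 2x + standard Gaussian noise.
rng = np.random.default_rng(1)
n, n_sims = 20, 5_000
x_test = np.linspace(-1, 1, 21)
f_true = 2 * x_test  # true regression function on the test grid

preds = {"constant (deg 0)": [], "cubic (deg 3)": []}
for _ in range(n_sims):
    x = rng.uniform(-1, 1, n)
    y = 2 * x + rng.standard_normal(n)
    # Constant fit: the sample mean, ignoring x entirely.
    preds["constant (deg 0)"].append(np.full_like(x_test, y.mean()))
    # Cubic fit: contains the linear truth, but has three extra parameters.
    preds["cubic (deg 3)"].append(np.polyval(np.polyfit(x, y, 3), x_test))

for name, p in preds.items():
    p = np.asarray(p)
    bias2 = ((p.mean(axis=0) - f_true) ** 2).mean()  # squared bias, averaged over x
    var = p.var(axis=0).mean()                       # variance, averaged over x
    print(f"{name}: bias^2 ~ {bias2:.3f}, variance ~ {var:.3f}, sum ~ {bias2 + var:.3f}")
```

The constant model carries substantial squared bias but little variance, while the cubic model is nearly unbiased but noticeably more variable, which is the trade-off described above.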
I'll try to answer in the simplest way. Each of those problems has its own main origin:
Overfitting: data is noisy, meaning there are deviations from reality (due to measurement errors, influential random factors, unobserved variables, and spurious correlations) that make it harder for us to see the true relationship with our explanatory variables. The data is also usually incomplete (we don't have examples of everything).
As an example, say I am trying to classify boys and girls based on their height, because that's the only information I have about them. We all know that even though boys are taller on average than girls, there is a huge overlap region, making it impossible to separate them perfectly with that single piece of information. Depending on the density of the data, a sufficiently complex model might achieve a better success rate on the training dataset than is theoretically possible, because it can draw boundaries that let individual points stand alone. So, if the only person in the data who is 2.04 meters tall is a woman, the model could draw a little circle around that height, meaning that any random person who is 2.04 meters tall is most likely a woman.
The underlying reason for all of this is trusting the training data too much (in the example, since there is no man of height 2.04 m in the data, the model concludes that only women can be that tall).
Underfitting is the opposite problem, in which the model fails to recognize the real complexities in our data (i.e., the non-random structure). The model assumes the noise is greater than it really is and therefore uses an overly simplistic shape. So, if the dataset contains many more girls than boys for whatever reason, the model could just classify everyone as a girl.
In this case, the model didn't trust the data enough and assumed all the deviations were noise (in the example, the model effectively assumes that boys do not exist).
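Here is a rough sketch of both failure modes on made-up height data (the numbers, the class imbalance, and the choice of a 1-nearest-neighbour model as the "too complex" model and a majority-class model as the "too simple" one are my own illustration, not part of the answer above):

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.neighbors import KNeighborsClassifier

# Invented toy data: heights in metres, deliberately imbalanced classes,
# plus one very tall girl at 2.04 m.
rng = np.random.default_rng(2)
n_girls, n_boys = 120, 80
girls = rng.normal(1.65, 0.07, n_girls)
boys = rng.normal(1.78, 0.07, n_boys)
X = np.concatenate([girls, boys, [2.04]]).reshape(-1, 1)
y = np.array([0] * n_girls + [1] * n_boys + [0])  # 0 = girl, 1 = boy

# Overfitting: 1-nearest-neighbour memorises the training set, so it carves
# out a tiny "girl" region around 2.04 m.
overfit = KNeighborsClassifier(n_neighbors=1).fit(X, y)
print("1-NN prediction at 2.03 m:", overfit.predict([[2.03]]))  # most likely 0 (girl)

# Underfitting: a majority-class model ignores height entirely and always
# predicts "girl" because girls are more frequent in this sample.
underfit = DummyClassifier(strategy="most_frequent").fit(X, y)
print("Majority prediction at 1.90 m:", underfit.predict([[1.90]]))  # 0 (girl)
```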
Bottom line is that we face these problems because:
- We don't have complete information.
- We don't know how noisy the data is (we don't know how much we should trust it).
- We don't know in advance the underlying function that generated our data, and thus the optimal model complexity.
Overfitting is likely to be worse than underfitting. The reason is that there is no real upper limit to the degradation of generalisation performance that can result from over-fitting, whereas there is for underfitting.
Consider a non-linear regression model, such as a neural network or polynomial model. Assume we have standardised the response variable. A maximally underfitted solution might completely ignore the training set and have a constant output regardless of the input variables. In this case the expected mean squared error on test data will be approximately the variance of the response variable in the training set.
Now consider an over-fitted model that exactly interpolates the training data. Doing so may require large excursions from the true conditional mean of the data-generating process between points in the training set, for example the spurious peak at about x = -5 in the figure from the Wikipedia article linked below. If the first three training points were closer together on the x-axis, the peak would likely be even higher. As a result, the test error for such points can be arbitrarily large, and hence the expected MSE on test data can likewise be arbitrarily large.
Source: https://en.wikipedia.org/wiki/Overfitting (it is actually a polynomial model in this case, but see below for an MLP example)
Edit: As @Accumulation suggests, here is an example where the extent of overfitting is much greater (10 randomly selected data points from a linear model with Gaussian noise, fitted exactly by a 10th-order polynomial). Happily, the first run of the random number generator happened to give some points that were not very well spaced out!
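Here is a rough numerical version of that comparison (my own choices of noise level and test grid, and a degree-9 polynomial, which is just high enough to interpolate the 10 points exactly):

```python
import numpy as np

# Illustrative setup: 10 points from a linear model with Gaussian noise,
# fitted (i) by a constant and (ii) by an interpolating polynomial.
rng = np.random.default_rng(3)
x_train = np.sort(rng.uniform(-1, 1, 10))
y_train = x_train + 0.2 * rng.standard_normal(10)  # linear truth + noise

x_test = np.linspace(-1, 1, 200)
y_true = x_test  # true regression function on the test grid

# Maximally underfitted model: a constant, so its test MSE is bounded by
# roughly the variance of the response.
mse_const = np.mean((y_true - y_train.mean()) ** 2)

# Interpolating model: the error between training points can blow up.
coefs = np.polyfit(x_train, y_train, deg=9)  # exact interpolation of 10 points
mse_interp = np.mean((y_true - np.polyval(coefs, x_test)) ** 2)

print(f"constant model test MSE:      {mse_const:10.3f}")
print(f"interpolating model test MSE: {mse_interp:10.3f}")
```

The constant model's test MSE stays near the variance of the response, while the interpolating model's can be orders of magnitude larger depending on how the training points happen to fall, which is the asymmetry argued above.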
It is worth making a distinction between "overfitting" and "overparameterisation". Overparameterisation means you have used a model class that is more flexible than necessary to represent the underlying structure of the data, which normally implies a larger number of parameters. "Overfitting" means that you have optimised the parameters of a model in a way that gives a better "fit" to the training sample (i.e. a better value of the training criterion), but to the detriment of generalisation performance. You can have an over-parameterised model that does not overfit the data.

Unfortunately the two terms are often used interchangeably, perhaps because in earlier times the only real control of overfitting was achieved by limiting the number of parameters in the model (e.g. feature selection for linear regression models). However, regularisation (cf. ridge regression) decouples overparameterisation from overfitting, but our use of the terminology has not reliably adapted to that change (even though ridge regression is almost as old as I am!).
Here is an example that was actually generated using an (overparameterised) MLP
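And here is a small code sketch of the same distinction (the data, polynomial degree, and ridge penalty are illustrative choices of mine, not the model behind the figure): the same overparameterised model class, fitted with and without regularisation, typically generalises very differently.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Illustrative data: 12 noisy points from a linear truth, modelled with a
# degree-10 polynomial (clearly overparameterised for this problem).
rng = np.random.default_rng(4)
x = np.sort(rng.uniform(-1, 1, 12)).reshape(-1, 1)
y = x.ravel() + 0.2 * rng.standard_normal(12)

x_test = np.linspace(-1, 1, 200).reshape(-1, 1)
y_true = x_test.ravel()

for name, reg in [("unregularised", LinearRegression()),
                  ("ridge", Ridge(alpha=1e-2))]:
    # Same overparameterised feature set; only the fitting criterion changes.
    model = make_pipeline(PolynomialFeatures(degree=10), reg).fit(x, y)
    mse = np.mean((model.predict(x_test) - y_true) ** 2)
    print(f"degree-10 polynomial, {name:>13}: test MSE ~ {mse:.3f}")
```

The unregularised fit chases the noise and typically shows a much larger test MSE, while the ridge-penalised fit uses the same over-flexible model class without overfitting.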