As you correctly stated in (a) and (b), bias and variance are expected values evaluated over repeated sampling of a training set. Even if the bias were 0 for a model, any specific fitted model based on a particular training set will over- and under-estimate the true function at certain values of the domain due to sampling error; however, over the distribution of possible training sets the model is expected to have no systematic errors at any value of the domain (i.e., no bias).
To address (c) and (d), it helps to think of a very simple model:
Let $\mathcal{D} = \{(x_i, y_i)\}, \text{ where } (x_i, y_i) \sim \text{MultivariateNormal}(\mathbf{0}, \mathbf{I})$
This model says that the $y_i$ are just iid standard normal random variables, regardless of the value of $x_i$ -- so the true regression function is simply the constant $f(x) = 0$.
Now, let's pretend we don't know this. We could model this process using two extreme approaches:
- Fit a constant model: $E[y_i] = \mu$
- Fit an interpolating polynomial model $E[y_i] = \sum_{j=1}^{n} a_jx_i^{j-1}$
Using your terminology, the first approach is "low capacity" since it has only one free parameter, while the second approach is "high capacity" since it has $n$ parameters and fits every data point.
The first approach is correctly specified, and so will have zero bias. It will also have low variance, since we are estimating a single parameter from $n$ data points (indeed, $\hat{\mu} = \bar{y}$ has variance $1/n$ here).
Contrast this with the second approach: for any given training set, the fitted polynomial passes exactly through every training point. Averaged over all possible datasets of size $n$, the fit at an arbitrary $x$ not in the training set will be neither systematically high nor low (even in extrapolation beyond the data, the fitted polynomials swing one way or the other with equal frequency), so this model is also unbiased. However, because it fits all of the noise (each $y_i$ is an iid standard Gaussian), the prediction at any given $x$ varies far more from one training set to the next than the constant model's does -- hence it has much higher variance when averaged over all possible training sets.
So you can see how having more degrees of freedom means you end up fitting more of the random component of the data (assuming the true process is actually not that complex).
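The two extremes above are easy to check by simulation. The sketch below (assuming numpy; the seed, training-set size, and query point are arbitrary choices) repeatedly draws training sets from the true "constant zero" process, fits both models, and compares the spread of their predictions at one query point:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6            # training-set size; kept small so interpolation stays numerically tame
x_test = 0.5     # arbitrary query point not in the training sets
trials = 2000

const_preds = np.empty(trials)
interp_preds = np.empty(trials)
for t in range(trials):
    x = rng.normal(size=n)
    y = rng.normal(size=n)                 # true process: y ~ N(0,1), independent of x
    const_preds[t] = y.mean()              # constant model: fitted mean
    coeffs = np.polyfit(x, y, deg=n - 1)   # degree-(n-1) polynomial interpolates all n points
    interp_preds[t] = np.polyval(coeffs, x_test)

print(f"constant model: mean {const_preds.mean():+.3f}  variance {const_preds.var():.3f}")
print(f"interpolating:  mean {interp_preds.mean():+.3f}  variance {interp_preds.var():.3f}")
```

Both mean predictions hover around the true value 0 (neither model is biased here), but the constant model's variance sits near the theoretical $1/n$, while the interpolating polynomial's variance is orders of magnitude larger.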
Let's change the true model: if we now assume that the true model is actually some linear model, then the constant model will systematically under-estimate in some parts of the domain and over-estimate in others, no matter how much data we collect. Thus, it will have bias at most values of $x$. However, since we are still using $n$ data points to fit a single parameter, $\hat{f}(x)$ will be less sensitive to the particular data points in the training set and hence will still have relatively low variance.
Compare this to fitting, say, a third-degree polynomial model to the data. Here, you will not have much bias (since every linear model is contained in the class of third-degree polynomials), but the extra parameters mean the fitted polynomial is more sensitive to the particular training data than a constant model would be.
The key here is that in this second case, both models are wrong -- but, the first one trades some bias for less variance while the second has increased variance but less bias. Which one is "better" (in an MSE sense) must be determined using a test dataset or cross-validation.
However, what impact does training data size have on a high bias model? Generally, will more training data lower the bias, will it have no effect, or will it cause a further increase in the bias?
You mean a model with prediction errors due to high bias?
Bias is defined as $\operatorname{Bias}[\hat{f}(x)]=\mathrm{E}[\hat{f}(x)]-f(x)$ and thus would not be affected by increasing the training set size. If your model predicts vastly different values when the training set changes, i.e., if the error is largely driven by the variance of the predictions, then you can improve the overall loss with more training data, because the model will generalize better and the variance term will go down. To decrease the bias term, you probably need to choose a different model.
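A small simulation makes this concrete (assuming numpy; the linear truth $f(x) = 2x$, the query point, and the sample sizes are illustrative assumptions). Fitting the constant model to a linear truth at increasing $n$, the bias stays put while the variance shrinks toward zero:

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: 2.0 * x          # assumed true model: linear
x0, trials = 1.5, 1000         # x0: query point where bias/variance are measured

results = {}
for n in (10, 100, 1000):
    preds = np.empty(trials)
    for t in range(trials):
        x = rng.uniform(-2, 2, size=n)
        y = f(x) + rng.normal(size=n)
        preds[t] = y.mean()    # constant model: misspecified, hence biased at x0
    results[n] = (preds.mean() - f(x0), preds.var())

for n, (bias, var) in results.items():
    print(f"n={n:4d}  bias {bias:+.2f}  variance {var:.4f}")
```

The bias at $x_0$ stays near $-f(x_0) = -3$ regardless of $n$, while the variance falls roughly as $1/n$: more data attacks the variance term only.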
Best Answer
I presume you are interested in the intrinsic quality of an algorithm. This is a non-trivial question and a topic of active research.
Bounds on the bias and variance of an algorithm can be proven via the notion of algorithmic stability - see:
The Arizona paper shows the proof for the k-NN and 1-NN algorithms, which are nearly unbiased (page 4). You will have to read the other papers for other kinds of algorithms. Note that not all algorithms have proofs yet, and that there are many different forms of stability with their corresponding bounds.
A different (but related) approach is to look into VC theory https://en.wikipedia.org/wiki/Vapnik%E2%80%93Chervonenkis_theory