Fewer neighbors usually mean closer neighbors (unless several neighbors are equally distant from the point of interest $x_0$). Modelling $x_0$ as a function of only the few closest neighbors, i.e. the most similar data points, allows for high flexibility (utilizing the features of the closest data points but not the ones farther away) and thus low bias but high variance. Including more neighbors results in less flexibility (higher smoothness, utilizing the features of not only the closest data points but also the ones farther away) and thus higher bias but lower variance.
Take an extreme example: I can model you as equalling your twin brother or a person that is the most similar to you in the whole world ($k=1$). This is highly flexible (low bias), but relying on a single data point is very risky (high variance). Or I can model you as an average (in regression) or mode (in classification) of all the people on the planet ($k=N$). This is highly inflexible (high bias) but very robust (low variance).
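The two extremes above are easy to see in code. Here is a minimal k-NN regressor in plain NumPy; the sinusoidal truth, sample size, and noise level are made-up choices for illustration, not anything from the question:

```python
import numpy as np

def knn_predict(x0, x, y, k):
    """Predict at x0 as the average y of the k nearest training points."""
    nearest = np.argsort(np.abs(x - x0))[:k]
    return y[nearest].mean()

rng = np.random.default_rng(0)
n = 50
x = np.sort(rng.uniform(0, 10, n))
y = np.sin(x) + rng.normal(0, 0.3, n)   # noisy observations of a smooth truth

x0 = 5.0
flexible = knn_predict(x0, x, y, k=1)   # k=1: tracks the single closest point
rigid = knn_predict(x0, x, y, k=n)      # k=n: the global average, maximally smooth
```

With `k=1` the prediction inherits all the noise of one observation; with `k=n` it collapses to the overall mean, the same value no matter where $x_0$ is.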
Welcome Chukwudi. The definitions can be a bit confusing, but it looks like you are thinking of bias and variance as properties of the numerical estimates themselves instead of as properties of the estimator (i.e. of the function or algorithm that produces the numerical value of the estimates). I suggest you look at the mathematical definitions of these terms again before continuing, paying attention to the notation. It's super important in stats to understand the notation and what is random and what is not. Notation will usually indicate what the "truth" is (not random... for now, at least), what the observed data are (some parts random, some not), and what the estimator is (almost always random, because estimators are functions of random variables).
- It might be more intuitive to think of bias as "on average, how wrong will my numerical estimate be, compared to the true value if I use this method/estimator" and variance as "on average, how much will my numerical estimate change if I use this method/estimator on different datasets."
- Again, remember that both bias and variance are expected values (i.e. averages over all datasets), and that bias is the average of the estimate minus the true value. In your examples above, you have not assumed any true values that I can see.
- Even if you had assumed true values, your examples are not a fair comparison. It's not fair to compare the bias of one estimator (your 1-NN example) against the average bias of three estimators (your 3-NN example). To do this kind of numerical comparison fairly, you need to 1) apply both algorithms to the same dataset (the same y and x values), 2) compare the two resulting estimates, and then 3) generate new data with the same true values and variance, and do it again, and again...
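That repeated-simulation comparison can be sketched in a few lines. Everything specific here (the sinusoidal truth, the design points, $\sigma$, the number of replicates) is an assumption made for the sketch, not something from the question:

```python
import numpy as np

rng = np.random.default_rng(1)
f = np.sin                        # an assumed "truth" for the simulation
x = np.linspace(0, 3, 30)         # the same design points in every replicate
x_new, sigma, reps = 1.5, 0.5, 2000
order = np.argsort(np.abs(x - x_new))

est_1nn = np.empty(reps)
est_3nn = np.empty(reps)
for r in range(reps):
    y = f(x) + rng.normal(0, sigma, x.size)   # fresh noise, same truth
    est_1nn[r] = y[order[0]]                  # 1-NN estimate at x_new
    est_3nn[r] = y[order[:3]].mean()          # 3-NN estimate at x_new

bias_1nn = est_1nn.mean() - f(x_new)
bias_3nn = est_3nn.mean() - f(x_new)
var_1nn, var_3nn = est_1nn.var(), est_3nn.var()
```

Both estimators see the same datasets, so the comparison is fair; across replicates the 3-NN variance comes out close to a third of the 1-NN variance, at the price of (usually) a bit more bias.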
A few quick examples to clarify and develop intuition:
For the following examples, assume that the true data-generating mechanism is $y_i = f(x_i) + \epsilon_i$, where $\epsilon$ is just noise (i.e. mean zero, but nonzero variance $\sigma^2$). Pretend that I am trying to estimate/predict a new $y_{new}$ associated with a new $x_{new}$. Bias then is defined as $E[\hat{y}_{new} - f(x_{new})]$.
Consider a (dumb) estimating function/estimator, $\hat{y}_{new}=6$. That is, I don't care what the surrounding $y$ values are, I don't care what the $x$ value is, I just think it's 6. This will always be biased except when the underlying truth is coincidentally $f(x_{new})=6$. But this estimator has zero variance, because no matter what my data are, I still just estimate $\hat{y}_{new}=6$.
On the other hand, consider the estimator $\hat{y}_{new}=\bar{y}$ (n-NN regression). Now at least I'm learning a tiny bit from my data, so I'm probably less biased than if I had just chosen a random number such as 6... but my bias is still large: for any prediction I try to make, I'm probably very biased unless the true mean value of what I am trying to predict ($f(x_{new})$) happens to be close to the overall true mean value of the observations in my data (the average of $f(x_i)$ over all $x_i$ in my data). And the variance of my estimator is now non-zero ($\sigma^2/n$).
Notice in both of the above cases, I only get small bias in particular locations or by dumb luck. I'm very likely to be doing a bad job of fitting most of the data. (i.e. underfitting / oversmoothing from oversimplified models).
Now consider 1-NN regression. Here I probably have pretty small prediction bias in most places: if I have any data point close to the one I am predicting, I should do fairly well in terms of bias (assuming the function is reasonably smooth). However, the variance of my prediction is $\sigma^2$, since my prediction is just the value of the closest observation (which has variance $\sigma^2$); i.e. the variance is $n$ times larger than in the n-NN case, 2 times larger than 2-NN, 3 times larger than 3-NN, and so on.
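Those variance claims are easy to verify by simulation. The quadratic truth, sample size, and $\sigma$ below are arbitrary choices for the sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
n, sigma, reps = 20, 1.0, 5000
x = np.linspace(0, 1, n)
f = lambda t: t ** 2              # any reasonably smooth truth will do
x_new = 0.5
nearest = np.argmin(np.abs(x - x_new))

pred_1nn = np.empty(reps)
pred_nnn = np.empty(reps)
for r in range(reps):
    y = f(x) + rng.normal(0, sigma, n)
    pred_1nn[r] = y[nearest]      # 1-NN prediction: variance sigma^2
    pred_nnn[r] = y.mean()        # n-NN prediction: variance sigma^2 / n
```

Across the replicates, the 1-NN predictions have variance close to $\sigma^2 = 1$ and the n-NN predictions close to $\sigma^2/n = 0.05$, exactly the $n$-fold gap described above.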
Hope this helps!
First of all, the bias of a classifier is the discrepancy between its average estimated function and the true function, whereas the variance of a classifier is the expected divergence of the estimated prediction function from its average value (i.e. how dependent the classifier is on the random sampling of the training set).
Hence, the presence of bias indicates something basically wrong with the model itself; variance is also bad, but a model with high variance could at least predict well on average.
The key to understanding the examples that generate Figures 2.7 and 2.8 is:
Recall that the target function of the example generating Figure 2.7 depends on all $p$ variables, and hence the MSE is largely due to the bias.
Conversely, in Figure 2.8 the target function of the example depends on only $1$ variable, and thus the variance dominates. More generally, this happens when you are dealing with low dimensions.
I hope this helps.