The bias is low, because you fit your model only to the single nearest point. This means your model will stay really close to your training data.
The variance is high, because relying on only the single nearest point means you are very likely to model the noise in your data. Following your definition above, your model will depend heavily on the particular subset of data points you happen to choose as training data. If you randomly re-draw the points you choose, the model will be dramatically different in each iteration. So the
expected divergence of the estimated prediction function from its average value (i.e. how dependent the classifier is on the random sampling made in the training set)
will be high, because the model will be different each time.
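As a rough sketch of that effect, the following toy simulation (plain NumPy, synthetic two-class data, all numbers arbitrary) refits 1-NN on random subsamples of a noisy training set and shows how the predicted label at one fixed query point keeps flipping:

```python
import numpy as np

rng = np.random.default_rng(1)

# Two overlapping noisy 2-D classes -- purely synthetic, all numbers arbitrary.
X = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(100, 2)),
               rng.normal([1.0, 1.0], 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

x0 = np.array([0.5, 0.5])  # fixed query point in the overlap region

labels = []
for _ in range(200):
    idx = rng.choice(len(X), size=60, replace=False)           # random training subset
    nearest = np.argmin(np.linalg.norm(X[idx] - x0, axis=1))   # index of the 1-nearest point
    labels.append(y[idx][nearest])                             # 1-NN prediction for x0

# The predicted class flips from subset to subset: the 1-NN model is different each time.
print("fraction of resamples predicting class 1:", np.mean(labels))
```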
Example
In general, a k-NN model predicts a specific point from the k nearest data points in your training set; for 1-NN the prediction depends on only a single other point. Say you want to split your samples into two groups (classification), red and blue. If the four nearest neighbors of a certain point p are red, blue, blue, blue (in ascending order of distance to p), then a 4-NN model classifies p as blue (three blue votes against one red), but a 1-NN model classifies it as red, because red is the single nearest point. This means that your model stays really close to your training data and therefore the bias is low: if you compute the RSS between your model and your training data, it is close to 0. In contrast, the variance of your model is high, because the model is extremely sensitive and wiggly. As pointed out above, a random reshuffling of your training set is likely to change the model dramatically. A 10-NN model would be more robust in such cases, but it could be too stiff. Which k to choose depends on your data set, and that choice is exactly the Bias-Variance-Tradeoff this question is about.
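To make the example concrete, here is a minimal hand-rolled k-NN sketch in NumPy; the helper `knn_classify` and the coordinates/distances are invented purely to reproduce the red, blue, blue, blue neighborhood of p:

```python
import numpy as np

def knn_classify(X_train, y_train, x_query, k):
    """Classify x_query by majority vote among its k nearest training points."""
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]                    # indices of the k closest points
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]                   # majority label

# Toy neighborhood of the query point p: the nearest point is red,
# the next three are blue (distances 1.0, 1.5, 1.6, 1.7 -- illustrative only).
X_train = np.array([[1.0, 0.0],    # red,  distance 1.0
                    [0.0, 1.5],    # blue, distance 1.5
                    [-1.6, 0.0],   # blue, distance 1.6
                    [0.0, -1.7]])  # blue, distance 1.7
y_train = np.array(["red", "blue", "blue", "blue"])
p = np.array([0.0, 0.0])

print(knn_classify(X_train, y_train, p, k=1))  # -> red  (only the nearest point counts)
print(knn_classify(X_train, y_train, p, k=4))  # -> blue (3 blue votes vs 1 red)
```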
When noise is "large", learning is not pointless, but it is "expensive" in some sense. For instance, you know the expression "the house always wins": the odds favor the casino over the gambler. However, the odds can be very close to 1:1, tilted towards the "house" only ever so slightly, e.g. by 0.5%. Hence you may call the series of outcomes very noisy in such cases, yet the casinos make a ton of money in the long run. So the fact that the data is "noisy" does not, by itself, mean that learning will be pointless, useless or unprofitable.
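As a rough sanity check of that intuition, here is a toy simulation of an even-money bet with a hypothetical 0.5% house edge (win probability 0.4975; all numbers are illustrative, not taken from any real game): individual outcomes look like a nearly fair coin flip, yet the long-run total reliably favors the house.

```python
import numpy as np

rng = np.random.default_rng(42)

p_win = 0.4975          # even-money bet, tilted towards the house by 0.5%
n_bets = 1_000_000      # "the long run"

# +1 on a win, -1 on a loss; each individual outcome is extremely noisy.
outcomes = rng.choice([1, -1], size=n_bets, p=[p_win, 1 - p_win])

print("house edge per bet:", 1 - 2 * p_win)                  # 0.005
print("gambler's average result per bet:", outcomes.mean())  # close to -0.005
print("gambler's total loss:", -outcomes.sum())              # the house's long-run profit
```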
Best Answer
Fewer neighbors usually mean closer neighbors (unless there are multiple close neighbors at equal distance from the point of interest $x_0$). Modelling $x_0$ as a function of only the few closest neighbors, i.e. the most similar data points, allows for high flexibility (utilizing the features of the closest data points but not the ones farther apart) and thus low bias but high variance. Including more neighbors results in less flexibility (higher smoothness, utilizing the features of not only the closest data points but also the ones farther apart) and thus higher bias but lower variance.
Take an extreme example: I can model you as being equal to your twin brother, or to the single person most similar to you in the whole world ($k=1$). This is highly flexible (low bias), but relying on a single data point is very risky (high variance). Or I can model you as the average (in regression) or mode (in classification) of all the people on the planet ($k=N$). This is highly inflexible (high bias) but very robust (low variance).
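A small toy simulation along these lines (synthetic 1-D regression data, arbitrary parameters, and a hand-rolled `knn_regress` helper introduced just for illustration) contrasts $k=1$ with $k=N$ by repeatedly redrawing the training set and measuring the bias and variance of the prediction at a fixed query point:

```python
import numpy as np

rng = np.random.default_rng(7)

def knn_regress(X_train, y_train, x_query, k):
    """k-NN regression: average the targets of the k nearest training points."""
    nearest = np.argsort(np.abs(X_train - x_query))[:k]
    return y_train[nearest].mean()

x0 = 0.9  # query point where the true function differs from the global mean
preds_k1, preds_kN = [], []
for _ in range(200):
    # redraw a noisy training set each time: y = x + noise (illustrative only)
    X = rng.uniform(0.0, 1.0, size=50)
    y = X + rng.normal(scale=0.2, size=50)
    preds_k1.append(knn_regress(X, y, x0, k=1))
    preds_kN.append(knn_regress(X, y, x0, k=len(X)))  # k = N: just the global average

true_value = x0  # the noise has mean zero, so E[y | x0] = x0
for name, preds in (("k=1", np.array(preds_k1)), ("k=N", np.array(preds_kN))):
    print(name,
          " bias:", round(preds.mean() - true_value, 3),
          " variance:", round(preds.var(), 4))
```

With $k=N$ the prediction is just the global mean of each redrawn sample, so it barely moves between draws (low variance) but sits far from the true value at $x_0$ (high bias); with $k=1$ it is nearly unbiased but jumps around with every new sample.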