Solved – Modern machine learning and the bias-variance trade-off

bias-variance tradeoff, interpolation, machine learning

I stumbled upon the paper Reconciling modern machine learning practice
and the bias-variance trade-off
and do not completely understand how the authors justify the double descent risk curve (see below) described in their paper.

[Figure: the double descent risk curve from the paper, showing test risk as a function of function class capacity.]

In the introduction they say:

By considering larger function classes, which contain more candidate
predictors compatible with the data, we are able to find interpolating
functions that have smaller norm and are thus "simpler". Thus
increasing function class capacity improves performance of classifiers.

From this I can understand why the test risk decreases as a function of the function class capacity.

What I don't understand with this justification, however, is why the test risk first increases up to the interpolation point and only then decreases again. And why does the interpolation point occur exactly where the number of data points $n$ equals the number of function parameters $N$?

I would be happy if someone could help me out here.

Best Answer

The main point of Belkin's double descent is that, at the interpolation threshold, i.e. the smallest model capacity at which the training data can be fit exactly, the set of interpolating solutions is very constrained. The model has to "stretch" to fit the data exactly with such limited capacity.
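To make that concrete, here is a minimal sketch (a toy setup of my own, not an experiment from the paper) with a random-feature linear model whose width $N$ equals the number of training points $n$: the interpolating solution is essentially unique, and its coefficient norm is often very large.

```python
import numpy as np

rng = np.random.default_rng(0)

n = 20                                    # number of training points
x = rng.uniform(-1, 1, size=n)
y = np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=n)

N = n                                     # width exactly at the interpolation threshold
w = rng.normal(size=N)                    # hypothetical random Fourier feature frequencies
b = rng.uniform(0, 2 * np.pi, size=N)
Phi = np.cos(x[:, None] * w[None, :] + b[None, :])   # n x N feature matrix

theta = np.linalg.solve(Phi, y)           # the (generically unique) interpolating solution
print("max train residual:", np.abs(Phi @ theta - y).max())   # ~0: training data is interpolated
print("coefficient norm  :", np.linalg.norm(theta))           # often very large at N = n
```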

When you increase capacity beyond that threshold, the space of interpolating solutions opens up, allowing optimization to reach lower-norm interpolating solutions. These tend to generalize better, and that is why you get the second descent in test risk.
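Continuing the same toy setup, the sketch below (again my own illustration, not the paper's experiments) sweeps the width $N$ past $n$ and takes the minimum-norm least-squares fit at each width. Typically the test error peaks near $N = n$ and falls again as $N$ grows, while the norm of the interpolating solution shrinks; with a single random seed the curve can be noisy.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(m):
    x = rng.uniform(-1, 1, size=m)
    return x, np.sin(2 * np.pi * x) + 0.3 * rng.normal(size=m)

def features(x, w, b):
    return np.cos(x[:, None] * w[None, :] + b[None, :])   # random Fourier features

n = 20
x_tr, y_tr = make_data(n)
x_te, y_te = make_data(500)

for N in [5, 10, 15, 20, 40, 100, 500]:                   # sweep capacity through N = n = 20
    w = rng.normal(size=N)
    b = rng.uniform(0, 2 * np.pi, size=N)
    Phi_tr, Phi_te = features(x_tr, w, b), features(x_te, w, b)
    # lstsq returns the minimum-norm solution when the system is underdetermined (N > n)
    theta, *_ = np.linalg.lstsq(Phi_tr, y_tr, rcond=None)
    test_mse = np.mean((Phi_te @ theta - y_te) ** 2)
    print(f"N={N:4d}  test MSE={test_mse:10.3f}  ||theta||={np.linalg.norm(theta):10.2f}")
```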
