Solved – Time complexity of leave-one-out cross validation for nonparametric regression

cross-validation, nonparametric-regression

From Artificial Intelligence: A Modern Approach:

Most nonparametric models have the advantage that it is easy to do leave-one-out cross-validation without having to recompute everything. With a k-nearest-neighbors model, for instance, when given a test example (x, y) we retrieve the k nearest neighbors once, compute the per-example loss L(y, h(x)) from them, and record that as the leave-one-out result for every example that is not one of the neighbors. Then we retrieve the k + 1 nearest neighbors and record distinct results for leaving out each of the k neighbors. With N examples the whole process is O(k), not O(kN).

  • Why would we record that as the leave-one-out result for every example that is not one of the neighbors? Since we only retrieved one test example (x, y), shouldn't we only record the result for that example?

  • Why would we retrieve the k + 1 nearest neighbors afterwards? Isn't our goal to verify a k-nearest neighbors model?

  • Why is the whole process O(k), not O(kN)? Doesn't LOOCV mean that we validate every single one of the N examples, so our complexity can't really drop below O(N)?

Best Answer

I think the key point is:

Doesn't LOOCV mean that we validate every single one of the N examples

Yes.

so our complexity can't really drop below O(N)?

No, this doesn't necessarily follow. Completely recalculating N times is the worst case, which can always be achieved.

For some classifiers (and depending on the actual classifier under consideration) it is possible to regroup the calculations done in those N training procedures with N - 1 cases each, so that certain intermediate results can be reused. The idea is to show that the same result you get from the elaborate N-fold recalculation is reached by a shorter algorithm.
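To make the "elaborate N-fold recalculation" concrete, here is a brute-force reference sketch (my own, not from the book or this answer) for k-NN regression, under one reading of the quoted passage: the leave-one-out score for leaving out case j is the total squared error of every other case under the model fit without j. All names in it are purely illustrative.

```python
# Brute-force baseline: N separate "refits", under one reading of the quoted
# passage (LOO score for leaving out case j = total squared error of all other
# cases under the k-NN regressor fit on the data without j).
import numpy as np

def knn_predict(X_train, y_train, x, k):
    """Plain k-NN regression: average the targets of the k nearest training points."""
    d = np.linalg.norm(X_train - x, axis=1)
    return y_train[np.argsort(d)[:k]].mean()

def knn_loo_losses_naive(X, y, k):
    """For each left-out case j, rescore all other cases from scratch."""
    N = len(y)
    loss = np.zeros(N)
    for j in range(N):                            # leave out case j
        keep = np.arange(N) != j
        Xj, yj = X[keep], y[keep]
        for i in range(N):                        # rescore every remaining case
            if i == j:
                continue
            mask = np.ones(N - 1, dtype=bool)
            mask[i if i < j else i - 1] = False   # the test case never votes for itself
            pred = knn_predict(Xj[mask], yj[mask], X[i], k)
            loss[j] += (y[i] - pred) ** 2
    return loss
```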

For kNN, the savings are particularly large, so that you can get the LOO predictions almost for free. The key idea here is to look at which predictions are influenced by leaving out one particular case. If you do this, you'll realize that most cases aren't influenced.
I suggest drawing an example case and visualizing step by step what your text says. This will answer your 1st and 2nd questions.
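Here is a sketch of the regrouped version under the same reading: one neighbor search per case and then only k corrections, because a case's prediction changes only when the left-out case is among its k nearest neighbors, in which case the (k + 1)-th neighbor steps in. It reproduces the scores of the brute-force sketch above, up to ties in the distances.

```python
# Regrouped computation: same scores as knn_loo_losses_naive above (up to
# distance ties), but only one neighbor search plus O(k) corrections per case.
import numpy as np

def knn_loo_losses_fast(X, y, k):
    N = len(y)
    loo_loss = np.zeros(N)
    for i in range(N):
        d = np.linalg.norm(X - X[i], axis=1)
        d[i] = np.inf                          # a case is never its own neighbor
        order = np.argsort(d)
        nbrs, backup = order[:k], order[k]     # k nearest others plus the stand-in
        base_pred = y[nbrs].mean()
        base_loss = (y[i] - base_pred) ** 2
        # Leaving out a case outside nbrs does not change case i's prediction,
        # so i contributes base_loss to every LOO score except its own ...
        loo_loss += base_loss
        loo_loss[i] -= base_loss
        # ... and only the k scores of i's own neighbors need a correction:
        # the left-out neighbor's vote is replaced by the backup's.
        for j in nbrs:
            pred_wo_j = (y[nbrs].sum() - y[j] + y[backup]) / k
            loo_loss[j] += (y[i] - pred_wo_j) ** 2 - base_loss
    return loo_loss
```

The cross-validation bookkeeping itself is O(k) per case; what remains is the single neighbor search per case that you would need anyway just to make predictions.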

BTW: without having thought about this in detail, I don't see that non-parametric is a good rule of thumb here. Yes, kNN is non-parametric, but it is also a local method (predictions depend only on cases close by in sample space) - and this is what the suggested regrouping strategy uses. A model that uses the median instead of the mean, OTOH, is far less localized, and therefore the presented strategy doesn't help. And while you can easily update the median for the left-out case, updating the mean isn't that much more complicated, either.
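To back up that last sentence, a small sketch (again my own) of cheap leave-one-out updates for both statistics: the mean needs only the precomputed sum, the median needs one sort and then an order-statistic lookup per case.

```python
# Leave-one-out updates of mean and median without refitting N times.
import numpy as np

def loo_means(x):
    """loo_means[i] = mean of x with x[i] removed; O(N) overall."""
    x = np.asarray(x, dtype=float)
    return (x.sum() - x) / (len(x) - 1)

def loo_medians(x):
    """loo_medians[i] = median of x with x[i] removed; one sort, O(1) per case."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = np.sort(x)
    rank = np.empty(n, dtype=int)
    rank[np.argsort(x, kind="stable")] = np.arange(n)   # position of each x[i] in s
    m = n - 1                                           # sample size after leaving one out
    if m % 2 == 1:
        # the middle remaining value is s[m//2] if x[i] lies above it, else s[m//2 + 1]
        return np.where(rank > m // 2, s[m // 2], s[m // 2 + 1])
    lo = np.where(rank > m // 2 - 1, s[m // 2 - 1], s[m // 2])
    hi = np.where(rank > m // 2, s[m // 2], s[m // 2 + 1])
    return 0.5 * (lo + hi)
```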