Machine Learning – Understanding the Curse of Dimensionality

high-dimensional, machine-learning

I understand how the curse of dimensionality works when most features are irrelevant, as explained in the highly cited article A Few Useful Things to Know About Machine Learning, but I get stuck on the following illustration:

This is because in high dimensions all examples look alike. Suppose, for instance, that examples are laid out on a regular grid, and consider a test example $x_t$. If the grid is $d$-dimensional, $x_t$'s $2d$ nearest examples are all at the same distance from it. So as the dimensionality increases, more and more examples become nearest neighbors of $x_t$, until the choice of nearest neighbor (and therefore of class) is effectively random.

This passage is unintuitive to me. In particular, why is it that "if the grid is $d$-dimensional, $x_t$'s $2d$ nearest examples are all at the same distance from it"?

To make it concrete, consider two points ($A$ and $B$) in a 2D plane:

[Figure: points $A$ and $B$ plotted in the plane, drawn with GeoGebra]

Suppose that $(0, 0)$ is the test example; we can see that $A$ is closer to it than $B$. I wonder whether $B$ would be as close to the test example as $A$ in some higher-dimensional space. If so, how? If not, how would all samples look alike?
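As a quick sanity check (the coordinates of $A$ and $B$ below are made up, since they come from my figure), padding both points with zeros to embed them in higher dimensions leaves their distances to the origin unchanged, so I don't see how $B$ could catch up:

```python
# Hypothetical coordinates for A and B (the original figure is not shown).
# Embedding both points into d dimensions by padding with zeros does not
# change their Euclidean distances to the test example at the origin.
import math

def dist_to_origin(point, d):
    padded = list(point) + [0.0] * (d - len(point))  # embed into d dimensions
    return math.sqrt(sum(c * c for c in padded))

A, B = (1.0, 0.0), (3.0, 2.0)
for d in (2, 5, 100):
    print(d, dist_to_origin(A, d), dist_to_origin(B, d))
# A stays at distance 1.0 and B at ~3.61 in every dimension.
```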

Best Answer

Let's look at the first few dimensions.

  • For $d=1$, if examples are laid out on a regular grid, this just means that they are at equal distances on a straight line, e.g., at the integers. We can assume that our test example $x_t$ is at the origin, $x_t=0$. There are two nearest neighbors with equal distance $1$, namely the points at $1$ and $-1$.

  • For $d=2$, we have a plane, and the regular grid could consist of all the two-dimensional integer points. For a test example again (without loss of generality) at the origin, there are four nearest neighbors, again all with distance $1$: $(0,1)$, $(-1,0)$, $(0,-1)$ and $(1,0)$.

  • For $d=3$, we have a regular grid in three-dimensional space. Our test example at the origin now has six nearest neighbors, all at distance $1$.

In general, since we have $d$ dimensions and can assume that our regular grid just consists of the $d$-dimensional integer points, we can take the test example at the origin, and then we can find all $2d$ nearest neighbors by choosing one of the $d$ dimensions and setting that coordinate to either $1$ or $-1$, and leaving all other coordinates at $0$.
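Here is a minimal sketch in Python of this enumeration (my own illustration, not from the original answer): it builds the $2d$ neighbors described above and confirms that they all sit at distance $1$ from the origin.

```python
# Enumerate the 2d nearest integer-grid neighbors of the origin in d
# dimensions and verify that they are all at the same distance 1.
import math

def nearest_grid_neighbors(d):
    """Return the 2d integer-grid points nearest to the origin in d dimensions."""
    neighbors = []
    for axis in range(d):      # choose one of the d coordinates...
        for sign in (1, -1):   # ...and set it to +1 or -1
            point = [0] * d
            point[axis] = sign
            neighbors.append(tuple(point))
    return neighbors

for d in (1, 2, 3, 10):
    nbrs = nearest_grid_neighbors(d)
    dists = {math.sqrt(sum(c * c for c in p)) for p in nbrs}
    print(f"d={d}: {len(nbrs)} nearest neighbors, distances={dists}")
# d=1 -> 2 neighbors, d=2 -> 4, d=3 -> 6, d=10 -> 20; all at distance 1.0.
```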

The problem this illustrates is that if there is no structure in our problem (i.e., our examples lie on a regular grid, with no clusters, perhaps with some noise), then selecting a fixed number $k$ of nearest neighbors may simply amount to picking them at random: there are so many equally near neighbors that they can only be differentiated by noise. The simulation sketched below makes this concrete.
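The following is a small simulation of the tie-breaking-by-noise effect (my own illustration, with an arbitrarily chosen noise level): jitter the $2d$ equidistant neighbors slightly and record which one comes out nearest. Over many trials each of the $2d$ candidates wins about equally often, so the 1-nearest neighbor is effectively chosen at random.

```python
# Jitter the 2d equidistant grid neighbors with small Gaussian noise and see
# which one ends up "nearest" to the test example at the origin.
import math
import random
from collections import Counter

def jittered_winner(d, noise=0.01, rng=random):
    """Index of the neighbor that is nearest to the origin after jittering."""
    best_idx, best_dist = None, float("inf")
    for idx in range(2 * d):
        axis, sign = divmod(idx, 2)
        point = [0.0] * d
        point[axis] = 1.0 if sign == 0 else -1.0
        noisy = [c + rng.gauss(0, noise) for c in point]
        dist = math.sqrt(sum(c * c for c in noisy))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

d = 10
counts = Counter(jittered_winner(d) for _ in range(10_000))
print(counts)  # roughly uniform over the 20 tied neighbors
```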
