K-Nearest-Neighbour High-Dimensional – Overcoming the Curse of Dimensionality in kNN Classifier

high-dimensional, k-nearest-neighbour, self-study

I am reading Kevin Murphy's book Machine Learning: A Probabilistic Perspective. In the first chapter the author explains the curse of dimensionality, and there is a part which I do not understand. As an example, the author states:

Suppose the inputs are uniformly distributed in a $D$-dimensional unit cube. We estimate the density of class labels by growing a hypercube around $x$ until it contains the desired fraction $f$ of the data points. The expected edge length of this cube is $e_D(f) = f^{\frac{1}{D}}$.

It is the last formula that I cannot get my head around. It seems that if you want to cover, say, 10% of the points, then the edge length should be 0.1 along each dimension? I know my reasoning is wrong, but I cannot understand why.

Best Answer

That is precisely the unexpected behavior of distances in high dimensions. In one dimension you have the unit interval $[0, 1]$, and 10% of the points lie in a segment of length 0.1. But what happens as the dimensionality of the feature space increases?
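
To see where the formula comes from (a step the quoted passage skips): because the points are uniform, the fraction of points falling inside a small cube equals the fraction of the unit cube's volume it occupies, so

$$
f \approx \frac{\operatorname{vol}(\text{small cube})}{\operatorname{vol}(\text{unit cube})} = \frac{e^D}{1^D} = e^D
\quad\Longrightarrow\quad
e_D(f) = f^{1/D}.
$$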

The expression tells you that if you want to capture 10% of the points, you need a cube of edge length $0.1^{1/5} \approx 0.63$ in 5 dimensions, $0.1^{1/10} \approx 0.79$ in 10 dimensions, and $0.1^{1/100} \approx 0.98$ in 100 dimensions.

As you can see, for increasing dimensions you need to look further and further away to capture the same fraction of points. Moreover, it tells you that most of the points end up near the boundary of the cube as the number of dimensions increases, which is counter-intuitive.
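
A quick numerical sketch (Python with NumPy; the snippet and variable names are my own, not from the book or from Murphy's example) that reproduces those edge lengths and checks them against uniformly sampled points:

```python
import numpy as np

rng = np.random.default_rng(0)
f = 0.10        # desired fraction of points
n = 100_000     # sample size for the empirical check

for D in (1, 5, 10, 100):
    e = f ** (1 / D)   # expected edge length e_D(f) = f^(1/D)

    # Empirical check: fraction of uniform points falling inside a
    # sub-cube [0, e]^D of the unit cube [0, 1]^D. (Anchoring the
    # sub-cube at the origin instead of around x does not change the
    # volume argument as long as it stays inside the unit cube.)
    X = rng.uniform(size=(n, D))
    inside = np.all(X < e, axis=1).mean()

    print(f"D={D:3d}  edge length={e:.3f}  empirical fraction={inside:.3f}")
```

For $D = 100$ the sub-cube must have an edge length of about 0.98, yet it still contains only about 10% of the points, which is exactly the behavior described above.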
