This is a crosspost of this question from Stack Overflow. It received no answers there because "it is a math question". I don't know how to migrate a question, so I am crossposting it here.
There are so many questions about distance distributions and the curse of dimensionality that I did not read them all, but those I did read did not answer my question.
I want to create the distance distribution of uniformly distributed points in different dimensions, to ultimately visualize the curse of dimensionality. My understanding is that the distribution goes from a flat curve in low dimensions to a sharp spike in higher dimensions. They look like this:
The mean increases and the variance decreases. Don't mind the numbers; this is just a sketch (drawn with Gaussian curves).
I have seen corresponding figures in papers. (I would show and cite them, but I am not sure about copyright.) I want to recreate those figures. But when I generate random points, calculate the pairwise distances, and put them into a histogram, it looks like this:
The first one is the plain histogram of distances. In the second one, the bins are scaled by the maximum possible distance for the respective dimension. In the third one I overlaid the curves by subtracting the value of the lowest non-empty bin.
The first one clearly shows the increasing means, but neither increasing height nor decreasing variance. In the second you can see the concentration of the distances, but not the heights or the means. This is not what I saw in those papers. The third indicates that the curves all have the same shape. Again, not what I expected. This is the code I used to create the second figure:
import math

import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import distance

if __name__ == '__main__':
    records = 1000
    binwidth = 0.01
    precision = 10
    fig, (ax1, ax2, ax3) = plt.subplots(3, 1)
    fig.set_size_inches(6, 10)
    for dimensions in [2, 10, 100, 1000]:
        data = np.random.uniform(low=0, high=1, size=(records, dimensions))
        data = np.around(data, decimals=precision)
        distlist = distance.pdist(data)  # all pairwise Euclidean distances
        variance = np.var(distlist)
        print('dim {dim}: var={var}'.format(dim=dimensions, var=variance))
        maxdist = math.sqrt(dimensions)  # diagonal of the unit cube [0, 1]^dimensions
        bins = [round(x * binwidth, 2) for x in range(round(maxdist / binwidth) + 2)]
        histo = np.histogram(distlist, bins)
        harr = np.stack((histo[1][1:], histo[0]), axis=-1)  # (right bin edge, count) pairs
        mask = harr[:, 1] > 0  # drop empty bins
        masked = harr[mask]
        ax1.plot(masked[:, 0], masked[:, 1], label='dim {dim}'.format(dim=dimensions))
        ax2.plot(masked[:, 0] / maxdist, masked[:, 1], label='dim {dim}'.format(dim=dimensions))
        minbin = np.min(masked[:, 0])  # renamed to avoid shadowing the built-in min
        ax3.plot(masked[:, 0] - minbin, masked[:, 1], label='dim {dim}'.format(dim=dimensions))
    ax1.set_title('histogram')
    ax1.legend()
    ax2.set_title('histogram, bins divided by max distance')
    ax2.legend()
    ax3.set_title('histogram, bins shifted by lowest non-empty bin')
    ax3.legend()
    plt.show()
I see two points where I may be wrong:
- my understanding of distance distributions,
- the code does not do what it is intended to do.
How do I draw a histogram with bins of width 0.01 showing the number of distances between pairs of uniformly distributed points in the unit cube?
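A minimal sketch of this binning (the dimension, sample size, and seed here are arbitrary choices of mine): pass explicit bin edges of width 0.01 to np.histogram.

```python
import numpy as np
from scipy.spatial import distance

d = 10                                            # arbitrary example dimension
pts = np.random.default_rng(0).uniform(size=(1000, d))
dists = distance.pdist(pts)                       # all pairwise Euclidean distances
edges = np.arange(0.0, np.sqrt(d) + 0.01, 0.01)   # width 0.01, up to the diagonal sqrt(d)
counts, edges = np.histogram(dists, bins=edges)
```

The same `counts` can then be fed to a bar or line plot.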
EDIT
OK, thanks to @Henry it seems that what I did was correct, but not what I should have done to produce the results I wanted. What I want is to produce the following figures myself.
The image is taken from
Pestov, Vladimir (2007): Intrinsic dimension of a dataset: what properties does one expect? In: Proceedings of the 2007 IEEE International Joint Conference on Neural Networks.
Fig. 14 of
Chávez, Edgar; Navarro, Gonzalo; Baeza-Yates, Ricardo; Marroquín, José Luis (2001): Searching in metric spaces. In ACM Computing Surveys 33 (3), pp. 273–321. DOI: 10.1145/502807.502808.
is a similar one.
The text does not describe how to create them. How can I do this?
Best Answer
Does the following work? The code samples $n=1000$ points on $[0,1]^d$ for $d=2,5,10,50,100,200,1000,2000,10000$. For each $d$, we compute the pairwise Euclidean distances between the $n$ points and plot a histogram. As the dimension increases, we see that the distribution of the distances becomes more and more concentrated. This is bad because it means that all the points have similar distances to each other, and so we cannot really distinguish the points from one another (say, in a classification problem).
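The code referred to above is not included in the answer; a minimal sketch of what it might look like (the seed, bin count, and figure layout are my own choices, and I stop at $d=1000$ to keep the runtime short) is:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen; remove this line for interactive use
import matplotlib.pyplot as plt
from scipy.spatial import distance

rng = np.random.default_rng(42)
n = 1000
dims = [2, 5, 10, 50, 100, 200, 1000]  # extend with 2000 and 10000 if runtime allows

fig, axes = plt.subplots(len(dims), 1, figsize=(6, 2 * len(dims)))
stats = {}
for ax, d in zip(axes, dims):
    pts = rng.uniform(size=(n, d))      # n points in the unit cube [0, 1]^d
    dists = distance.pdist(pts)         # all n*(n-1)/2 pairwise Euclidean distances
    stats[d] = (dists.mean(), dists.std())
    ax.hist(dists, bins=100)
    ax.set_xlim(0, np.sqrt(d))          # max possible distance in [0, 1]^d is sqrt(d)
    ax.set_title(f"d = {d}")
fig.tight_layout()
fig.savefig("distance_histograms.png")
```

With this layout the x-axis of each panel runs from $0$ to $\sqrt{d}$, so the sharpening spike is visible even though the mean moves to the right.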
Note: the Euclidean distance between two points of $[0,1]^d$ has minimum value zero and maximum value $\sqrt{d}$ (so $\sqrt{2}$ for $d=2$), which is why the x-axis range changes from plot to plot. You can change the x-axis limits to suit your needs. You can also play around with bin widths by changing the "bins" parameter of the histogram function.
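The concentration can also be checked numerically without any plotting: for i.i.d. uniform points in $[0,1]^d$, the mean pairwise distance grows roughly like $\sqrt{d/6}$ (a large-$d$ approximation), while the absolute standard deviation stays roughly constant, so only the relative spread shrinks. A quick sketch (sample size and seed are arbitrary choices of mine):

```python
import numpy as np
from scipy.spatial import distance

rng = np.random.default_rng(1)
summary = {}
for d in [2, 10, 100, 1000]:
    dists = distance.pdist(rng.uniform(size=(500, d)))  # 500 points in [0, 1]^d
    summary[d] = (dists.mean(), dists.std())
    print(f"d={d:4d}  mean={dists.mean():7.3f}  sqrt(d/6)={np.sqrt(d / 6):7.3f}  "
          f"std={dists.std():6.3f}  relative std={dists.std() / dists.mean():.4f}")
```

This also explains the third plot in the question: shifted by their means, the histograms essentially coincide, because the absolute spread barely changes with $d$.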