Solved – SOM based on a non-Euclidean distance

clustering, python, r, self-organizing-maps, train

Suppose one has trained a SOM on a certain amount of data. Without going through the whole procedure, one can say that the SOM algorithm produces a certain number of prototypes, and new input elements are clustered according to their distance from those prototypes.
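
To make that last step concrete, here is a minimal sketch (with made-up `prototypes` and `new_points` arrays, using plain NumPy and the Euclidean distance) of how new elements get assigned to their nearest prototype:

>>> import numpy as np
>>> prototypes = np.random.rand(9, 5)   # hypothetical trained codebook: 9 prototypes in 5 dimensions
>>> new_points = np.random.rand(4, 5)   # new incoming elements
>>> # distance from every new element to every prototype, shape (4, 9)
>>> distances = np.linalg.norm(new_points[:, None, :] - prototypes[None, :, :], axis=2)
>>> assignments = distances.argmin(axis=1)  # index of the closest prototype for each element
>>> assignments.shape
(4,)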

Two possible packages are:

  1. kohonen::som (R)
  2. somoclu (Python)
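
As a rough, non-authoritative sketch, training with the Python package might look something like the following (the 20x20 map size and the random data are made up for illustration; check the somoclu documentation for the exact API):

>>> import numpy as np
>>> import somoclu
>>> data = np.random.rand(200, 10).astype(np.float32)  # made-up training data
>>> som = somoclu.Somoclu(n_columns=20, n_rows=20)     # illustrative 20x20 map
>>> som.train(data)
>>> # som.codebook then holds the trained prototypes; som.bmus maps each sample to its best matching unit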

It is often pointed out that in a high-dimensional context the Euclidean distance is not the best way to capture the difference between vectors.
Nevertheless, relying on the two packages above, it seems (coincidence?) that there is no way to choose a distance other than the Euclidean one when training the models.

Is there a reason why the Euclidean distance could be the best (or the only possible) choice for training a self-organizing map?

Best Answer

This problem is not specific to the SOM; it's a general one. In high dimensions, small changes in the components can cause big differences in distances, and it shows up whenever you measure Euclidean distance in a high-dimensional space. Say you have two vectors, each of dimension 1000. In the first vector all elements are equal to 1, and in the second all elements are equal to 0.99, so every element of the second vector is reduced by 1% compared to the first. These vectors should be considered very close to each other when an algorithm tries to capture relations in the data. But here is what happens when you compute the Euclidean distance.

>>> import numpy as np
>>> x = np.ones(1000)
>>> y = np.ones(1000) * 0.99
>>> x[:10]
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
>>> y[:10]
array([ 0.99,  0.99,  0.99,  0.99,  0.99,  0.99,  0.99,  0.99,  0.99,  0.99])
>>>
>>> from sklearn import metrics
>>> # pairwise metrics expect 2D arrays of shape (n_samples, n_features), so reshape the vectors
>>> metrics.euclidean_distances(x.reshape(1, -1), y.reshape(1, -1))
array([[ 0.31622777]])
>>>
>>> metrics.pairwise.cosine_distances(x.reshape(1, -1), y.reshape(1, -1))
array([[ -3.55271368e-15]])

As you can see, the Euclidean distance suggests that the vectors are not that close to each other. For comparison, I've also added the cosine distance. The cosine distance is equal to zero (in the example above I got $-3 \cdot 10^{-15}$ because of floating-point error), because the two vectors have the same direction and the angle between them is zero.
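
These values can also be checked by hand:

$$d_{\text{euclidean}}(x, y) = \sqrt{\sum_{i=1}^{1000} (1 - 0.99)^2} = \sqrt{1000 \cdot 0.01^2} = \sqrt{0.1} \approx 0.3162,$$

$$d_{\text{cosine}}(x, y) = 1 - \frac{x \cdot y}{\|x\|\,\|y\|} = 1 - \frac{1000 \cdot 0.99}{\sqrt{1000} \cdot 0.99\sqrt{1000}} = 1 - 1 = 0.$$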