Solved – SOM based on a non-Euclidean distance

clustering, python, r, self-organizing-maps, train

Suppose one has trained a SOM on a certain amount of data. Without going through the whole procedure, one can say that the SOM algorithm produces a certain number of prototypes, and new input elements are clustered according to their distance from those prototypes.
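
To make that last step concrete, here is a minimal sketch (with made-up `prototypes` and `new_points` arrays, using plain NumPy and the Euclidean distance) of how new elements get assigned to their nearest prototype:

>>> import numpy as np
>>> prototypes = np.random.rand(9, 5)   # hypothetical trained codebook: 9 prototypes in 5 dimensions
>>> new_points = np.random.rand(4, 5)   # new incoming elements
>>> # distance from every new element to every prototype, shape (4, 9)
>>> distances = np.linalg.norm(new_points[:, None, :] - prototypes[None, :, :], axis=2)
>>> assignments = distances.argmin(axis=1)  # index of the closest prototype for each element
>>> assignments.shape
(4,)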

Two possible packages are:

  1. kohonen::som (R)
  2. somoclu (Python)
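
As a rough, non-authoritative sketch, training with the Python package might look something like the following (the 20x20 map size and the random data are made up for illustration; check the somoclu documentation for the exact API):

>>> import numpy as np
>>> import somoclu
>>> data = np.random.rand(200, 10).astype(np.float32)  # made-up training data
>>> som = somoclu.Somoclu(n_columns=20, n_rows=20)     # illustrative 20x20 map
>>> som.train(data)
>>> # som.codebook then holds the trained prototypes; som.bmus maps each sample to its best matching unit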

It is often pointed out that in a high-dimensional context the Euclidean distance is not the best way to capture the difference between vectors.
Nevertheless, relying on the two packages above, it seems (coincidence?) that there is no way to choose a distance other than the Euclidean one when training the models.

Is there a reason why the Euclidean distance could be the best (or the only possible) choice for training a self-organizing map?

Best Answer

This problem is not specific to the SOM; it's a general one. In high dimensions, small changes in the components can cause big differences in distances, and it shows up whenever you measure Euclidean distance in a high-dimensional space. Say you have two vectors, each of dimension 1000. In the first vector all elements are equal to 1, and in the second all elements are equal to 0.99, so every element of the second vector is reduced by 1% compared to the first. These vectors should be considered very close to each other when an algorithm tries to capture relations in the data. But here is what happens when you compute the Euclidean distance.

>>> import numpy as np
>>> x = np.ones(1000)
>>> y = np.ones(1000) * 0.99
>>> x[:10]
array([ 1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.,  1.])
>>> y[:10]
array([ 0.99,  0.99,  0.99,  0.99,  0.99,  0.99,  0.99,  0.99,  0.99,  0.99])
>>>
>>> from sklearn import metrics
>>> # pairwise metrics expect 2D arrays of shape (n_samples, n_features), so reshape the vectors
>>> metrics.euclidean_distances(x.reshape(1, -1), y.reshape(1, -1))
array([[ 0.31622777]])
>>>
>>> metrics.pairwise.cosine_distances(x.reshape(1, -1), y.reshape(1, -1))
array([[ -3.55271368e-15]])

As you can see, the Euclidean distance suggests that the vectors are not that close to each other. For comparison, I've also added the cosine distance. The cosine distance is equal to zero (in the example above I got $-3 \cdot 10^{-15}$ because of floating-point error), because the two vectors have the same direction and the angle between them is zero.
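
These values can also be checked by hand:

$$d_{\text{euclidean}}(x, y) = \sqrt{\sum_{i=1}^{1000} (1 - 0.99)^2} = \sqrt{1000 \cdot 0.01^2} = \sqrt{0.1} \approx 0.3162,$$

$$d_{\text{cosine}}(x, y) = 1 - \frac{x \cdot y}{\|x\|\,\|y\|} = 1 - \frac{1000 \cdot 0.99}{\sqrt{1000} \cdot 0.99\sqrt{1000}} = 1 - 1 = 0.$$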