Self Organizing Map and input normalizing

Tags: normalization, self-organizing-maps, validation

I've been playing around with self organizing maps (SOM) recently.

I tried to implement a simple example. You can see a gist of the training function here and a fully self-contained SOM example here.

There is something strange I'm noticing and I'm not sure why it's happening. I'm using the iris data set both to train the SOM and to validate the trained parameters. The iris data set contains measurements for 3 different kinds of flowers, i.e. it contains 3 different groups/clusters of data.

Once I have trained the SOM, I should be able to use the trained parameters to verify that the iris data really does cluster into three groups, one per flower type.

However, I'm seeing something strange when validating the SOM parameters. When I normalize the input data for training and then validate the trained SOM using the original non-normalized iris data, all of the validation data map to the same best matching unit (BMU). That is, if I normalize the input for learning and don't normalize the data for validation, I get one cluster instead of the expected three.

However when I normalize both the input and the validation data set (like I said validation data is the same data set as the one used for training the SOM – iris data set), I do get the expected result – i.e. the data does cluster into 3 [clearly identifiable] clusters.

Equally, when I don't normalize either of the data sets I get the expected validation results.

So my question is: what could cause the first case, i.e. training the model on normalized data and getting incorrect results when validating the model with the exact same data, but not normalized?

I would expect that normalizing the data to train the SOM model should have no effect on validation i.e. that once you have the model parameters, you should be able to classify non-normalized data.

I'm guessing that this could be because of using the same data set for training and validation?

Best Answer

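A common way to normalize the inputs to a SOM is to standardize each feature: subtract the feature's mean from each observation and divide the result by the feature's standard deviation. Each feature then has zero mean and unit variance. Note that the standardized values are not confined to the range [0, 1]; that would be min-max scaling.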
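Written out, with notation that is my own rather than the asker's (x_ij is the value of feature j for observation i, and n is the number of training observations), standardization is:

$$
z_{ij} = \frac{x_{ij} - \mu_j}{\sigma_j},
\qquad
\mu_j = \frac{1}{n}\sum_{i=1}^{n} x_{ij},
\qquad
\sigma_j = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(x_{ij} - \mu_j\right)^2}
$$

The SOM weights are learned in the space of the z values, so any sample compared against them later has to be mapped through the same \mu_j and \sigma_j first.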

If you normalize the training set but not the validation set, you are comparing observations on a different scale from the one the SOM weights were learned on, so the distances to the units are no longer meaningful. I'd suggest using the means and standard deviations of the training set to normalize the validation set.
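Here is a minimal sketch of that suggestion in plain Python. It is not taken from the asker's gist: the helper names, the three sample rows, and in particular the three-unit codebook are hypothetical, with the weights picked by hand (not trained) to sit in the standardized space. It only illustrates why raw samples can all land on one BMU while the same samples, scaled with the training statistics, spread out.

```python
from statistics import mean, pstdev

def fit_standardizer(rows):
    """Per-feature means and population standard deviations of the training rows."""
    columns = list(zip(*rows))
    return [mean(c) for c in columns], [pstdev(c) for c in columns]

def standardize(rows, means, stds):
    """Scale rows with previously fitted statistics (assumes no std is zero)."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]

def best_matching_unit(sample, codebook):
    """Index of the unit whose weight vector is closest in squared Euclidean distance."""
    def sq_dist(weights):
        return sum((x - w) ** 2 for x, w in zip(sample, weights))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

# Three iris-style rows: sepal length, sepal width, petal length, petal width.
train = [
    [5.1, 3.5, 1.4, 0.2],
    [7.0, 3.2, 4.7, 1.4],
    [6.3, 3.3, 6.0, 2.5],
]
means, stds = fit_standardizer(train)

# Hypothetical hand-picked weights living in the standardized space,
# standing in for a SOM that was trained on normalized data.
codebook = [[-1.0] * 4, [0.0] * 4, [1.0] * 4]

validation = train  # same data for validation, as in the question

# Raw samples are far from every unit, so one unit wins for all of them.
raw_bmus = [best_matching_unit(row, codebook) for row in validation]

# Scaling with the TRAINING means and stds puts the samples back in the units' space.
scaled = standardize(validation, means, stds)
scaled_bmus = [best_matching_unit(row, codebook) for row in scaled]

print(raw_bmus)     # [2, 2, 2]
print(scaled_bmus)  # [0, 1, 2]
```

The important design choice is that `fit_standardizer` is called once, on the training data, and its output is reused for every later sample. Recomputing the statistics on a different validation set would shift and stretch it differently from the data the weights were learned on.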