Solved – Self Organizing Map and input normalizing

normalization, self organizing maps, validation

I've been playing around with self organizing maps (SOM) recently.

I tried to implement a simple example. You can see the training implementation function gist here and full contained SOM example here.

There is something strange I'm noticing and I'm not really sure why it's happening. I'm using the iris data set to both train and validate the SOM parameters. The iris data set contains data for 3 different kinds of flowers, i.e. it contains 3 different groups/clusters of data.

Once I have trained the SOM, I should be able to validate, using the trained parameters, that the iris data really does cluster into three groups, one per flower type.

However I'm seeing something strange when validating the SOM parameters. When I normalize the input data and then try to verify/validate the trained SOM parameters using the original non-normalized iris data, the validation data seem to cluster into the same best matching unit (BMU). That is, if I normalize the input for learning and don't normalize the data for validation, I get one cluster classification instead of expected three.

However when I normalize both the input and the validation data set (like I said validation data is the same data set as the one used for training the SOM – iris data set), I do get the expected result – i.e. the data does cluster into 3 [clearly identifiable] clusters.

Equally, when I don't normalize either of the data sets I get the expected validation results.

So my question is: what could cause the first case, i.e. training the model on normalized data and then getting incorrect results when validating the model with the exact same data, but not normalized?

I would expect that normalizing the data to train the SOM model should have no effect on validation i.e. that once you have the model parameters, you should be able to classify non-normalized data.

I'm guessing that this could be because of using the same data set for training and validation?

Best Answer

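A common way to normalize the input to a SOM is to scale each feature to unit variance. The feature's mean is subtracted from each observation and the result is divided by the feature's standard deviation, which gives every feature zero mean and unit variance. (The result is not confined to the range [0, 1]; rescaling to that range is min-max scaling, a different scheme.)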

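If you normalize the training set, but not the validation set, then you are likely comparing observations on different scales. The trained weight vectors live on the normalized scale, so raw observations, whose values are much larger, can all end up closest to the same unit. I'd suggest using the means and standard deviations of the training set to normalize the validation set.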
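As a rough illustration of the point above, here is a minimal sketch in plain Python. The measurements are made-up, iris-like toy values (not the real data set), and the three "units" are hypothetical stand-ins for trained SOM weights: they are simply the group centroids in standardized space, not the output of a real training run. The helper names `fit_scaler`, `transform` and `bmu` are mine.

```python
import math

def fit_scaler(rows):
    """Per-feature means and (population) standard deviations of the training rows."""
    n = len(rows)
    columns = list(zip(*rows))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n)
            for col, m in zip(columns, means)]
    return means, stds

def transform(rows, means, stds):
    """Standardize rows with statistics that were fitted on the training set."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def bmu(units, x):
    """Index of the unit whose weight vector is closest to x (squared Euclidean)."""
    return min(range(len(units)),
               key=lambda j: sum((w - v) ** 2 for w, v in zip(units[j], x)))

# Toy, iris-like values (two features, three groups of two rows each).
train = [[5.0, 1.4], [5.1, 1.5], [6.0, 4.5], [6.1, 4.7], [6.9, 5.8], [7.1, 6.0]]
validation = train  # same data used for validation, as in the question

means, stds = fit_scaler(train)
train_scaled = transform(train, means, stds)

# Assumption: a SOM trained on the standardized data ended up with one unit
# near each group; here that is faked with the group centroids.
units = [[(a + b) / 2 for a, b in zip(train_scaled[i], train_scaled[i + 1])]
         for i in (0, 2, 4)]

bmus_raw = [bmu(units, x) for x in validation]
bmus_scaled = [bmu(units, x) for x in transform(validation, means, stds)]

print(bmus_raw)     # [2, 2, 2, 2, 2, 2]  raw values all land on one unit
print(bmus_scaled)  # [0, 0, 1, 1, 2, 2]  training statistics restore the groups
```

The raw rows all sit far outside the region the units occupy, so they share the unit with the largest weights; reusing the training means and standard deviations puts the validation rows back on the scale the units were fitted on.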

Related Solutions

Solved – Gaussian neighborhood function and non linear learning rate for self-organizing map in R

One key difference is that kohonen uses batch learning while som uses incremental learning. This can be the cause of the differences you are seeing.
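To make the distinction concrete, here is a generic sketch of one epoch of each scheme on a 1-D map with scalar inputs. It is not the code of either R package; the function names, the Gaussian neighborhood and the fixed learning rate and radius are all assumptions for illustration (real implementations usually decay both over time).

```python
import math

def neighborhood(j, bmu, radius):
    """Gaussian neighborhood between unit j and the BMU on a 1-D map."""
    return math.exp(-((j - bmu) ** 2) / (2 * radius ** 2))

def bmu_index(weights, x):
    """Index of the unit whose weight is closest to the scalar sample x."""
    return min(range(len(weights)), key=lambda j: (weights[j] - x) ** 2)

def online_epoch(weights, samples, lr, radius):
    """Incremental learning: the weights move after every single sample."""
    w = list(weights)
    for x in samples:
        b = bmu_index(w, x)
        w = [wj + lr * neighborhood(j, b, radius) * (x - wj)
             for j, wj in enumerate(w)]
    return w

def batch_epoch(weights, samples, radius):
    """Batch learning: BMUs are found with the weights held fixed, then every
    unit is replaced by a neighborhood-weighted mean of all samples."""
    bmus = [bmu_index(weights, x) for x in samples]
    new_weights = []
    for j in range(len(weights)):
        h = [neighborhood(j, b, radius) for b in bmus]
        new_weights.append(sum(hi * x for hi, x in zip(h, samples)) / sum(h))
    return new_weights

weights = [0.0, 0.5, 1.0]
samples = [0.1, 0.9, 0.2, 0.8]
online = online_epoch(weights, samples, lr=0.5, radius=1.0)
batch = batch_epoch(weights, samples, radius=1.0)
print(online)
print(batch)
```

Starting from the same weights and data, the two updates generally give different maps, and the incremental one also depends on the order in which samples are presented.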

Solved – How to use k-fold cross validation in naive bayes classifier

You are very close to understanding k-fold cross-validation. To answer your questions in turn:

1. So to use k-fold cross validation the required data is the labeled data?

Yes, you must have some 'known' result in order for your model to be trained on the data. You are building a model, I assume, to predict some sort of outcome either regression or classification. In order to do so, a model must be built on data to explain some known result.

2. How about non labeled data?

For k-fold cross-validation, you split your labeled data into k groups (e.g. 10). You then select one of those groups and use the model (built from the remaining groups, your training data) to predict the 'labels' of this testing group. Once you have your model built and cross-validated, it can be used to predict data that don't currently have labels. Cross-validation is a means of detecting overfitting, since the model is always scored on data it was not trained on.

As a last clarification, you aren't only using 1 of the 10 groups. Let's say you had 100 samples. You split them into groups 1-10, 11-20, ... 91-100. You would first train on all the groups from 11-100 and predict the test group 1-10. Then you would repeat the same analysis with 1-10 and 21-100 as the training groups and 11-20 as the testing group, and so forth. The results are typically averaged at the end.
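The rotation described above can be sketched in a few lines of Python. `k_fold_splits` is a hypothetical helper name, and it uses contiguous, unshuffled folds purely to mirror the 1-10, 11-20, ... grouping in the text.

```python
def k_fold_splits(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

splits = list(k_fold_splits(100, 10))
first_train, first_test = splits[0]
# Indices are 0-based, so 0..9 here corresponds to samples 1-10 in the text.
print(first_test)
print(first_train[0], first_train[-1])  # 10 99
```

Every sample lands in the testing group exactly once and in the training groups the other k - 1 times.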

As a simple example say I have the following abbreviated data (binary classification):

Label    Variable
A        0.354
A        0.487
A        0.384
A        0.395
A        0.436
B        0.365
B        0.318
B        0.327
B        0.381
B        0.355

Let's say I want to do 5-fold cross-validation on this, so that each testing group holds 2 of the 10 samples (10-fold would be exactly Leave-One-Out cross-validation in this case).

My first testing group will be:

A        0.354
A        0.487

My training set is the remaining data. See how the labels are present in both groups?

A        0.384
A        0.395
A        0.436
B        0.365
B        0.318
B        0.327
B        0.381
B        0.355

Please note that it is also best practice to randomize the grouping; the ordered split here is purely for demonstration.

Then you fit your model to the training set, using the variable(s) to best explain the labels (class A or B). The model that has been fit to this training set is then used to predict the testing dataset: you remove the labels from the testing set, predict them using the trained model, and compare the predicted labels to the actual labels. This is repeated for every fold and the results are averaged.
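Here is a minimal sketch of that loop on the ten rows from the table above, with testing groups of two. The classifier is a swapped-in nearest-class-mean rule, chosen only because it fits in a few lines; it is not naive Bayes, and the function names are hypothetical.

```python
from statistics import mean

data = [("A", 0.354), ("A", 0.487), ("A", 0.384), ("A", 0.395), ("A", 0.436),
        ("B", 0.365), ("B", 0.318), ("B", 0.327), ("B", 0.381), ("B", 0.355)]

def fit_class_means(train):
    """'Train' the toy model: one mean of the variable per label."""
    labels = sorted({label for label, _ in train})
    return {label: mean(x for l, x in train if l == label) for label in labels}

def predict(class_means, x):
    """Predict the label whose class mean is closest to x."""
    return min(class_means, key=lambda label: abs(x - class_means[label]))

def cross_validate(rows, fold_size):
    """Return one accuracy per fold, using contiguous (unshuffled) folds."""
    accuracies = []
    for start in range(0, len(rows), fold_size):
        test = rows[start:start + fold_size]
        train = rows[:start] + rows[start + fold_size:]
        model = fit_class_means(train)
        # Only the variable is passed to the model; the labels are held back.
        predicted = [predict(model, x) for _, x in test]
        correct = sum(p == label for p, (label, _) in zip(predicted, test))
        accuracies.append(correct / len(test))
    return accuracies

fold_accuracies = cross_validate(data, fold_size=2)
print(fold_accuracies)
print(mean(fold_accuracies))  # the averaged result reported at the end
```

In practice you would shuffle `data` before splitting, as noted above; the ordered folds are kept here so the first testing group matches the one shown in the text.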

Once everything is completed and you have your wonderfully cross-validated model, you can use it to predict unlabeled data and have some sort of measure of confidence in your results.

Extended for Parameter Tuning

Let's say you are tuning a partial least squares (PLS) model (it doesn't matter if you don't know what this is for demonstration purposes). I would like to determine how many components (the tuning parameter) I should have in the model. I would like to test 2, 3, 4, and 5 components and see how many I should use to maximize my predictive accuracy without overfitting the model. I would conduct the entire cross-validation series for each component number. For each component number, the accuracies from the individual folds would be averaged to give the overall predictive accuracy for that setting.

Assuming classification accuracy is your metric, let's say these are my results (completely made up here):

2 components: 70%
3 components: 82%
4 components: 78%
5 components: 74%

Clearly, I would then choose 3 components for my model, which has now been cross-validated to avoid overfitting and maximize predictive accuracy. I can then use this optimized model to predict a new dataset where I don't know the labels.
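The selection step can be sketched as a small helper. `select_by_cv` and `cv_score` are hypothetical names; `cv_score` stands for whatever function runs the full cross-validation for one setting and returns its averaged accuracy. Here it just looks up the made-up numbers from above instead of fitting a PLS model.

```python
def select_by_cv(candidates, cv_score):
    """Score every candidate setting and return (best_candidate, all_scores)."""
    scores = {candidate: cv_score(candidate) for candidate in candidates}
    best = max(scores, key=scores.get)
    return best, scores

# Made-up averaged accuracies from the text, standing in for real CV runs.
made_up_accuracy = {2: 0.70, 3: 0.82, 4: 0.78, 5: 0.74}

best_n, scores = select_by_cv([2, 3, 4, 5], lambda n: made_up_accuracy[n])
print(best_n)  # 3
```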
