Solved – Data standardization vs. normalization for clustering analysis

clustering, machine learning, pca

I'm performing clustering analysis and visualization (hierarchical clustering, PCA, t-SNE, etc.) on a dataset, and I'm a bit confused about how to prepare the data. I understand that the typical options are to standardize, normalize, or log-transform, but it seems there are no hard and fast rules about when to apply one over the other?
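For concreteness, here is a minimal sketch of the three options applied to a single feature. It assumes "standardize" means z-scoring, "normalize" means min-max rescaling to [0, 1] (the term is also used for other things, such as scaling each sample to unit norm), and "log transform" means log(1 + x) on non-negative values. The function names and the `feature` values are made up for illustration.

```python
import math
import statistics

def standardize(values):
    """Z-score: subtract the mean, divide by the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max_normalize(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def log_transform(values):
    """log(1 + x): compresses large values; assumes non-negative input."""
    return [math.log1p(v) for v in values]

# Hypothetical skewed feature with one large value (illustrative data only).
feature = [1.0, 2.0, 3.0, 4.0, 100.0]

z = standardize(feature)         # mean 0, standard deviation 1
mm = min_max_normalize(feature)  # range exactly [0, 1]
lg = log_transform(feature)      # large value pulled toward the rest

for name, out in [("standardized", z), ("min-max", mm), ("log1p", lg)]:
    print(name, [round(v, 3) for v in out])
```

Note that the single large value dominates both the z-score and the min-max result, while the log transform changes the shape of the distribution rather than just its scale, which is one reason the three can lead to different clusterings.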

With standardization and log transformation, my dataset splits into two clusters under a number of different algorithms. One cluster is large and heterogeneous (which is actually interesting, as this is a biological problem and it makes logical sense). However, if I normalize the data, I get three clusters: the heterogeneous cluster splits into two. This could make sense as well, but it would be a stretch, and the clusters are not as clean. What could be causing this? The non-heterogeneous cluster remains the same, which is reassuring. Is it reasonable to conclude that the "instability" of the second cluster is further evidence of the heterogeneity in the dataset?
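One way to make the "which cluster stayed the same, which one split" observation concrete is to compare the label assignments from the two preprocessing runs. The sketch below uses the plain (unadjusted) Rand index plus a simple mapping of each reference cluster to the labels its members received in the other run. The label lists are invented to mimic the situation described, not taken from the actual data, and the helper names are hypothetical.

```python
from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of sample pairs on which two clusterings agree
    (both put the pair together, or both put it apart)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agree / len(pairs)

def split_map(reference, other):
    """For each cluster in `reference`, list the labels its members got in `other`."""
    mapping = {}
    for ref_label, other_label in zip(reference, other):
        mapping.setdefault(ref_label, set()).add(other_label)
    return {k: sorted(v) for k, v in mapping.items()}

# Hypothetical labels for 10 samples (illustrative only).
labels_standardized = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]  # two clusters
labels_normalized = [0, 0, 0, 2, 2, 2, 1, 1, 1, 1]    # cluster 0 split in two

ri = rand_index(labels_standardized, labels_normalized)
splits = split_map(labels_standardized, labels_normalized)

print("Rand index:", ri)
print("Reference cluster -> labels in other run:", splits)
```

Here cluster 1 maps to a single label in both runs, while cluster 0 maps to two, which is the pattern described above. A check like this only quantifies the disagreement; it does not say which preprocessing is the right one.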

Best Answer

There cannot be a general rule on what to do.

Any automatic normalization is usually "wrong". Such methods only happen to work better, most of the time, than not weighting features at all, so people commonly use them, in particular on data they don't understand. But the right way is to weight and scale features such that each has the right, balanced amount of influence on the results. As there is no mathematical way to capture this "right balance" (it's not uniform!), there cannot be an automatic solution. You have to understand your data and scale each feature to give it the desired amount of influence.
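As a rough illustration of what "scale each feature to give it the desired amount of influence" can look like in practice, the sketch below first standardizes each column and then multiplies it by a hand-chosen weight before computing Euclidean distances. The data, the weights, and the helper names are all hypothetical; the weights in particular stand in for domain knowledge that no formula can supply.

```python
import math
import statistics

def standardize_columns(rows):
    """Z-score each column so all features start on a comparable scale."""
    cols = list(zip(*rows))
    means = [statistics.fmean(c) for c in cols]
    stds = [statistics.pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]

def apply_weights(rows, weights):
    """Multiply each feature by its weight; larger weight means more influence
    on any distance-based method (k-means, hierarchical clustering, ...)."""
    return [[v * w for v, w in zip(row, weights)] for row in rows]

# Hypothetical data: two features on very different raw scales.
data = [[1.0, 100.0], [2.0, 300.0], [3.0, 200.0], [4.0, 400.0]]

# Hypothetical domain judgment: the second feature should count half as much.
weights = [1.0, 0.5]

scaled = standardize_columns(data)
weighted = apply_weights(scaled, weights)

d_equal = math.dist(scaled[0], scaled[1])       # equal influence per feature
d_weighted = math.dist(weighted[0], weighted[1])  # second feature down-weighted

print("distance with equal weights:", round(d_equal, 3))
print("distance with chosen weights:", round(d_weighted, 3))
```

Standardization alone corresponds to the special case where every weight is 1, which is exactly the "uniform" balance the answer warns is not necessarily the right one.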