Solved – Normalize sample data for clustering

normalization

I have three types of summary score, $a$, $b$ and $c$, for 200 samples:

$S_1, S_2, S_3, \ldots, S_{200}$

$a_{S_1}, a_{S_2}, \ldots, a_{S_{200}}$

$b_{S_1}, b_{S_2}, \ldots, b_{S_{200}}$

$c_{S_1}, c_{S_2}, \ldots, c_{S_{200}}$

Each of these scores is essentially the number of times that some event occurs in the data of each sample. I wish to find groups of these samples by correlation-based clustering (sketched in code below the summaries). However, the scales of the three scores are very different:

Summary of $a$:

    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     2.0    36.0    55.0    52.5    69.0   139.0

Summary of $b$:

    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
     8.0    99.5   285.0   292.7   737.5  2624.0

Summary of $c$:

    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
    40.0   111.0   176.0   300.4   554.5   779.0
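For concreteness, the kind of correlation-based clustering I have in mind looks roughly like this in R (a sketch only; the matrix `X` and the generated counts are invented stand-ins for my real data, and `k` is arbitrary):

# X: 200 x 3 matrix, one row per sample S1..S200, columns = scores a, b, c
set.seed(1)                          # invented example data
X <- cbind(a = rpois(200, 50),
           b = rpois(200, 300),
           c = rpois(200, 300))

# Pearson correlation between samples (rows of X), turned into a distance
d <- as.dist(1 - cor(t(X)))

# Hierarchical clustering on that correlation-based distance
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 4)          # k = 4 is arbitrary here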

Do I need to normalize the scores? If so, how?

NB. I want to make no assumptions about the dependence or independence between these types of events and hence between these summary scores.

UPDATE:
So, I've decided to try clustering with Euclidean distance. To get normalized and transformed data, I'm doing the following (sketched in code below the list):

1. test whether each score fits a normal distribution with a Shapiro-Wilk test
2. if not, apply a Box-Cox transformation using $\lambda$ from a boxcoxfit
3. take the z-score of each score
4. cluster with the Euclidean distance measure
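
In R, that pipeline might look roughly like this (a sketch: boxcoxfit comes from the geoR package, the example matrix `X` is invented, and the hclust settings are only illustrative):

library(geoR)                        # provides boxcoxfit()

# Normalize one score vector following steps 1-3 above
normalize_score <- function(x) {
  if (shapiro.test(x)$p.value < 0.05) {   # step 1: reject normality?
    lambda <- boxcoxfit(x)$lambda         # step 2: estimate lambda
    x <- if (abs(lambda) < 1e-6) log(x) else (x^lambda - 1) / lambda
  }
  as.vector(scale(x))                     # step 3: z-score
}

set.seed(1)                          # invented stand-in data
X <- cbind(a = rpois(200, 50), b = rpois(200, 300), c = rpois(200, 300))
Z <- apply(X, 2, normalize_score)

# Step 4: Euclidean distance + hierarchical clustering
hc <- hclust(dist(Z, method = "euclidean"), method = "ward.D2")
groups <- cutree(hc, k = 4)          # k = 4 is arbitrary here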

Does this seem reasonable?

Best Answer

Clustering in general requires a similarity metric to compute a partitioning of your data. Do you know how to compute the similarity of two vectors $\vec{a}$ and $\vec{b}$? Whether you need normalization will depend mainly on this question. If you don't have such a metric/measure and you want to go with the regular Euclidean distance, normalizing your data -- bringing each variable to zero mean and unit variance -- is recommended: if you don't, the scores with the largest range will dominate the distance computation.
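
To see why, compare distances before and after standardization in R (toy numbers invented for illustration):

# Two toy samples; b has by far the largest range of the three scores
x1 <- c(a = 50, b = 100, c = 200)
x2 <- c(a = 55, b = 600, c = 210)

# Raw Euclidean distance is driven almost entirely by b
sqrt(sum((x1 - x2)^2))               # ~500.1, essentially just the gap in b

# Standardize each variable (in practice over all 200 samples, not two)
Z <- scale(rbind(x1, x2))            # zero mean, unit variance per column
dist(Z)                              # ~2.45; all three scores now contribute equally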
