Normalize sample data for clustering


I have three types of summary score, $a$, $b$ and $c$, for 200 samples:

$S_1, S_2, S_3, \dots, S_{200}$

$a_{s1}, a_{s2}, \dots, a_{s200}$

$b_{s1}, b_{s2}, \dots, b_{s200}$

$c_{s1}, c_{s2}, \dots, c_{s200}$

Each of these scores is essentially the number of times that some event occurs in the data of each sample. I wish to find groups of these samples by correlation-based clustering. However, the scales for each of these scores are very different:

Summary of $a$:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 2.0    36.0    55.0    52.5    69.0   139.0

Summary of $b$:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
 8.0    99.5   285.0   292.7   737.5  2624.0

Summary of $c$:

Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
40.0   111.0   176.0   300.4   554.5   779.0

Do I need to normalize the scores? If so, how?

NB. I want to make no assumptions about the dependence or independence between these types of events and hence between these summary scores.

So, I've decided to try clustering with Euclidean distance. To get normalized and transformed data, I'm doing the following:

1. Test whether each score fits a normal distribution with the Shapiro-Wilk test.

   1. If not, apply a Box-Cox transformation using $\lambda$ from a boxcoxfit.

2. Get the z-score for each score.

3. Cluster with the Euclidean distance measure.
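
The steps above can be sketched roughly as follows. This is only an illustration, assuming Python with NumPy and SciPy rather than the R functions mentioned above. The helper name `normalize_scores`, the 0.05 significance threshold, the synthetic count data, average linkage and the cut into three clusters are all assumptions made for the example, not part of the original analysis.

```python
import numpy as np
from scipy import stats
from scipy.cluster.hierarchy import linkage, fcluster


def normalize_scores(X, alpha=0.05):
    """Box-Cox transform columns that fail Shapiro-Wilk, then z-score every column.

    `normalize_scores` and `alpha` are hypothetical names chosen for this sketch.
    """
    X = np.asarray(X, dtype=float)
    out = np.empty_like(X)
    for j in range(X.shape[1]):
        col = X[:, j]
        # Shapiro-Wilk: a small p-value is evidence against normality.
        _, p_value = stats.shapiro(col)
        if p_value < alpha:
            # Box-Cox needs strictly positive values; lambda is fitted by
            # maximum likelihood when it is not supplied.
            col, _ = stats.boxcox(col)
        # z-score: zero mean and unit variance for this column.
        out[:, j] = stats.zscore(col)
    return out


rng = np.random.default_rng(0)
# Synthetic stand-in for the 200 x 3 matrix of counts (a, b, c); NOT the real data.
X = np.column_stack([
    rng.poisson(50, 200) + 1,
    rng.negative_binomial(1, 0.003, 200) + 1,
    rng.negative_binomial(2, 0.007, 200) + 1,
])

Z = normalize_scores(X)

# Hierarchical clustering on Euclidean distances between standardized samples.
tree = linkage(Z, method="average", metric="euclidean")
labels = fcluster(tree, t=3, criterion="maxclust")
```

Note that Box-Cox only applies to strictly positive values, so counts that can be zero would need an offset first (the `+ 1` above is one arbitrary choice).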

Does this seem reasonable?

Best Answer

Clustering in general requires a similarity (or distance) measure to partition your data. Do you know how to compute the similarity of $\vec{a}$ to $\vec{b}$? Whether you need normalization depends mainly on the answer to that question. If you have no such measure and want to use ordinary Euclidean distance, normalizing your data (bringing each variable to zero mean and unit variance) is recommended. Otherwise, the scores with the largest range will dominate the distance computation.
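
To make the effect concrete, here is a small sketch in Python with NumPy. The four samples are made-up numbers on roughly the scales reported in the question, and `pairwise_euclidean` and `nearest_neighbor` are hypothetical helpers written for this example. Sample 1 differs from sample 0 by 300 units in $b$, a small gap relative to the spread of $b$; sample 2 differs from sample 0 by 94 units in $a$, a large gap relative to the spread of $a$.

```python
import numpy as np

# Made-up scores for four samples; columns are a, b, c. Illustrative only.
X = np.array([
    [ 36.0,  100.0, 111.0],   # sample 0
    [ 38.0,  400.0, 115.0],   # sample 1: close to sample 0 in a and c
    [130.0,  120.0, 115.0],   # sample 2: far from sample 0 in a
    [ 60.0, 2600.0, 300.0],   # sample 3: gives b its large spread
])


def pairwise_euclidean(M):
    """Return the matrix of Euclidean distances between the rows of M."""
    diff = M[:, None, :] - M[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def nearest_neighbor(D, i):
    """Index of the row closest to row i, ignoring row i itself."""
    row = D[i].copy()
    row[i] = np.inf
    return int(np.argmin(row))


# Raw distances: the 300-unit gap in b outweighs the 94-unit gap in a.
raw = pairwise_euclidean(X)

# Standardize each column to zero mean and unit variance, then recompute.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
scaled = pairwise_euclidean(Z)

print(nearest_neighbor(raw, 0))     # 2
print(nearest_neighbor(scaled, 0))  # 1
```

On the raw scores, sample 0 is paired with sample 2 simply because $b$ is measured on a much larger scale than $a$. After standardization the pairing switches to sample 1, since each variable now contributes in units of its own standard deviation.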
