I have three types of summary score, $a$, $b$, and $c$, for 200 samples:
$S1, S2, S3,…, S200$
$a_{s1}, a_{s2}, …, a_{s200}$
$b_{s1}, b_{s2}, …, b_{s200}$
$c_{s1}, c_{s2}, …, c_{s200}$
Each of these scores is essentially the number of times that some event occurs in the data of each sample. I wish to find groups of these samples by correlation-based clustering. However, the scales for each of these scores are very different:
| Score | Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|-------|------|---------|--------|-------|---------|--------|
| $a$ | 2.0 | 36.0 | 55.0 | 52.5 | 69.0 | 139.0 |
| $b$ | 8.0 | 99.5 | 285.0 | 292.7 | 737.5 | 2624.0 |
| $c$ | 40.0 | 111.0 | 176.0 | 300.4 | 554.5 | 779.0 |
Do I need to normalize the scores? If so, how?
NB. I want to make no assumptions about the dependence or independence between these types of events and hence between these summary scores.
UPDATE:
I've decided to try clustering with Euclidean distance. To get normalized and transformed data, I'm doing the following:
1. Test whether each score fits a normal distribution with a Shapiro test.
2. If not, apply a Box-Cox transformation using the $\lambda$ from a boxcoxfit.
3. Compute the z-score for each score.
4. Cluster with a Euclidean distance measure.
Does this seem reasonable?
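For concreteness, here is a minimal sketch of those steps in Python. It rests on several assumptions: SciPy's `shapiro` and `boxcox` stand in for the Shapiro test and boxcoxfit mentioned above, the data are synthetic placeholders rather than my real scores, and the helper `normalize_scores`, the 0.05 threshold, average linkage, and the choice of four clusters are all arbitrary and only for illustration.

```python
import numpy as np
from scipy import stats
from scipy.cluster.hierarchy import linkage, fcluster


def normalize_scores(x, alpha=0.05):
    """Box-Cox transform x if Shapiro-Wilk rejects normality, then z-score it."""
    x = np.asarray(x, dtype=float)
    _, p_value = stats.shapiro(x)
    if p_value < alpha:
        # Box-Cox requires strictly positive values; lambda is fitted by maximum likelihood.
        x, _ = stats.boxcox(x)
    return (x - x.mean()) / x.std(ddof=1)


# Synthetic stand-in data (NOT the real scores): 200 samples, 3 positive count-like scores.
rng = np.random.default_rng(0)
scores = np.column_stack([
    rng.poisson(50, 200) + 1,
    rng.gamma(1.0, 300.0, 200) + 1,
    rng.gamma(2.0, 150.0, 200) + 1,
])

# Normalize each score type (column) separately.
z = np.column_stack([normalize_scores(scores[:, j]) for j in range(scores.shape[1])])

# Hierarchical clustering on Euclidean distances between samples (rows).
tree = linkage(z, method="average", metric="euclidean")
labels = fcluster(tree, t=4, criterion="maxclust")
```

Here `labels` holds one cluster index per sample. The transformation is decided per score type, so one column may be Box-Cox transformed while another is only z-scored.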
Best Answer
Clustering generally requires a similarity measure to compute a partitioning of your data. Do you know how to compute the similarity of $\vec{a}$ to $\vec{b}$? Whether you need normalization depends mainly on the answer to that question. If you have no such measure and want to use ordinary Euclidean distance, normalizing your data (bringing each variable to zero mean and unit variance) is recommended. Otherwise, the scores with the largest range will dominate the distance computation.
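The dominance effect can be illustrated with a short sketch. It assumes synthetic scores on three very different scales (not the data from the question) and uses a hypothetical helper, `squared_distance_share`, which relies on the fact that a variable's squared differences, summed over all pairs of samples, are proportional to its variance.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical scores on very different scales (not the data from the question).
X = np.column_stack([
    rng.normal(50, 25, 200),
    rng.normal(500, 600, 200),
    rng.normal(300, 200, 200),
])


def squared_distance_share(data):
    """Fraction of the total pairwise squared Euclidean distance due to each column."""
    # Summed over all pairs, a column's squared differences are proportional to its variance.
    var = data.var(axis=0, ddof=1)
    return var / var.sum()


# Standardize each variable to zero mean and unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

raw_share = squared_distance_share(X)  # the widest-range column takes most of the total
z_share = squared_distance_share(Z)    # each column contributes one third
print(raw_share, z_share)
```

Before standardization the second column accounts for most of the squared distance; afterwards all three variables contribute equally.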