Solved – Use a combination of grand mean and group mean centering to standardize variables

Tags: clustering, hierarchical clustering, k-means, normalization, standardization

I'm using cluster analysis to examine profiles of three variables, X1, X2, and X3.

Because the means and variances differ greatly among the three variables, I am considering standardizing them to have M = 0 and SD = 1.

To provide a bit more information, there are 100 observations in total, with 10 observations per individual (that is, 10 individuals).

There are a few ways to standardize the variables. One is to use the grand mean for each of the three variables (X1, X2, and X3). Another that is somewhat common in "person-centered" or "individual-centered" analyses is to use the group mean, where the group consists of the observations for each individual.
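To make the difference concrete, here is a minimal sketch of both options for a single variable. The toy values, the `person_a`/`person_b` labels, and the `standardize` helper are all made up for illustration; they are not from my data.

```python
from statistics import mean, stdev

# Hypothetical toy data: X1 values for two individuals (the "groups").
x1 = {
    "person_a": [2.0, 3.0, 4.0],
    "person_b": [6.0, 7.0, 8.0],
}

def standardize(values, center, scale):
    """Return (v - center) / scale for every value."""
    return [(v - center) / scale for v in values]

# Grand-mean standardization: one mean and one SD from all observations pooled.
pooled = [v for obs in x1.values() for v in obs]
grand_mean, grand_sd = mean(pooled), stdev(pooled)
grand_z = {p: standardize(obs, grand_mean, grand_sd) for p, obs in x1.items()}

# Group-mean standardization: each individual's own mean and SD.
group_z = {p: standardize(obs, mean(obs), stdev(obs)) for p, obs in x1.items()}

print(grand_z)
print(group_z)
```

With the grand mean, `person_a` stays below zero on every observation, so differences in level between individuals are preserved. With the group mean, both individuals end up with identical scores, so only within-person variation remains.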

In my present case, neither grand-mean nor group-mean centering seems appropriate, so I was wondering whether there are cases, similar to or very different from mine, in which the average of the grand mean and the group mean was used.

The goal is to take the group means into account, so that the standardized values reflect when an individual scores higher on some observations than on their other observations, while also reflecting how close the scores are to the grand mean.

So, for example, if the grand mean for X1 were equal to 3, and the mean for a group were 3.5, each of that group's observations for X1 would be centered around 3.25, the average of the two.

The same would be done for the standard deviation of X1, and the same process would be applied to the mean and standard deviation of the other variables.
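A rough sketch of the blended approach described above, for one variable. The function names `blend` and `blended_standardize` and the toy values are hypothetical; the data were chosen so that the grand mean is 3 and one group mean is 3.5, matching the example.

```python
from statistics import mean, stdev

def blend(grand_stat, group_stat):
    """Simple average of a grand-level and a group-level statistic."""
    return (grand_stat + group_stat) / 2

def blended_standardize(groups):
    """Center and scale each group's values by the average of the
    grand and group mean, and the average of the grand and group SD."""
    pooled = [v for obs in groups.values() for v in obs]
    grand_mean, grand_sd = mean(pooled), stdev(pooled)
    result = {}
    for person, obs in groups.items():
        center = blend(grand_mean, mean(obs))
        scale = blend(grand_sd, stdev(obs))
        result[person] = [(v - center) / scale for v in obs]
    return result

# Hypothetical X1 values: grand mean is 3.0, person_a's mean is 3.5.
x1 = {
    "person_a": [3.0, 3.5, 4.0],
    "person_b": [2.0, 2.5, 3.0],
}
z = blended_standardize(x1)
print(z)
```

Here `person_a` is centered around 3.25, so that individual's scores keep a positive average rather than being pulled to zero. Note that scores blended this way are not guaranteed to have M = 0 and SD = 1, either within individuals or overall.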

Would using a combination of grand mean and group mean centering to standardize variables be a viable approach?

Best Answer

Put your objective first, not your equations!

Yes, you could subtract the mean, and scale. But there are so many things you could do. For example, you could multiply everything by 0 (probably not beneficial).

Therefore, step back and rethink what you want to do.

Here are two choices you overlooked:

  1. For each attribute, take the mean of each individual. Now compute the standard deviation of these means. Scale the attribute by 1/SDmean.
  2. For each attribute, take the standard deviation of each individual. Take the mean of these standard deviations, and scale the attribute by 1/meanSD.

Which of 1 or 2 works better depends on your problem and the nature of your data.
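The two scaling choices listed above could be sketched as follows for a single attribute. This is only an illustration under assumed toy data; the function names are made up, and only scaling (no centering) is applied, as in the list.

```python
from statistics import mean, stdev

def scale_by_sd_of_means(groups):
    """Choice 1: divide every value by the SD of the per-individual means."""
    sd_of_means = stdev([mean(obs) for obs in groups.values()])
    return {p: [v / sd_of_means for v in obs] for p, obs in groups.items()}

def scale_by_mean_of_sds(groups):
    """Choice 2: divide every value by the mean of the per-individual SDs."""
    mean_of_sds = mean([stdev(obs) for obs in groups.values()])
    return {p: [v / mean_of_sds for v in obs] for p, obs in groups.items()}

# Hypothetical values of one attribute for two individuals.
x1 = {
    "person_a": [2.0, 3.0, 4.0],
    "person_b": [5.0, 7.0, 9.0],
}

choice1 = scale_by_sd_of_means(x1)
choice2 = scale_by_mean_of_sds(x1)
print(choice1)
print(choice2)
```

After choice 1, the individuals' means have unit standard deviation, so between-person differences are put on a common scale across attributes. After choice 2, the typical within-person spread is 1, so within-person fluctuations are what become comparable across attributes.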