It is not wise to transform the variables individually, because they belong together (as you noticed), nor to do k-means, because the data are counts (you might, but k-means is better suited to continuous attributes such as length, for example).
In your place, I would compute the chi-square distance (perfect for counts) between every pair of customers, based on the variables containing counts. Then do hierarchical clustering (for example, average linkage or complete linkage - these methods do not compute centroids and therefore don't require Euclidean distance) or some other clustering that works with arbitrary distance matrices.
Copying example data from the question:
customer | count_red | count_blue | count_green
---------+-----------+------------+------------
c0       |        12 |          5 |           0
c1       |         3 |          4 |           0
c2       |         2 |         21 |           0
c3       |         4 |          8 |           1
Consider the pair c0 and c1 and compute the chi-square statistic for their $2 \times 3$ frequency table. Take the square root of it (like you take it when you compute the usual Euclidean distance). That is your distance. If the distance is close to 0, the two customers are similar.
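To make this concrete (using the zero-column rule mentioned further below): for c0 = (12, 5, 0) and c1 = (3, 4, 0), drop the all-zero green column, leaving a $2 \times 2$ table with $N = 24$, row totals 17 and 7, and column totals 15 and 9. The expected counts are 10.625, 6.375, 4.375 and 2.625, so $\chi^2 = \sum (O-E)^2/E \approx 1.627$ and the distance is $\sqrt{1.627} \approx 1.275$, the c0-c1 entry in the first matrix below.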
It may bother you that the row sums in your table differ, so that this affects the chi-square distance when you compare c0 with c1 vs. c0 with c2. In that case, compute the (root of the) Phi-square distance: $\Phi^2 = \chi^2/N$, where $N$ is the combined total count of the two rows (customers) currently considered. It is thus a distance normalized with respect to the overall counts.
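Continuing the c0-c1 example: $N = 24$, so $\Phi^2 = 1.627/24 \approx 0.0678$ and the distance is $\sqrt{0.0678} \approx 0.260$, the corresponding entry in the second matrix below.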
Here is the matrix of sqrt(Chi-sq) distances between your four customers:
       c0     c1     c2     c3
c0   .000  1.275  4.057  2.292
c1  1.275   .000  2.124   .862
c2  4.057  2.124   .000  2.261
c3  2.292   .862  2.261   .000
And here is the matrix of sqrt(Phi-sq) distances:
       c0     c1     c2     c3
c0   .000   .260   .641   .418
c1   .260   .000   .388   .193
c2   .641   .388   .000   .377
c3   .418   .193   .377   .000
So, the distance between any two rows of the data is the (square root of the) chi-square or phi-square statistic of their $2 \times p$ frequency table ($p$ is the number of columns in the data). If any column(s) in the current $2 \times p$ table are entirely zero, drop those columns and compute the distance from the remaining nonzero columns (this is legitimate, and it is what SPSS does, for example, when it computes this distance). The chi-square distance is actually a weighted Euclidean distance.
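For illustration, here is a minimal sketch of the whole procedure in base R. The helper name chisq_dist is mine; chisq.test(..., correct = FALSE) returns the plain Pearson statistic, and its small-expected-count warnings can be ignored here, since the statistic is used as a distance rather than as a test:
# sqrt(Chi-sq) and sqrt(Phi-sq) distance between two rows of counts
chisq_dist <- function(x, y) {
  tab <- rbind(x, y)
  tab <- tab[, colSums(tab) > 0, drop = FALSE]    # drop all-zero columns
  chi <- unname(chisq.test(tab, correct = FALSE)$statistic)
  c(chi = sqrt(chi), phi = sqrt(chi / sum(tab)))  # Phi-sq = Chi-sq / N
}

counts <- rbind(c0 = c(12, 5, 0), c1 = c(3, 4, 0),
                c2 = c(2, 21, 0), c3 = c(4, 8, 1))

# pairwise sqrt(Phi-sq) distances, reproducing the second matrix above
n <- nrow(counts)
D <- matrix(0, n, n, dimnames = list(rownames(counts), rownames(counts)))
for (i in 1:(n - 1)) for (j in (i + 1):n)
  D[i, j] <- D[j, i] <- chisq_dist(counts[i, ], counts[j, ])["phi"]

# hierarchical clustering on the precomputed distances (average linkage)
hc <- hclust(as.dist(D), method = "average")
plot(hc)
Because as.dist hands hclust a ready-made distance matrix, no centroids (and hence no Euclidean assumption) are ever needed.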
Generally you don't make the counts comparable by transforming them; rather, you take account of the different exposures when computing the expected values in the chi-squared test.
Under a null hypothesis of equal event rates (events per hour), the two periods can simply be combined to estimate the rate ... that is $275+129$ events in $120+48$ hours, so we estimate the rate as $\frac{275+129}{120+48}$ events per hour, and the expected count in period 1 is then $(275+129)\frac{120}{120+48}\approx 288.57$ and in period 2 is $(275+129)\frac{48}{120+48}\approx 115.43$.
With those expected values, the chi-square goodness of fit statistic, $\sum_i \frac{(O_i-E_i)^2}{E_i}$ is straightforward to calculate by hand; it has $k-1=1$ degree of freedom in this example. However, it's a pretty standard calculation - for example, here it is in R:
eventcounts  <- c(275, 129)   # observed events in each period
exposuretime <- c(120, 48)    # exposure in hours for each period
# rescale.p = TRUE rescales the exposures to the null proportions 120/168 and 48/168
chisq.test(eventcounts, p = exposuretime, rescale.p = TRUE)
Chi-squared test for given probabilities
data: eventcounts
X-squared = 2.2339, df = 1, p-value = 0.135
which is the same result as doing it by hand.
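The same number can be reproduced by spelling out the hand calculation above:
O <- c(275, 129)                       # observed counts
E <- sum(O) * c(120, 48) / (120 + 48)  # expected counts under equal rates: 288.57, 115.43
sum((O - E)^2 / E)                     # 2.2339, the X-squared value above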
Best Answer
Your transformation to 6 graded categories has already thrown away much of the information you have. What you propose seems a further step without a clear statistical rationale.
At best, variations such as those you report are interesting patterns that you will want to describe and explain. Some model for an ordered response, such as ordinal logit, might be helpful if you insist on using those grades.
But once you have those grades 0 to 5, their means and SDs are of doubtful use, if only because they depend on an arbitrary transformation. (Note that there is no sense in which those grades should be approximately normal, even after transformation, if that is what you are thinking.) So, one false step can't be corrected by another. Put otherwise, why would standardization make those variations easier to understand? Trying to explain differences in standardized values would be difficult, if not impossible, unless you reinserted the means and SDs.
If this were my problem I would use some kind of count model, possibly Poisson regression, to deal directly with the number of individuals as a response. An arbitrary degradation of the data to 6 categories would have no obvious scientific or statistical rationale or interest. I think you would have an uphill task to justify that convincingly in a report. If the counts seem too spiky (several zeros, some relatively high values) to handle easily, then an old-fashioned but still possibly useful method would be some transformation such as square roots.
A fuller answer would need more information on what data you have. At present the picture is of counts at various sites in different kinds of environment. With nothing else said, that points to an ANOVA on transformed counts (old way) or a Poisson or other count model (newer way). In such analysis, the predicted mean counts automatically give you the framework you desire, defining what is typical for an environment and hence what is not.
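To make that concrete, here is a minimal sketch of such a count model in R; the data and variable names (count, env) are made up purely for illustration:
# hypothetical example: counts of individuals at sites in three environments
dat <- data.frame(
  count = c(0, 3, 12, 5, 1, 9, 7, 0, 4),
  env   = factor(rep(c("forest", "urban", "wetland"), each = 3))
)

fit <- glm(count ~ env, family = poisson, data = dat)
summary(fit)

# predicted mean count per environment: the model's notion of what is typical
predict(fit, newdata = data.frame(env = levels(dat$env)), type = "response")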