Transforming the variables individually is not wise, because they belong together (as you noticed). Nor is k-means a good fit, because the data are counts: you could run it, but k-means is better suited to continuous attributes such as length.
In your place, I would compute the chi-square distance (perfect for counts) between every pair of customers, based on the variables containing counts. Then do hierarchical clustering (for example, the average linkage or complete linkage method; they do not compute centroids and therefore don't require euclidean distance) or some other clustering method that works with an arbitrary distance matrix.
Copying example data from the question:
customer | count_red | count_blue | count_green
---------|-----------|------------|------------
c0 | 12 | 5 | 0
c1 | 3 | 4 | 0
c2 | 2 | 21 | 0
c3 | 4 | 8 | 1
Consider the pair c0 and c1 and compute the chi-square statistic for their 2x3 frequency table. Take its square root (as you do when you compute the usual euclidean distance). That is your distance. If the distance is close to 0, the two customers are similar.
It may bother you that the row sums in your table differ, which affects the chi-square distance when you compare c0 with c1 versus c0 with c2. In that case compute the (root of the) Phi-square distance: Phi-sq = Chi-sq/N, where N is the combined total count in the two rows (customers) currently considered. It is thus a distance normalized with respect to the overall counts.
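The two distances just described can be sketched in plain Python. This is only an illustration, not the code used to produce the numbers in this answer: the function names are made up for the example, columns that are zero for both customers are skipped so that no expected count is zero, and every customer is assumed to have at least one nonzero count.

```python
from math import sqrt

# Example counts copied from the table above: (red, blue, green) per customer.
counts = {
    "c0": (12, 5, 0),
    "c1": (3, 4, 0),
    "c2": (2, 21, 0),
    "c3": (4, 8, 1),
}


def chi_square(row_a, row_b):
    """Pearson chi-square statistic of the 2 x p table formed by two rows.

    Columns that are zero in both rows are skipped, because their
    expected counts would be zero. Assumes neither row is all zeros.
    """
    cols = [(a, b) for a, b in zip(row_a, row_b) if a + b > 0]
    total_a = sum(a for a, _ in cols)
    total_b = sum(b for _, b in cols)
    n = total_a + total_b
    stat = 0.0
    for a, b in cols:
        col_total = a + b
        exp_a = total_a * col_total / n  # expected count, first row
        exp_b = total_b * col_total / n  # expected count, second row
        stat += (a - exp_a) ** 2 / exp_a + (b - exp_b) ** 2 / exp_b
    return stat


def chi_distance(row_a, row_b):
    """Square root of the chi-square statistic."""
    return sqrt(chi_square(row_a, row_b))


def phi_distance(row_a, row_b):
    """Square root of Phi-sq = Chi-sq / N, N being the combined total count."""
    n = sum(row_a) + sum(row_b)
    return sqrt(chi_square(row_a, row_b) / n)


if __name__ == "__main__":
    names = sorted(counts)
    for title, fn in (("sqrt(Chi-sq)", chi_distance), ("sqrt(Phi-sq)", phi_distance)):
        print(title)
        for r in names:
            print("  ".join(f"{fn(counts[r], counts[c]):.3f}" for c in names))
```

Run as a script, this prints one 4 x 4 matrix per distance, with rows and columns in the order c0, c1, c2, c3.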
Here is the matrix of sqrt(Chi-sq) distances between your four customers (rows and columns in the order c0, c1, c2, c3):
.000 1.275 4.057 2.292
1.275 .000 2.124 .862
4.057 2.124 .000 2.261
2.292 .862 2.261 .000
And here is the matrix of sqrt(Phi-sq) distances, in the same order:
.000 .260 .641 .418
.260 .000 .388 .193
.641 .388 .000 .377
.418 .193 .377 .000
So, the distance between any two rows of the data is the (sq. root of the) chi-square or phi-square statistic of the 2 x p frequency table (p is the number of columns in the data). If any column in the current 2 x p table is entirely zero, drop that column and compute the distance from the remaining nonzero columns (this is fine, and it is how SPSS, for example, computes the distance). Chi-square distance is actually a weighted euclidean distance.
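To show how such a distance matrix feeds into hierarchical clustering, here is a minimal sketch of average linkage in plain Python, applied to the sqrt(Phi-sq) matrix printed above. The function name and the naive search over all cluster pairs are choices made for this illustration; in practice you would more likely hand the matrix to a library routine (for instance SciPy's `scipy.cluster.hierarchy.linkage` with `method='average'`).

```python
# sqrt(Phi-sq) distances from the matrix above, in the order c0, c1, c2, c3.
labels = ["c0", "c1", "c2", "c3"]
dist = [
    [0.000, 0.260, 0.641, 0.418],
    [0.260, 0.000, 0.388, 0.193],
    [0.641, 0.388, 0.000, 0.377],
    [0.418, 0.193, 0.377, 0.000],
]


def average_linkage(dist, labels):
    """Naive agglomerative clustering on a precomputed distance matrix.

    At each step the two clusters with the smallest mean pairwise
    distance between their members are merged. No centroids are
    computed, so any distance matrix can be used. Returns the merge
    history as (members_a, members_b, linkage_distance) tuples.
    """
    clusters = [[i] for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                pairs = [dist[a][b] for a in clusters[i] for b in clusters[j]]
                d = sum(pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append(
            ([labels[k] for k in clusters[i]], [labels[k] for k in clusters[j]], d)
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return merges


if __name__ == "__main__":
    for left, right, d in average_linkage(dist, labels):
        print(f"merge {left} + {right} at distance {d:.3f}")
```

With only four customers the merge history is short; the point of the sketch is that the algorithm needs nothing beyond the pairwise distances.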
Best Answer
When reporting ordered or graded scales, working with simple descriptive summaries like
% improved $−$ % deteriorated
or
% ranking as good $−$ % ranking as bad
is sometimes helpful. In such summaries, omitting any neutral or middle category is common (but not essential). Clearly, such a measure gives the preponderance of two tails: if everybody improved, we get $100$, and, if everybody got worse, we get $−100$.
In political terms, an election could be imagined in which there are votes “for” and “against” from these two categories, and from that context, these measures may be described as plurality measures. (Is there a better general term, or any term that is standard in some field, for particular examples of such measures?) Whatever the terminology, such measures are discussed in Tukey (1977, pp.498–502), Zeisel (1985, pp.75–77), and Wilkinson (2005, pp.57–58).
Naturally, the percent formulation is not compulsory, and you could just as easily — in fact, a little more easily — work with proportions or fractions with results ranging from $1$ to $−1$. In either case, using a difference is natural whenever thinking is in terms of the percent or proportion scale being used. Also, a ratio such as
% ranking as good / % ranking as bad
may be less desirable with small denominators. Either the result may be unstable, or, if the denominators are ever 0, it may be indeterminate.
Let us illustrate both points with the idea of looking at gender roles across a set of activities, and
% who are female $−$ % who are male
as a way of summarizing data on who does what. If, in a village, 21 women and zero men do laundry, four men and 11 women fetch water, and 14 men and zero women take care of cows, then neither the male–female ratio nor the female–male ratio can be used throughout to summarize the balance of the sexes. Whenever zero is a denominator, the ratio is indeterminate. Even if no zeros are present, we should worry about sensitivity. However, the measure above is one which is always practical.
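The village example can be worked through in a few lines of Python. This is a sketch under stated assumptions: the function names are invented for the illustration, percentages are taken over the two categories only (women plus men doing each activity), and an undefined ratio is reported as `None`.

```python
# Village example from the text: (women, men) doing each activity.
activities = {
    "laundry": (21, 0),
    "fetching water": (11, 4),
    "tending cows": (0, 14),
}


def percent_difference(first, second):
    """% in the first category minus % in the second, out of both combined.

    Assumes at least one person is counted, so the total is nonzero.
    """
    total = first + second
    return 100.0 * (first - second) / total


def ratio(numerator, denominator):
    """Plain ratio; None marks the indeterminate case of a zero denominator."""
    return numerator / denominator if denominator else None


if __name__ == "__main__":
    for name, (women, men) in activities.items():
        print(
            name,
            f"%female - %male = {percent_difference(women, men):.1f}",
            f"female/male = {ratio(women, men)}",
            f"male/female = {ratio(men, women)}",
        )
```

The difference is defined for all three activities, whereas each of the two ratios is unavailable for one of them.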
All that said, it should be evident that the raw frequencies remain important and should be reported, or at least easily accessible. "1/3 of the cats showed improvement, 1/3 deterioration, but the other cat ran away" has an equivalent here too.
The details should be simple in your favourite software, but for Stata details see Cox (2007) on which this is based.
Cox, N.J. 2007. How do I calculate measures such as percent improved minus percent deteriorated? http://www.stata.com/support/faqs/data-management/plurality-measures/
Tukey, J. W. 1977. Exploratory Data Analysis. Reading, MA: Addison–Wesley.
Wilkinson, L. 2005. The Grammar of Graphics. 2nd ed. New York: Springer.
Zeisel, H. 1985. Say It with Figures. 6th ed. New York: Harper & Row.