It is not wise to transform the variables individually, because they belong together (as you noticed), nor to use k-means, because the data are counts (you could, but k-means is better suited to continuous attributes such as length, for example).
In your place, I would compute the chi-square distance (well suited to counts) between every pair of customers, based on the count variables. Then do hierarchical clustering (for example, with the average-linkage or complete-linkage method, which do not compute centroids and therefore do not require Euclidean distance) or some other clustering method that works with an arbitrary distance matrix.
Copying example data from the question:
customer | count_red | count_blue | count_green
---------|-----------|------------|------------
c0       |        12 |          5 |           0
c1       |         3 |          4 |           0
c2       |         2 |         21 |           0
c3       |         4 |          8 |           1
Consider the pair c0 and c1 and compute the chi-square statistic for their 2x3 frequency table. Take its square root (just as you do when computing the usual Euclidean distance). That is your distance. The closer the distance is to 0, the more similar the two customers are.
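As a sketch (my own illustrative code, not part of the original answer; the function name `chi_sq_distance` is just an assumption), this pairwise distance can be computed with NumPy:

```python
import numpy as np

def chi_sq_distance(row_a, row_b):
    """sqrt of the chi-square statistic of the 2 x p frequency table
    formed by two customers' count rows."""
    table = np.array([row_a, row_b], dtype=float)
    table = table[:, table.sum(axis=0) > 0]        # drop all-zero columns
    n = table.sum()
    expected = (table.sum(axis=1, keepdims=True)
                * table.sum(axis=0, keepdims=True) / n)
    chi_sq = ((table - expected) ** 2 / expected).sum()
    return np.sqrt(chi_sq)

c0 = [12, 5, 0]
c1 = [3, 4, 0]
print(round(chi_sq_distance(c0, c1), 3))  # 1.275, matching the matrix below
```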
It may bother you that the row sums in your table differ, which affects the chi-square distance when you compare c0 with c1 vs. c0 with c2. In that case, compute the (square root of the) phi-square distance: Phi-sq = Chi-sq/N, where N is the combined total count of the two rows (customers) currently being compared. It is thus a distance normalized with respect to the overall counts.
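A minimal NumPy sketch of this normalization (again my own illustrative code; `phi_sq_distance` is a hypothetical name):

```python
import numpy as np

def phi_sq_distance(row_a, row_b):
    """sqrt(Chi-sq / N): chi-square distance normalized by the combined
    total count N of the two rows being compared."""
    table = np.array([row_a, row_b], dtype=float)
    table = table[:, table.sum(axis=0) > 0]        # drop all-zero columns
    n = table.sum()
    expected = (table.sum(axis=1, keepdims=True)
                * table.sum(axis=0, keepdims=True) / n)
    chi_sq = ((table - expected) ** 2 / expected).sum()
    return np.sqrt(chi_sq / n)

print(round(phi_sq_distance([12, 5, 0], [3, 4, 0]), 3))  # 0.260
```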
Here is the matrix of sqrt(Chi-sq) distances between your four customers:
.000 1.275 4.057 2.292
1.275 .000 2.124 .862
4.057 2.124 .000 2.261
2.292 .862 2.261 .000
And here is the matrix of sqrt(Phi-sq) distances:
.000 .260 .641 .418
.260 .000 .388 .193
.641 .388 .000 .377
.418 .193 .377 .000
So, the distance between any two rows of the data is the (square root of the) chi-square or phi-square statistic of their 2 x p frequency table (p is the number of columns in the data). If any column in the current 2 x p table is entirely zero, drop that column and compute the distance from the remaining nonzero columns (this is fine, and it is what SPSS does, for example, when computing the distance). The chi-square distance is actually a weighted Euclidean distance.
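The full distance matrix can be reproduced with a short NumPy sketch (my own code, not from the answer; it drops all-zero columns as just described and should reproduce the sqrt(Chi-sq) matrix above):

```python
import numpy as np

# the four customers' count rows from the example table
data = np.array([[12, 5, 0],
                 [3, 4, 0],
                 [2, 21, 0],
                 [4, 8, 1]], dtype=float)

def chi_sq(a, b):
    """Chi-square statistic of the 2 x p table formed by rows a and b."""
    t = np.array([a, b])
    t = t[:, t.sum(axis=0) > 0]    # drop columns that are zero in both rows
    n = t.sum()
    e = t.sum(axis=1, keepdims=True) * t.sum(axis=0, keepdims=True) / n
    return ((t - e) ** 2 / e).sum()

m = len(data)
dist = np.array([[np.sqrt(chi_sq(data[i], data[j])) for j in range(m)]
                 for i in range(m)])
print(np.round(dist, 3))
```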
My experience with the mlogit package is still rather limited, but if I read Croissant's vignette correctly (see the beginning of Sec. 1.2, Model description, page 7), the alt variable in your model is specified as alternative-specific with a generic coefficient, and NOT as an individual-specific covariate; those variables are placed between the pipes.
I've had luck with setting epsilon to half the smallest non-zero value, replacing all 0 values with epsilon and all 1 values with 1 - epsilon, and then applying the logit transformation.
This method keeps the original form of the logit transformation, but allows 0 and 1 to be transformed to values that match the overall shape of the intended transformation (note the black dots in the figure at raw = 0 and 1). In particular, it preserves the property that 0.5 is transformed to 0 and that the remaining values are symmetric.
On the other hand, adding the smallest non-zero value to every observation, as described in the paper, changes the shape of the curve and destroys the symmetry.
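A minimal NumPy sketch of this epsilon recipe (my own illustrative code; `safe_logit` is a hypothetical name, and the input is assumed to contain at least one non-zero value):

```python
import numpy as np

def safe_logit(p):
    """Logit with 0 clamped to epsilon and 1 to 1 - epsilon, where
    epsilon is half the smallest non-zero value; preserves the
    symmetry of the transformation around 0.5."""
    p = np.asarray(p, dtype=float)
    eps = p[p > 0].min() / 2.0            # half the smallest non-zero value
    p = np.where(p == 0, eps, p)
    p = np.where(p == 1, 1 - eps, p)
    return np.log(p / (1 - p))

raw = np.array([0.0, 0.1, 0.5, 0.9, 1.0])
print(safe_logit(raw))   # symmetric about 0; 0.5 maps to exactly 0
```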