K-means cluster analysis with K=2 as a binary classifier

classification, clustering, k-means

I ran a K-means cluster analysis with $K=2$ on two variables, height and weight, and obtained two clusters. I chose $K=2$ because each observation belongs to either a man or a woman. I then compared the clusters with the true classification and found that K-means did pretty well.

Does this sound logical?

Best Answer

It depends on what you mean by "did pretty well" and on the population. For general adult populations in the developed world I would not expect this to work very well: heights and weights alone are not great at distinguishing the genders.

The best and easiest way to assess the situation is to make a scatterplot of height and weight, distinguishing the point symbols by gender. This one is from the (US) NHANES 2011-2012 data, where I have removed data for anyone younger than 18 years. Note the logarithmic scales, which render each point cloud approximately oval in shape. (You may guess which kind of symbol--solid red or open blue--corresponds to which gender.)

Figure 1

The substantial overlap between the clouds for the two genders (between 160 and 170 centimeters, approximately) shows that no cluster analysis based solely on height and weight could possibly do a very good job discriminating men from women. The partial lack of overlap, revealed by the cloud of blue above 180 cm and cloud of red below 150 cm, shows that a clustering result would nevertheless have some discriminating power. Whether this would be good enough depends on your objectives and standards for predictive accuracy.

If, in your dataset, the two clouds appear to have little or no overlap, then not only can you expect a cluster analysis (like K-means) to work well, you can already see where the cluster centers should be and where a dividing line ("linear discriminator") would approximately be located.
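To make the "dividing line" concrete: once two cluster centers are fixed, K-means assigns every point to the nearer center, so the boundary between the two clusters is the perpendicular bisector of the segment joining the centers. Here is a minimal Python sketch of that rule. The function names and the two centers are hypothetical, chosen only for illustration; they are not estimates from any dataset, and in practice the distances would be computed on log or standardized coordinates rather than raw centimeters and kilograms.

```python
import math


def bisector(c1, c2):
    """Return (w, b) such that w.x + b > 0 exactly when x is closer to c2 than to c1."""
    w = (c2[0] - c1[0], c2[1] - c1[1])
    mid = ((c1[0] + c2[0]) / 2, (c1[1] + c2[1]) / 2)
    b = -(w[0] * mid[0] + w[1] * mid[1])
    return w, b


def side(point, w, b):
    """Cluster (1 or 2) implied by the dividing line."""
    return 2 if w[0] * point[0] + w[1] * point[1] + b > 0 else 1


def nearest(point, c1, c2):
    """Cluster (1 or 2) implied by the nearest-center rule that K-means uses."""
    return 1 if math.dist(point, c1) <= math.dist(point, c2) else 2


# Hypothetical centers as (height in cm, weight in kg); illustration only.
c1, c2 = (162.0, 70.0), (176.0, 85.0)
w, b = bisector(c1, c2)

for p in [(155.0, 60.0), (180.0, 90.0), (160.0, 95.0)]:
    print(p, "line says", side(p, w, b), "nearest center says", nearest(p, c1, c2))
```

The two rules agree on every point, which is why eyeballing the cluster centers on a scatterplot also tells you roughly where the K-means boundary will fall.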

Here are two k-means solutions for these data: one based on the logarithms and another based on separately standardized heights and weights. The two clusters are distinguished by the lightness of the symbols.

Figure 2

(The number of cases shown in these plots is 90 fewer than the number reported in the first figure because of missing values, which should have been excluded from the start.)
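For readers who want to try the two preprocessing choices (logarithms versus separately standardized variables) on their own data, here is a rough sketch in plain Python. It is not the code used for the plots above: the helper names `standardize` and `two_means` are hypothetical, the starting centers are an arbitrary deterministic choice, and the data are synthetic draws invented purely so the example runs, not NHANES values.

```python
import math
import random
import statistics


def standardize(values):
    """Center to mean 0 and scale to (population) standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def two_means(points, n_iter=100):
    """Lloyd's algorithm with K=2 on (x, y) tuples; returns labels (0/1) and centers."""
    # Arbitrary deterministic start for this sketch: lexicographic min and max points.
    centers = [min(points), max(points)]
    labels = None
    for _ in range(n_iter):
        new_labels = [
            0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        for k in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # keep the old center if a cluster empties out
                centers[k] = (
                    statistics.fmean(m[0] for m in members),
                    statistics.fmean(m[1] for m in members),
                )
    return labels, centers


# Synthetic, made-up data (NOT NHANES): two overlapping groups of 200 cases each.
rng = random.Random(0)
heights = [rng.lognormvariate(math.log(162), 0.04) for _ in range(200)]
heights += [rng.lognormvariate(math.log(176), 0.04) for _ in range(200)]
weights = [rng.lognormvariate(math.log(70), 0.20) for _ in range(200)]
weights += [rng.lognormvariate(math.log(85), 0.20) for _ in range(200)]

# Variant 1: cluster the logarithms.
log_points = [(math.log(h), math.log(w)) for h, w in zip(heights, weights)]
log_labels, _ = two_means(log_points)

# Variant 2: cluster the separately standardized variables.
std_points = list(zip(standardize(heights), standardize(weights)))
std_labels, _ = two_means(std_points)

print("cluster sizes (log):", log_labels.count(0), log_labels.count(1))
print("cluster sizes (standardized):", std_labels.count(0), std_labels.count(1))
```

The choice of scaling matters because K-means uses Euclidean distance: without standardization, whichever variable has the larger numerical spread dominates the clustering.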

Evidently in both cases the clusters, although associated with gender, fail to separate the two colors very well. The better-looking solution, based on the standardized data, yields these cross-tabulation statistics of cluster and gender:

        Cluster
Gender      1    2
  Male   1951  786
  Female  586 2202

29% of all males (786 of 2737) and 21% of all females (586 of 2788) are misclassified.
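The arithmetic behind these percentages can be checked with a few lines of Python. The counts are copied from the table; the only assumption is the reading of cluster 1 as the "male" cluster and cluster 2 as the "female" cluster, since each gender is in the majority there. The variable and function names are illustrative.

```python
# Cross-tabulation from the table: counts in (cluster 1, cluster 2) for each gender.
table = {"Male": (1951, 786), "Female": (586, 2202)}

# Assumption: each cluster is named after the gender that dominates it.
home_cluster = {"Male": 0, "Female": 1}  # index into the count tuples


def misclassification_rates(table, home_cluster):
    """Per-gender and overall share of cases falling outside their 'home' cluster."""
    rates = {}
    wrong_total = 0
    n_total = 0
    for gender, counts in table.items():
        n = sum(counts)
        wrong = n - counts[home_cluster[gender]]
        rates[gender] = wrong / n
        wrong_total += wrong
        n_total += n
    return rates, wrong_total / n_total


rates, overall = misclassification_rates(table, home_cluster)
for gender, rate in rates.items():
    print(f"{gender}: {rate:.1%} misclassified")
print(f"Overall: {overall:.1%} misclassified")
```

By the same computation, the overall error is 1372 of 5525 cases, or about 25%.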
