Cluster Analysis – Assigning Weights to Variables in Clustering

Tags: clustering, stata

I want to assign different weights to the variables in my cluster analysis, but my program (Stata) doesn't seem to have an option for this, so I need to do it manually.

Imagine 4 variables A, B, C, D. The weights for those variables should be

w(A)=50%
w(B)=25%
w(C)=10%
w(D)=15%

I am wondering whether one of the following two approaches would actually do the trick:

  1. First standardize all variables (e.g. by their range), then multiply each standardized variable by its weight, then do the cluster analysis.
  2. First multiply each variable by its weight, then standardize the weighted variables, then do the cluster analysis.

Or are both ideas complete nonsense?

[EDIT]
The clustering algorithms I wish to use (I am trying 3 different ones) are k-means, weighted-average linkage and average linkage. I plan to use weighted-average linkage to determine a good number of clusters, which I then plug into k-means.

Best Answer

One way to assign a weight to a variable is by changing its scale. The trick works for the clustering algorithms you mention, viz. k-means, weighted-average linkage and average-linkage.
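To see why rescaling acts as weighting: in a squared Euclidean distance, each variable's contribution is scaled by the square of whatever constant you multiplied it by. A minimal stdlib-only Python sketch (the toy data and the weights from the question are illustrative assumptions, not Stata code):

```python
# Toy data: 4 variables A, B, C, D measured on very different raw scales.
raw = {
    "A": [12.0, 55.0, 30.0],
    "B": [0.1, 0.9, 0.4],
    "C": [100.0, 300.0, 250.0],
    "D": [3.0, 7.0, 5.0],
}
weights = {"A": 0.50, "B": 0.25, "C": 0.10, "D": 0.15}

def range_standardize(values):
    """Rescale a list of values to [0, 1] by its range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Approach 1: standardize each variable, then multiply by its weight.
std = {k: range_standardize(v) for k, v in raw.items()}
weighted = {k: [x * weights[k] for x in std[k]] for k in raw}

# In squared Euclidean distance, variable k's contribution is scaled
# by weights[k]**2 relative to the unweighted standardized data:
for k in raw:
    d_std = (std[k][0] - std[k][1]) ** 2
    d_w = (weighted[k][0] - weighted[k][1]) ** 2
    assert abs(d_w - weights[k] ** 2 * d_std) < 1e-12
```

Note the squaring: with Euclidean-type distances, multiplying a standardized variable by w gives it a relative weight of w² in the squared distance. If you want the contribution to the squared distance to be proportional to w itself, multiply by √w instead.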

Kaufman, L., and Rousseeuw, P. J. (2005), Finding Groups in Data: An Introduction to Cluster Analysis, page 11:

The choice of measurement units gives rise to relative weights of the variables. Expressing a variable in smaller units will lead to a larger range for that variable, which will then have a large effect on the resulting structure. On the other hand, by standardizing one attempts to give all variables an equal weight, in the hope of achieving objectivity. As such, it may be used by a practitioner who possesses no prior knowledge. However, it may well be that some variables are intrinsically more important than others in a particular application, and then the assignment of weights should be based on subject-matter knowledge (see, e.g., Abrahamowicz, 1985).

On the other hand, there have been attempts to devise clustering techniques that are independent of the scale of the variables (Friedman and Rubin, 1967). The proposal of Hardy and Rasson (1982) is to search for a partition that minimizes the total volume of the convex hulls of the clusters. In principle such a method is invariant with respect to linear transformations of the data, but unfortunately no algorithm exists for its implementation (except for an approximation that is restricted to two dimensions). Therefore, the dilemma of standardization appears unavoidable at present and the programs described in this book leave the choice up to the user.
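This also settles the questioner's second idea: if you multiply by the weights first and range-standardize afterwards, the standardization divides the constant factor right back out, so the weights have no effect at all. A quick stdlib-only Python check (toy values are an illustrative assumption):

```python
def range_standardize(values):
    """Rescale a list of values to [0, 1] by its range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

values = [12.0, 55.0, 30.0]  # one toy variable
w = 0.25                     # its intended weight

# Approach 2: weight first, standardize afterwards.
weighted_then_std = range_standardize([v * w for v in values])

# Plain standardization with no weight at all:
plain_std = range_standardize(values)

# The two are identical, since (w*v - w*lo) / (w*hi - w*lo) = (v - lo) / (hi - lo):
# approach 2 reduces to ordinary unweighted standardization.
assert all(abs(a - b) < 1e-12 for a, b in zip(weighted_then_std, plain_std))
```

So only approach 1 (standardize first, then weight) actually changes the relative influence of the variables on the distances.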

Abrahamowicz, M. (1985), The use of non-numerical a priori information for measuring dissimilarities, paper presented at the Fourth European Meeting of the Psychometric Society and the Classification Societies, 2-5 July, Cambridge (UK).

Friedman, H. P., and Rubin, J. (1967), On some invariant criteria for grouping data, J. Amer. Statist. Assoc., 62, 1159-1178.

Hardy, A., and Rasson, J. P. (1982), Une nouvelle approche des problèmes de classification automatique, Statist. Anal. Données, 7, 41-56.