Cluster Analysis – Assigning Weights to Variables in Clustering

Tags: clustering, stata

I want to assign different weights to the variables in my cluster analysis, but my program (Stata) doesn't seem to have an option for this, so I need to do it manually.

Imagine 4 variables A, B, C, D. The weights for those variables should be

w(A)=50%
w(B)=25%
w(C)=10%
w(D)=15%

I am wondering whether one of the following two approaches would actually do the trick:

  1. First standardize all variables (e.g. by their range), then multiply each standardized variable by its weight, then do the cluster analysis.
  2. First multiply each variable by its weight, then standardize the weighted variables, then do the cluster analysis.

Or are both ideas complete nonsense?

[EDIT]
The clustering algorithms I wish to use (I am trying 3 different ones) are k-means, weighted-average linkage and average linkage. I plan to use weighted-average linkage to determine a good number of clusters, which I then plug into k-means.

Best Answer

One way to assign a weight to a variable is by changing its scale. The trick works for the clustering algorithms you mention, viz. k-means, weighted-average linkage and average-linkage.
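To see why rescaling acts as weighting: in a squared Euclidean distance, each variable's contribution is scaled by the square of whatever constant you multiplied it by. A minimal stdlib-only Python sketch (the toy data and the weights from the question are illustrative assumptions, not Stata code):

```python
# Toy data: 4 variables A, B, C, D measured on very different raw scales.
raw = {
    "A": [12.0, 55.0, 30.0],
    "B": [0.1, 0.9, 0.4],
    "C": [100.0, 300.0, 250.0],
    "D": [3.0, 7.0, 5.0],
}
weights = {"A": 0.50, "B": 0.25, "C": 0.10, "D": 0.15}

def range_standardize(values):
    """Rescale a list of values to [0, 1] by its range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Approach 1: standardize each variable, then multiply by its weight.
std = {k: range_standardize(v) for k, v in raw.items()}
weighted = {k: [x * weights[k] for x in std[k]] for k in raw}

# In squared Euclidean distance, variable k's contribution is scaled
# by weights[k]**2 relative to the unweighted standardized data:
for k in raw:
    d_std = (std[k][0] - std[k][1]) ** 2
    d_w = (weighted[k][0] - weighted[k][1]) ** 2
    assert abs(d_w - weights[k] ** 2 * d_std) < 1e-12
```

Note the squaring: with Euclidean-type distances, multiplying a standardized variable by w gives it a relative weight of w² in the squared distance. If you want the contribution to the squared distance to be proportional to w itself, multiply by √w instead.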

Kaufman, L., and Rousseeuw, P. J. (2005), Finding Groups in Data: An Introduction to Cluster Analysis, page 11:

The choice of measurement units gives rise to relative weights of the variables. Expressing a variable in smaller units will lead to a larger range for that variable, which will then have a large effect on the resulting structure. On the other hand, by standardizing one attempts to give all variables an equal weight, in the hope of achieving objectivity. As such, it may be used by a practitioner who possesses no prior knowledge. However, it may well be that some variables are intrinsically more important than others in a particular application, and then the assignment of weights should be based on subject-matter knowledge (see, e.g., Abrahamowicz, 1985).

On the other hand, there have been attempts to devise clustering techniques that are independent of the scale of the variables (Friedman and Rubin, 1967). The proposal of Hardy and Rasson (1982) is to search for a partition that minimizes the total volume of the convex hulls of the clusters. In principle such a method is invariant with respect to linear transformations of the data, but unfortunately no algorithm exists for its implementation (except for an approximation that is restricted to two dimensions). Therefore, the dilemma of standardization appears unavoidable at present and the programs described in this book leave the choice up to the user.
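This also settles the questioner's second idea: if you multiply by the weights first and range-standardize afterwards, the standardization divides the constant factor right back out, so the weights have no effect at all. A quick stdlib-only Python check (toy values are an illustrative assumption):

```python
def range_standardize(values):
    """Rescale a list of values to [0, 1] by its range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

values = [12.0, 55.0, 30.0]  # one toy variable
w = 0.25                     # its intended weight

# Approach 2: weight first, standardize afterwards.
weighted_then_std = range_standardize([v * w for v in values])

# Plain standardization with no weight at all:
plain_std = range_standardize(values)

# The two are identical, since (w*v - w*lo) / (w*hi - w*lo) = (v - lo) / (hi - lo):
# approach 2 reduces to ordinary unweighted standardization.
assert all(abs(a - b) < 1e-12 for a, b in zip(weighted_then_std, plain_std))
```

So only approach 1 (standardize first, then weight) actually changes the relative influence of the variables on the distances.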

Abrahamowicz, M. (1985), The use of non-numerical a priori information for measuring dissimilarities, paper presented at the Fourth European Meeting of the Psychometric Society and the Classification Societies, 2-5 July, Cambridge (UK).

Friedman, H. P., and Rubin, J. (1967), On some invariant criteria for grouping data, J. Amer. Statist. Assoc., 62, 1159-1178.

Hardy, A., and Rasson, J. P. (1982), Une nouvelle approche des problèmes de classification automatique, Statist. Anal. Données, 7, 41-56.