Solved – Weighted cases in a cluster analysis for cases in SPSS

clustering, spss, weighted-data, weighted-sampling

I am conducting a cluster analysis (of cases) on a database in which weights are attached to the individual cases, to ensure that it mirrors the general population in terms of sociodemographic distribution.

Since hierarchical clustering ignores the attached weights, is it statistically sound to run an unweighted hierarchical cluster analysis to determine the best solution, and then use k-means with that number of clusters? Or would it be more reasonable to run various k-means solutions until a good fit is found?

Or can we find a way to perform a weighted hierarchical cluster analysis (where the objects to cluster are weighted and the weights are frequencies)?

Best Answer

Using K-means after hierarchical clustering, or hierarchical clustering after K-means, can sometimes be a sound trick in its own right - but not because of weighting.

Frequency weighting of objects when clustering objects

Now about weighting. To do hierarchical cluster analysis of cases with frequency weights attached to the cases (objects to cluster):

Approach 1, general. Propagate objects. Multiply the weights by a constant so that the smallest individual weight becomes about 1, then round the weights, and propagate (replicate) cases according to those frequencies. For example, if you have 4 groups of cases with corresponding case weights 0.55 0.23 1.98 1.14, multiplying by 4.35 yields 2.39 1.00 8.61 4.96, which round to the frequencies 2 1 9 5. Propagate each case of the corresponding group this number of times. In SPSS, the syntax to propagate cases is as follows.

loop #i= 1 to FREQ. /*FREQ is that recalculated weighting variable
xsave outfile= 'FILE.SAV' /keep= VARS. /*FILE.SAV is the dataset you save to hard disk: path and filename
           /*Optional /keep= VARS is the list of variables you want to save with the file
           /*In your case that will be of course all the features you cluster by
end loop.
exec.

If you need 10 times greater precision in matching the original fractional weights, multiply by 10 before rounding: 23.9 10.0 86.1 49.6, so that the frequencies of propagation will be 24 10 86 50. However, duplicating cases this many times may make the dataset too big for a single hierarchical cluster analysis. So don't insist on too much precision.
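The arithmetic above can be checked outside SPSS. Here is a minimal Python sketch (not SPSS syntax; the function names are my own, chosen for illustration) that rescales the weights so the smallest becomes about 1, or about 10 for the finer version, rounds half up, and then replicates the cases. It uses the exact factor 1/0.23 rather than the rounded 4.35, which gives the same frequencies here.

```python
import math

def weights_to_frequencies(weights, precision=1):
    """Scale weights so the smallest becomes about `precision`, then round half up."""
    scale = precision / min(weights)
    return [math.floor(w * scale + 0.5) for w in weights]

def propagate(cases, freqs):
    """Repeat each case as many times as its frequency says."""
    return [case for case, f in zip(cases, freqs) for _ in range(f)]

weights = [0.55, 0.23, 1.98, 1.14]                  # the four weights from the example
freqs = weights_to_frequencies(weights)             # [2, 1, 9, 5]
freqs_fine = weights_to_frequencies(weights, 10)    # [24, 10, 86, 50]

# Hypothetical case labels, one per weight group, just to show the replication.
expanded = propagate(["a", "b", "c", "d"], freqs)
print(freqs, freqs_fine, len(expanded))
```

The propagated dataset has as many rows as the sum of the frequencies, which is why the finer version grows ten times larger.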

On the other hand, a big propagated dataset can be cluster-analyzed by randomly selected subsamples, several times. Then you could combine the results (several approaches are possible). [Actually, to perform such resampling with replacement in SPSS you don't need to propagate the data first. However, I will stop here and won't go into the details of the syntax to do it.]
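To illustrate the idea in the bracketed remark (in Python, not in SPSS syntax, which I am not showing): drawing cases with replacement with probability proportional to their weights gives subsamples of the same kind as drawing uniformly from the propagated dataset, so the propagation step can be skipped. A minimal sketch, with a hypothetical function name and made-up case labels:

```python
import random

def weighted_subsample(cases, weights, size, seed=None):
    """Draw `size` cases with replacement, with probability proportional to weight."""
    rng = random.Random(seed)
    return rng.choices(cases, weights=weights, k=size)

cases = ["a", "b", "c", "d"]            # hypothetical case labels
weights = [0.55, 0.23, 1.98, 1.14]      # the fractional weights can be used as they are

# Several subsamples, each of which could be clustered separately.
subsamples = [weighted_subsample(cases, weights, size=10, seed=s) for s in range(3)]
for sample in subsamples:
    print(sample)
```

Note that no rounding of the weights is needed with this route.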

After propagation of cases and before clustering, you may want to add tiny random noise to the quantitative features, to break ties among identical cases. It will make the results of clustering less dependent on the order of cases in the dataset.

do repeat x= var1 var2 var3. /*list of quantitative variables
compute x= x+rv.uniform(0,0.00001). /*a noise value between 0 and, say, 0.00001
end repeat.
exec.

If you are working with an already built distance matrix (rather than the dataset), then propagate its rows/columns as many times as you need. In SPSS, you may use the handy matrix function !propag() for that - see my web-page.
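What propagating rows/columns of a distance matrix amounts to can be sketched in Python as follows. This is not the !propag() function, only an illustration under my own naming, with a made-up 3x3 matrix: each object's row and column are repeated according to its frequency, and copies of the same object end up at distance 0 from each other.

```python
def propagate_distance_matrix(dist, freqs):
    """Repeat row/column i of a square distance matrix freqs[i] times."""
    # index[p] is the original object that the p-th propagated object copies.
    index = [i for i, f in enumerate(freqs) for _ in range(f)]
    return [[dist[a][b] for b in index] for a in index]

# Made-up distances between three objects.
dist = [
    [0.0, 2.0, 5.0],
    [2.0, 0.0, 4.0],
    [5.0, 4.0, 0.0],
]
freqs = [2, 1, 3]

big = propagate_distance_matrix(dist, freqs)
for row in big:
    print(row)
```

The result is a 6x6 symmetric matrix that can be fed to any hierarchical clustering routine accepting a distance matrix.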

Approach 2. Use resumed agglomeration. Some implementations of hierarchical clustering (an example is my own SPSS macro for hierarchical clustering, found on my web-page) allow you to interrupt the agglomeration and save the distance matrix as it stands at that point; that matrix has an additional column with the within-cluster frequencies so far. The matrix can be used as input to "resume" the clustering. Now, the fact is that some methods of agglomeration, namely single (nearest neighbour), complete (farthest neighbour), between-group average (UPGMA), centroid and median, take no notice of the within-cluster density when they merge two clusters. Therefore, for these methods, resuming agglomeration is equivalent to doing agglomeration with initial frequency weights attached. So, if your program has the option to interrupt/resume agglomeration, you may use it, under the above methods, to "simulate" weighted input successfully, and you don't need to propagate rows/columns of the matrix.

Moreover, three methods - single, complete and median (WPGMC) - are known to ignore even the within-cluster frequencies when they merge two clusters. Therefore frequency weighting (either by approach 1 or approach 2) appears needless altogether for these methods. They are insensitive to it and will give the same classification of objects without weighting as with weighting. The only difference will be in how the dendrogram looks, because with weighting you have more objects to combine, and that should show up on the dendrogram.
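The insensitivity of single and complete linkage can be checked on a toy example. The following is a naive Python sketch (my own illustrative implementation on made-up 1-D data without tied distances, not what SPSS runs): it clusters the distinct objects, then the propagated objects, and compares the resulting partitions of distinct values.

```python
from itertools import combinations

def agglomerate(points, k, linkage=min):
    """Naive agglomerative clustering of 1-D points down to k clusters.

    linkage=min gives single linkage, linkage=max gives complete linkage.
    Returns the partition expressed through distinct values only.
    """
    clusters = [[p] for p in points]

    def cluster_distance(pair):
        a, b = pair
        return linkage(abs(x - y) for x in clusters[a] for y in clusters[b])

    while len(clusters) > k:
        # Merge the pair of clusters with the smallest linkage distance.
        i, j = min(combinations(range(len(clusters)), 2), key=cluster_distance)
        merged = clusters.pop(j)      # j > i, so index i stays valid
        clusters[i].extend(merged)
    return sorted(sorted(set(c)) for c in clusters)

values = [1.0, 2.0, 4.5, 9.0, 10.5]   # made-up distinct objects
freqs = [2, 1, 9, 5, 3]               # made-up frequency weights
propagated = [v for v, f in zip(values, freqs) for _ in range(f)]

for name, linkage in (("single", min), ("complete", max)):
    same = agglomerate(values, 2, linkage) == agglomerate(propagated, 2, linkage)
    print(name, "same partition with and without propagation:", same)
```

In the propagated run the identical copies merge first at distance 0, after which the agglomeration proceeds exactly as on the distinct objects.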

As for weighting cases in the K-means clustering procedure, SPSS allows it: the procedure respects the case weighting that is switched on. This is understandable: the K-means computation can easily and naturally incorporate integer or fractional weights while computing cluster means. Propagation of cases should give very similar results to clustering with weighting switched on.
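Why weights fit K-means so naturally can be seen in a small Python sketch. This is my own toy 1-D version of the usual assign-then-update iteration with weighted means, on made-up data and starting centres; I make no claim about how the SPSS procedure is implemented internally. With integer weights it gives the same centres as the unweighted run on propagated cases.

```python
def weighted_kmeans_1d(values, weights, centers, iters=20):
    """Toy 1-D K-means in which cluster means are weighted means."""
    centers = list(centers)
    for _ in range(iters):
        # Assignment step: each case goes to its nearest centre.
        labels = [min(range(len(centers)), key=lambda c: abs(x - centers[c]))
                  for x in values]
        # Update step: weighted mean of the cases assigned to each centre.
        for c in range(len(centers)):
            total = sum(w for w, lab in zip(weights, labels) if lab == c)
            if total > 0:
                centers[c] = sum(w * x for x, w, lab in zip(values, weights, labels)
                                 if lab == c) / total
    return centers

values = [1.0, 2.0, 4.5, 9.0, 10.5]   # made-up cases
freqs = [2, 1, 9, 5, 3]               # made-up frequency weights
start = [1.0, 10.5]                   # made-up initial centres

weighted = weighted_kmeans_1d(values, freqs, start)

# The same data with cases propagated and every weight set to 1.
propagated = [v for v, f in zip(values, freqs) for _ in range(f)]
unweighted = weighted_kmeans_1d(propagated, [1] * len(propagated), start)

print(weighted, unweighted)
```

With fractional weights rounded to frequencies, the two routes differ only by the rounding error, which is why the results are very similar rather than identical.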

Two-step cluster analysis in SPSS, like hierarchical clustering, doesn't support weighting of cases. So the solution there is the propagation of cases described above in approach 1.