Solved – Cluster analysis followed by regression

clustering, multiple regression

I have a dataset of about 1500 different hospitals and about 40 characteristics for each hospital (e.g. floor area, number of patients, type of hospital, age of building, etc.). I am interested in finding out which characteristics have the strongest impact on energy consumption in the hospital. The aim of the study is to suggest ways of reducing energy consumption in some of the hospitals.

My initial thought was to perform a cluster analysis to group hospitals according to some basic characteristics like type, floor area and number of patients. I could then do a regression analysis separately for each of the 3 or 4 clusters identified, to determine which of the remaining characteristics are most influential within each cluster. My reasoning is that certain characteristics of a hospital will definitely affect energy consumption but are also unchangeable (e.g. kicking out a bunch of patients may reduce energy consumption but would generally be frowned upon!).

Does this sound like a reasonable approach? Am I violating any statistical assumptions by first doing a cluster analysis and then following up with a regression on each cluster? I realise I could just do a regression in the first place, but I suspect that the effect of any of the less obvious variables would be lost in the presence of the main variables.

Best Answer

Your suggestion is close to multi-level regression.

For a fuller explanation, see for example:

http://assets.cambridge.org/97805218/67061/excerpt/9780521867061_excerpt.pdf

The gist is that the population (in your case, hospitals) is not homogeneous, but contains subgroups (levels) that can be identified. In practice, multi-level regression allows the model to differ from group to group and gives insight into how the groups differ.
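To illustrate why the grouping matters, here is a minimal sketch on made-up data. It is not a full multi-level model: it simply compares one pooled regression with separate regressions per group. A genuine multi-level fit would additionally share information between groups, for instance through a mixed-effects routine. All variable names and numbers (`age`, `energy`, the two groups, the slopes) are assumptions invented for the example.

```python
import numpy as np

def fit_ols(x, y):
    """Least-squares fit of y = a + b*x; returns (intercept, slope)."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef[0], coef[1]

rng = np.random.default_rng(0)

# Hypothetical data: two groups of hospitals with the same range of
# building ages but a different energy response to age.
n = 200
group = np.repeat([0, 1], n)
age = np.tile(np.linspace(0.0, 50.0, n), 2)
true_intercept = np.where(group == 0, 100.0, 300.0)
true_slope = np.where(group == 0, 1.0, 3.0)
energy = true_intercept + true_slope * age + rng.normal(0.0, 5.0, size=2 * n)

# One model for everybody: the slope is a blend of the two groups.
pooled = fit_ols(age, energy)

# One model per group: each slope reflects its own group.
per_group = {g: fit_ols(age[group == g], energy[group == g]) for g in (0, 1)}

print("pooled slope:", pooled[1])
for g, (intercept, slope) in per_group.items():
    print(f"group {g}: intercept={intercept:.1f}, slope={slope:.2f}")
```

The pooled slope lands between the two group slopes and describes neither group well, which is the situation a multi-level model is designed to handle.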

The difference is that in a multi-level model the groups are normally known in advance, whereas you would be forming them from a cluster analysis.
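The two-step procedure from the question (cluster on the basic characteristics, then regress within each cluster) could be sketched roughly as follows. This is only an illustration under stated assumptions: the data are synthetic, the clustering is a bare-bones Lloyd's k-means with two clusters and a hand-picked starting point, and names such as `floor_area`, `patients` and `building_age` are hypothetical stand-ins for the real columns.

```python
import numpy as np

def kmeans(X, init_idx, n_iter=50):
    """Plain Lloyd's k-means; init_idx selects the starting centroids.
    No handling of empty clusters, so it suits well-separated data only."""
    centroids = X[init_idx].copy()
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_points, k)
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        new = np.array([X[labels == j].mean(axis=0)
                        for j in range(len(centroids))])
        if np.allclose(new, centroids):
            break
        centroids = new
    return labels

rng = np.random.default_rng(1)
n = 150  # hospitals per (hypothetical) size class

# Basic, unchangeable characteristics used only for clustering
floor_area = np.concatenate([rng.normal(5_000, 500, n),
                             rng.normal(40_000, 3_000, n)])
patients = np.concatenate([rng.normal(100, 15, n),
                           rng.normal(800, 60, n)])

# A remaining characteristic whose effect differs between size classes
building_age = rng.uniform(0.0, 60.0, 2 * n)
base = np.concatenate([np.full(n, 200.0), np.full(n, 2_000.0)])
slope = np.concatenate([np.full(n, 2.0), np.full(n, 10.0)])
energy = base + slope * building_age + rng.normal(0.0, 10.0, 2 * n)

# Step 1: standardise the basic characteristics, then cluster
X = np.column_stack([floor_area, patients])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
init = [int(Z[:, 0].argmin()), int(Z[:, 0].argmax())]  # simple deterministic start
labels = kmeans(Z, init)

# Step 2: separate regression of energy on building age in each cluster
slopes = {}
for j in np.unique(labels):
    mask = labels == j
    b, a = np.polyfit(building_age[mask], energy[mask], 1)  # slope, intercept
    slopes[int(j)] = b
    print(f"cluster {j}: n={mask.sum()}, slope of energy on age={b:.2f}")
```

With real data the clusters will not be this cleanly separated, and the cluster assignments are themselves estimates, so the per-cluster regression results should be read with that extra uncertainty in mind.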
