I have a dataset of about 1500 different hospitals and about 40 characteristics for each hospital (e.g. floor area, number of patients, type of hospital, age of building, etc.). I am interested in finding out which characteristics have the strongest impact on energy consumption in the hospital. The aim of the study is to suggest ways of reducing energy consumption in some of the hospitals.

My initial thought was to perform a cluster analysis to group hospitals according to some basic characteristics like type, floor area and number of patients. I could then do a regression analysis separately for each of the 3 or 4 clusters identified, to determine which of the remaining characteristics are most influential for each cluster. My reasoning is that certain characteristics of a hospital will definitely impact energy consumption but are also unchangeable (e.g. kicking out a bunch of patients may reduce energy consumption but would generally be frowned upon!).

Does this sound like a reasonable approach? Am I violating any statistical assumptions by first doing a cluster analysis and then following up with a regression on each cluster? I realise I could just do a regression in the first place, but I suspect that the effect of any of the less obvious variables will be lost in the presence of the main variables.
Solved – Cluster analysis followed by regression
clustering, multiple regression
Related Solutions
A perspective from Gelman & Hill's Data Analysis Using Regression and Multilevel/Hierarchical Models may point in a helpful direction. G&H entirely eschew the terms 'fixed' and 'random', arguing that these are misleading (see pp. 245-6). Instead, they emphasize describing the model itself, in terms of the assumptions it embodies, as they relate to our understanding of real-world phenomena.
Turning attention to that latter principle, I would offer the following direct response to the question about separate models. If your 7 hospitals were: (1) a maternity hospital with a neonatal intensive care unit, (2) an academic children's hospital, (3) a small-animal veterinary hospital, (4) a veterinary unit of an aquarium, (5) a specialty cancer care hospital, (6) a rural community hospital and (7) an academic medical center, then clearly separate models are indicated! However, to the extent that you think your analytical aims (exploratory? testing theory-based hypotheses?) will be served by sharing information between the separate hospitals, unifying them in a hierarchical model makes sense. So, if your 7 hospitals seem similar enough with respect to the mental picture underlying your approach to your scientific questions, then you may well want to share information between your 7 models. Hierarchical modeling can be understood as a way to connect the 7 hospital models to achieve that information sharing.
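To make "sharing information" a little more concrete, here is a minimal sketch of partial pooling for a single quantity (a per-hospital mean). Everything in it is an assumption for illustration: the readings are made up, and the within-hospital and between-hospital standard deviations are simply fixed by hand, whereas a real hierarchical model would estimate them from the data.

```python
# Hypothetical energy-intensity readings (kWh per m^2) for three hospitals.
readings = {
    "A": [210, 190, 205, 195, 200, 200, 198, 202],
    "B": [260, 240],
    "C": [150, 160, 155, 145],
}

# Assumed (not estimated) standard deviations:
# variation within a hospital and variation between hospitals.
SIGMA_WITHIN = 20.0
SIGMA_BETWEEN = 30.0

all_values = [v for values in readings.values() for v in values]
grand_mean = sum(all_values) / len(all_values)


def partial_pool(values):
    """Precision-weighted average of a hospital's own mean and the grand mean."""
    n = len(values)
    own_mean = sum(values) / n
    data_precision = n / SIGMA_WITHIN ** 2    # grows with the number of readings
    group_precision = 1 / SIGMA_BETWEEN ** 2  # how tightly hospitals cluster together
    weight = data_precision / (data_precision + group_precision)
    return weight, weight * own_mean + (1 - weight) * grand_mean


weights, pooled = {}, {}
for name, values in readings.items():
    weights[name], pooled[name] = partial_pool(values)
    own_mean = sum(values) / len(values)
    print(f"{name}: own mean {own_mean:.1f}, "
          f"partially pooled {pooled[name]:.1f}, "
          f"weight on own data {weights[name]:.2f}")
```

Hospital B has only two readings, so its estimate is pulled furthest towards the grand mean. Fitting fully separate models corresponds to a weight of 1 for every hospital, and a single pooled model corresponds to a weight of 0.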
It should be noted that a ready analogy presents itself here to the question about whether to discretize a continuous variable. If you use an infant/child/adult categorical variable in a model, you lose the opportunity to share information about 17 year-old patients with information about 22 year-old patients, which you might accomplish with splines on a continuous age variable. But of course, in an analysis of the patients (mothers, neonates) at a maternity hospital, there are limited opportunities to share information in this way between two clearly distinct categories of patient.
I think the answer may be fairly simple. Let's say you have 10 physical variables and 10 demographic variables, and you can include all 20 variables in your model without running into any multicollinearity or statistical significance issues (all variables are statistically significant). In such a situation, the order of your variables makes no difference, since you are able to include them all. However, such a situation may be highly unlikely.
You are more likely to run into issues of statistical significance and multicollinearity. Those issues will force you to remove, or not select, some of the variables of either type. In such a situation the order will have a material impact not only on which variables are selected, but also on both their regression coefficients and their standardized coefficients. In other words, the order affects everything the minute you deal with a model that does not include all the variables, or compare similar models that do not have an identical variable selection. But if your model includes all 20 just fine, whether you start selecting them from 1 to 20 or from 20 to 1 makes no difference.
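A small illustration of why the selected set matters when predictors are correlated. This is only a sketch on made-up numbers: `x1` and `x2` are two hypothetical, nearly collinear predictors, and `y` is constructed to be exactly `2*x1 + 3*x2`, so the "true" coefficients are known.

```python
def centered(values):
    """Subtract the mean, so an intercept is implicitly fitted."""
    m = sum(values) / len(values)
    return [v - m for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def slope_one(x, y):
    """OLS slope of y on a single predictor (with intercept)."""
    xc, yc = centered(x), centered(y)
    return dot(xc, yc) / dot(xc, xc)


def slopes_two(x1, x2, y):
    """OLS slopes of y on two predictors (with intercept), via the 2x2 normal equations."""
    a, b, yc = centered(x1), centered(x2), centered(y)
    s11, s12, s22 = dot(a, a), dot(a, b), dot(b, b)
    s1y, s2y = dot(a, yc), dot(b, yc)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


# Hypothetical, strongly correlated predictors and a response built from both.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.1, 3.9, 5.1, 5.9]
y = [2 * a + 3 * b for a, b in zip(x1, x2)]

b1_alone = slope_one(x1, y)              # x2 left out of the model
b1_joint, b2_joint = slopes_two(x1, x2, y)  # both predictors included

print(f"x1 only:   b1 = {b1_alone:.3f}")
print(f"x1 and x2: b1 = {b1_joint:.3f}, b2 = {b2_joint:.3f}")
```

With both predictors in the model, the coefficients come out as 2 and 3. With `x2` dropped, the coefficient on `x1` rises to roughly 4.95, because `x1` absorbs most of the effect of the variable that was left out.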
Best Answer
Your suggestion is close to multi-level regression.
For a fuller explanation, see for example:
http://assets.cambridge.org/97805218/67061/excerpt/9780521867061_excerpt.pdf
The gist is that the population (in your case hospitals) is not homogeneous, but that there are subgroups (levels) that can be identified. Multi-level regression in practice allows for different models per group, and insight into the difference between groups.
The difference is that you will be forming the groups based on a cluster analysis.
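For reference, the two-stage procedure described in the question (cluster first, then regress within each cluster) can be sketched as below. This is not a multi-level model, just the simpler approach it is being compared to. The data are synthetic, and "ventilation hours" is a hypothetical changeable characteristic invented for the example; the sketch uses only the Python standard library, with a hand-written one-dimensional k-means and a simple one-predictor regression.

```python
import random

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Synthetic hospitals: two size groups, and a hypothetical changeable
# characteristic (daily ventilation hours) whose effect differs by group.
TRUE_SLOPE = {"small": 2.0, "large": 6.0}
AREA_RANGE = {"small": (4000, 6000), "large": (40000, 60000)}

hospitals = []  # (floor area, ventilation hours, energy intensity)
for size, (lo, hi) in AREA_RANGE.items():
    for _ in range(100):
        area = rng.uniform(lo, hi)
        hours = rng.uniform(8, 24)
        intensity = 100 + TRUE_SLOPE[size] * hours + rng.gauss(0, 5)
        hospitals.append((area, hours, intensity))


def kmeans_1d(values, k, iters=20):
    """Lloyd's algorithm on one variable; returns a cluster label per value."""
    srt = sorted(values)
    # Start the centroids at evenly spaced quantiles (deterministic).
    centroids = [srt[(2 * i + 1) * len(srt) // (2 * k)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels


def ols_slope(x, y):
    """OLS slope of y on a single predictor (with intercept)."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Stage 1: cluster on a basic, unchangeable characteristic (floor area).
labels = kmeans_1d([h[0] for h in hospitals], k=2)

# Stage 2: within each cluster, regress energy on the changeable characteristic.
slopes = {}
for c in sorted(set(labels)):
    hours = [h[1] for h, lab in zip(hospitals, labels) if lab == c]
    intensity = [h[2] for h, lab in zip(hospitals, labels) if lab == c]
    slopes[c] = ols_slope(hours, intensity)
    print(f"cluster {c}: {len(hours)} hospitals, slope {slopes[c]:.2f}")
```

Each cluster gets its own, completely separate slope here. A multi-level regression would instead estimate the group-specific slopes jointly, letting the groups inform one another.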