Solved – Analysis of hierarchical clustered hospital data

clustering, hierarchical clustering, regression

I am hoping to get some advice from this excellent community on how I might proceed with an analysis of patient outcomes for a large conglomerate of hospitals. The dataset has one record/row per hospital visit (from admission to discharge) for many different patients at several different hospitals. Patients can appear in the dataset multiple times (once for each hospital visit), but no patient moves between hospitals. Visits occur at irregular intervals for each patient (patients come into the hospital as needed), and at each visit a number of measurements are made that may covary with X and Y and should be controlled for. I realize that I am facing a correlated/clustered data problem and cannot just proceed with a traditional regression model without taking into account the correlations and the hierarchical, clustered nature of the data with repeated measurements.

At first, I thought I would simply make hospital and patient each random effects to account for the clustered nature of the data. However, one of the things I'd like to study with a regression model is how moving from hospital A to hospital B impacts (i.e. increases or decreases) the expected outcome Y. Given that, I would think that I'd need to make hospital a fixed effect, but if I do that, how do I account for the correlation within hospital?

One thought I had to handle the above is just to fit a separate model for each hospital. Is there a benefit or drawback to doing so?

Is there any other way to do this?

Thanks in advance for your help!


Added in response to Bill's questions/comments:

Thanks Bill. I appreciate the response. I somewhat oversimplified the actual situation in order to encourage more responses. Essentially, I'm trying to take the clustered or longitudinal aspects of my data into account. I have a few levels of clusters going on here:

1. Hospitals. Each hospital has different policies that might conceivably result in more positive or negative outcomes, or there may be hospitals in areas where patients are generally in poorer health (e.g. they may serve older populations).

2. Doctors within a hospital may adhere to hospital policies, so there may be correlations among patients of the same doctor. I would also expect skilled doctors to have patients with better outcomes.

3. Patient observations would obviously be correlated, as responses from the same patient will likely be more highly correlated than responses from different patients.

Essentially, a consequence of the clustering is that measurements on units within any given cluster (hospitals/doctors/patients) are more similar than measurements on units in different clusters. From what I've been reading, it seems like the way to go about this is to make hospitals, doctors, and patients random effects to account for the clustering. If this is the right approach, then it seems I can't make comparisons of the mean responses between hospitals (or any cluster) if I treat them as random effects, right?
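For concreteness, the three levels of clustering described above might be written as a random-intercept model along the following lines. This is only a sketch: the symbols are notation assumed here for illustration ($y_i$ is the outcome at visit $i$, $x_i$ the visit-level covariates, and $h[i]$, $d[i]$, $p[i]$ the hospital, doctor, and patient attached to visit $i$), and normally distributed intercepts are an assumption rather than something the data have confirmed.

$$
\begin{aligned}
y_i &= \beta_0 + x_i^{\top}\beta + u_{h[i]} + v_{d[i]} + w_{p[i]} + \varepsilon_i, \\
u_h &\sim \mathcal{N}(0, \sigma_u^2), \quad
v_d \sim \mathcal{N}(0, \sigma_v^2), \quad
w_p \sim \mathcal{N}(0, \sigma_w^2), \quad
\varepsilon_i \sim \mathcal{N}(0, \sigma_\varepsilon^2).
\end{aligned}
$$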

Best Answer

A perspective from Gelman & Hill's Data Analysis Using Regression and Multilevel/Hierarchical Models may point in a helpful direction. G&H entirely eschew the terms 'fixed' and 'random', arguing that these are misleading (see pp. 245-6). Instead, they emphasize describing the model itself, in terms of the assumptions it embodies, as they relate to our understanding of real-world phenomena.

Turning attention to that latter principle, I would offer the following direct response to the question about separate models. If your 7 hospitals were: (1) a maternity hospital with a neonatal intensive care unit, (2) an academic children's hospital, (3) a small-animal veterinary hospital, (4) a veterinary unit of an aquarium, (5) a specialty cancer care hospital, (6) a rural community hospital and (7) an academic medical center, then clearly separate models are indicated! However, to the extent that you think your analytical aims (exploratory? testing theory-based hypotheses?) will be served by sharing information between the separate hospitals, unifying them in a hierarchical model makes sense. So, if your 7 hospitals seem similar enough with respect to the mental picture underlying your approach to your scientific questions, then you may well want to share information between your 7 models. Hierarchical modeling can be understood as a way to connect the 7 hospital models to achieve that information sharing.
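To make "sharing information" a little more tangible, here is a minimal sketch of partial pooling for hospital means. It uses the precision-weighted average that Gelman & Hill give as an approximation to the multilevel estimate of a group mean. Everything concrete in it is an assumption for illustration: the hospital names, the outcome values, and the two variances (`sigma_y2` within hospitals, `sigma_a2` between hospitals), which are treated as known here although a real multilevel fit would estimate them from the data.

```python
from statistics import fmean

# Hypothetical outcomes per hospital; values invented for illustration only.
outcomes = {
    "A": [10, 12, 11, 13, 9, 11, 12, 10],  # large hospital
    "B": [15, 17],                          # small hospital
    "C": [8, 7, 9, 8],
}

def partial_pool(groups, sigma_y2, sigma_a2):
    """Precision-weighted compromise between each group mean and the grand mean.

    sigma_y2: assumed within-hospital variance of the outcome.
    sigma_a2: assumed between-hospital variance of the hospital means.
    """
    grand = fmean([y for ys in groups.values() for y in ys])
    pooled = {}
    for name, ys in groups.items():
        w_group = len(ys) / sigma_y2   # precision of this hospital's own mean
        w_all = 1.0 / sigma_a2         # precision contributed by the other hospitals
        pooled[name] = (w_group * fmean(ys) + w_all * grand) / (w_group + w_all)
    return grand, pooled

grand_mean, pooled_means = partial_pool(outcomes, sigma_y2=4.0, sigma_a2=4.0)
separate_means = {name: fmean(ys) for name, ys in outcomes.items()}

for name in outcomes:
    print(name, round(separate_means[name], 2), round(pooled_means[name], 2))
```

Fitting separate models corresponds to `separate_means` (no sharing), and a single model that ignores hospital corresponds to `grand_mean` (complete sharing). The partially pooled values sit in between, with the small hospital pulled further toward the grand mean than the large one because its own mean is less precisely determined.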

A ready analogy here is the question of whether to discretize a continuous variable. If you use an infant/child/adult categorical variable in a model, you lose the opportunity to share information about 17-year-old patients with information about 22-year-old patients, which you might accomplish with splines on a continuous age variable. But of course, in an analysis of the patients (mothers, neonates) at a maternity hospital, there are limited opportunities to share information in this way between two clearly distinct categories of patient.
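The age analogy can be sketched in a few lines. The data, the cut points in `age_bin`, and the function names below are all hypothetical, and a plain least-squares line is used as the simplest stand-in for the splines mentioned above, so this illustrates the idea rather than a recommended model.

```python
from statistics import fmean

# Hypothetical (age, outcome) pairs; invented for illustration only.
data = [(1, 0.8), (10, 2.1), (17, 3.0), (22, 3.4), (40, 5.2), (65, 7.9)]

def age_bin(age):
    """Hypothetical cut points: infant < 2, child < 18, adult otherwise."""
    if age < 2:
        return "infant"
    return "child" if age < 18 else "adult"

def bin_mean_predict(age):
    """Categorical model: predict the mean outcome of patients in the same bin."""
    same_bin = [y for a, y in data if age_bin(a) == age_bin(age)]
    return fmean(same_bin)

def linear_predict(age):
    """Continuous model: least-squares line fitted to all patients at once."""
    ages = [a for a, _ in data]
    a_bar = fmean(ages)
    y_bar = fmean([y for _, y in data])
    sxy = sum((a - a_bar) * (y - y_bar) for a, y in data)
    sxx = sum((a - a_bar) ** 2 for a in ages)
    return y_bar + (sxy / sxx) * (age - a_bar)

gap_binned = abs(bin_mean_predict(22) - bin_mean_predict(17))
gap_linear = abs(linear_predict(22) - linear_predict(17))
print(round(gap_binned, 2), round(gap_linear, 2))
```

With binning, the prediction for a 17-year-old draws only on the "child" rows and the prediction for a 22-year-old only on the "adult" rows, so the two can differ sharply. With age kept continuous, both predictions come from one fit to all patients, so the two nearby ages inform each other.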