Does the 'reference group' in a Cox proportional hazards model have to exist?

cox-model, survival

My understanding is that a hazard ratio from a Cox proportional hazards model compares the effect on the hazard rate of a given factor to a reference group. Does that actual group have to exist in the data?

Say we enroll people in a study of how long it takes before they buy a couch. We right-censor at 3 years. For this example we have two factors: age (< 30 or >= 30) and whether they own a cat. It turns out the hazard ratio of "owns cat" relative to the reference group (age < 30, "doesn't own cat") is 1.2, and it is significant (say p < 0.05). Do I know that the data includes people with age < 30 who don't own a cat?

For this example it seems a little crazy, but if there are many factors in the model, it becomes easier to imagine that there might be no data for some combination of factors, and that might end up as the reference group if chosen arbitrarily. In fact I think I've had that happen (in R, if it matters).

Best Answer

Most regression models can smooth over areas where there are no data. This is not necessarily a feature: if there is a reason those observations don't exist in your data (notably, that the combination cannot occur), then your model is spitting out a nonsense result for it.
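
To make the smoothing concrete for the couch example, here is how a main-effects Cox model would be written. This assumes the model has no age-by-cat interaction term, which the question does not specify; the symbols $x_{\text{age}}$ (1 if age >= 30, else 0) and $x_{\text{cat}}$ (1 if the person owns a cat, else 0) are introduced here for illustration.

```latex
h(t \mid x_{\text{age}}, x_{\text{cat}})
  = h_0(t)\,\exp\!\left(\beta_{\text{age}}\,x_{\text{age}} + \beta_{\text{cat}}\,x_{\text{cat}}\right),
\qquad
\frac{h(t \mid x_{\text{age}},\, 1)}{h(t \mid x_{\text{age}},\, 0)}
  = \exp(\beta_{\text{cat}})
  \quad \text{for } x_{\text{age}} \in \{0, 1\}.
```

The reference group is the one with $x_{\text{age}} = x_{\text{cat}} = 0$, whose hazard is the baseline $h_0(t)$. Under this assumed form the "owns cat" hazard ratio $\exp(\beta_{\text{cat}})$ is the same at both age levels, so in principle it can be estimated entirely from people aged >= 30. A reported ratio of 1.2 therefore does not by itself show that anyone under 30 without a cat is in the data.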

So yes, a category can have an estimate despite the combination of covariates not actually existing in your data, but you should approach these circumstances with extreme caution.
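
One way to see this is a small rank check on the dummy-coded design. It is a minimal sketch, not what any particular package does internally. It assumes the couch example's two binary factors, a hypothetical data set in which the reference cell (age < 30, no cat) never occurs, and that the baseline hazard absorbs any constant shift in the linear predictor, so a column of ones is appended before checking rank. Full column rank is a necessary condition for unique coefficient estimates, not a sufficient one.

```python
from fractions import Fraction


def rank(rows):
    """Rank of a small matrix via Gaussian elimination with exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    r = 0
    ncols = len(m[0]) if m else 0
    for c in range(ncols):
        # Find a row at or below position r with a nonzero entry in column c.
        pivot = next((i for i in range(r, len(m)) if m[i][c] != 0), None)
        if pivot is None:
            continue
        m[r], m[pivot] = m[pivot], m[r]
        # Clear column c in every other row.
        for i in range(len(m)):
            if i != r and m[i][c] != 0:
                f = m[i][c] / m[r][c]
                m[i] = [a - f * b for a, b in zip(m[i], m[r])]
        r += 1
        if r == len(m):
            break
    return r


# Hypothetical covariate patterns present in the data: (age >= 30, owns cat).
# The reference cell (0, 0), i.e. under 30 with no cat, is deliberately absent.
observed_cells = [(0, 1), (1, 0), (1, 1)]


def design(cells, interaction):
    """One row per distinct covariate pattern; the leading 1 stands in for
    the baseline hazard, which absorbs constant shifts."""
    rows = []
    for age, cat in cells:
        row = [1, age, cat]
        if interaction:
            row.append(age * cat)
        rows.append(row)
    return rows


main_effects = design(observed_cells, interaction=False)
with_interaction = design(observed_cells, interaction=True)

# Coefficients can only be uniquely determined if the columns are independent.
main_identifiable = rank(main_effects) == len(main_effects[0])
interaction_identifiable = rank(with_interaction) == len(with_interaction[0])

print("main effects identifiable:", main_identifiable)
print("with interaction identifiable:", interaction_identifiable)
```

In this sketch the main-effects design keeps full column rank even though nobody sits in the reference cell, so both coefficients can still be pinned down, and the model's value for that cell is pure extrapolation from the additive structure. Once an interaction term is added, the three observed cells can no longer separate all the terms, which is the point at which the missing cell starts to matter for estimation.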
