Regression Model Covariates – Including ZIP Code in Regression Analysis

categorical-data, cox-model, many-categories, regression

Background

I have a dataset representing a large group of people that I'm using to specify a Cox proportional hazards model of a binary outcome on some explanatory variables. My outcome variable is a health condition of interest (coded 1/0), and my "focal" explanatory variable, the one I'm most interested in, is a binary indicator of whether a person has received a certain treatment. The other explanatory variables are the usual sociodemographic suspects: age, sex, and a couple of others.

In planning my model, I'd like, if possible, to find some way to also control for subjects' geography when estimating the effect of my exposure variable (I have reason to think it could matter). At first glance I've got a handy way to do this: a variable 3_digit_zip_code representing (you guessed it) the first 3 digits of a subject's US postal code.

Question / problem

The immediate objection to including ZIP in the model, of course, is that a hazard ratio (HR) for ZIP code would be uninterpretable: what could "a one digit increase" in ZIP code mean in practical terms?

But then I think of an objection to the objection: wouldn't including ZIP still help the interpretability of the other covariates in the model, above all the main one, treatment? In other words, the HR for treatment would then say something useful about the hazard of the outcome in the treated versus the untreated, given the same levels of the other covariates, ZIP included. Right?

I suppose I have two questions, then:

  1. Is it worth it to include ZIP in such a Cox model, even if its HR is uninterpretable, if it adds to "control" in other HRs?

  2. Is there a better way of controlling for geography, e.g. matching of some kind?

Best Answer

Modeling outcome-relevant geographic and demographic covariates directly, as @AdamO suggests, is the best way to go. Another answer suggests ways to start getting such information.

The 3-digit ZIP code poses two problems.

First, without care the numbers might be interpreted as a numeric variable and, as you state, "a one digit increase" in ZIP code would be meaningless. If you are to go with ZIP code as a predictor, you must ensure that it is encoded as a multi-level categorical predictor. That could allow you to use ZIP codes as fixed effects, or as cluster() or frailty/random-effect terms to account for correlations within ZIP codes.
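To make the encoding point concrete, here is a minimal sketch in plain Python of what "multi-level categorical" means for the design matrix. The helper name `zip_dummies`, the toy ZIP prefixes, and the choice of the most common level as the reference are all assumptions made for illustration, not part of your data or of any particular library. Most survival software does this step for you (in R, for example, wrapping the variable in factor()), so treat this as a picture of the idea rather than a recipe.

```python
from collections import Counter

def zip_dummies(zips, reference=None):
    """One-hot encode ZIP prefixes (hypothetical helper), dropping one reference level."""
    # Keep ZIPs as text: "021" must stay "021" and never become the number 21.
    levels = sorted(set(zips))
    if reference is None:
        # Illustrative assumption: use the most common level as the reference category.
        reference = Counter(zips).most_common(1)[0][0]
    kept = [lvl for lvl in levels if lvl != reference]
    # One 0/1 indicator column per non-reference level.
    rows = [[1 if z == lvl else 0 for lvl in kept] for z in zips]
    return kept, rows

# Toy 3-digit prefixes, made up for this example.
zips = ["021", "100", "021", "606", "100", "021"]
columns, design = zip_dummies(zips)

for z, row in zip(zips, design):
    print(z, dict(zip(columns, row)))
```

Each indicator column would get its own coefficient in a fixed-effects fit, so the hazard ratio for a given ZIP level is relative to the reference level, and nothing in the model depends on the numeric size of the code. With many ZIP levels this adds many parameters, which is one reason the cluster() or frailty/random-effect options can be more practical.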

Second, as @AdamO notes in a comment, "The first 3 digits of a zip code doesn't mean anything." The area covered by a single 3-digit ZIP prefix near where I live includes both wide-open suburban and dense urban localities, areas with some of the highest and lowest average wealth in the state, widely different education levels, and differences in access to transportation and to health-care facilities. If you can't get more detailed geographic and demographic data (e.g., from "ZIP+4" values combined with Census data), at least use the full 5-digit ZIP code.
