Solved – Correlated variables in Cox model – which one is best

cox-model multicollinearity

I am building a Cox model containing around 8 variables. Two of the variables are different measures of the same thing and are consequently correlated with each other. When included in separate models, both show a strong association with survival.

I have read conflicting opinions regarding the inclusion of correlated variables within the same model. Is it acceptable to use correlated variables in the same model?

When they are used in the same model, one variable remains significantly associated with the outcome, while the association for the second loses significance. Does this tell me anything about these variables? Is the former variable a better predictor of survival than the latter?

Many thanks for any advice.

Best Answer

There is no rule against including correlated predictors in a Cox or a standard multiple regression. In practice it is almost inevitable, particularly in clinical work where there can be multiple standard measures of the severity of disease.

Determining which of 2 "measures of the same thing" is better, however, is difficult. When you have 2 predictors essentially measuring the same thing, the particular predictor that seems to work the best may depend heavily on the particular sample of data that you have on hand.

Bootstrapping as suggested by @smndpln can help show the difficulty. If you run a model including both predictors on multiple bootstrap samples, you might well find that only 1 of the 2 is "significant" in any one bootstrap, but the particular predictor found "significant" is likely to vary from bootstrap to bootstrap. This is an inherent problem with highly correlated predictors, whether in Cox regression or standard multiple regression.

You could try LASSO to see whether either or both of the predictors is retained in a final model that minimizes cross-validation error, but the particular predictor retained is also likely to vary among bootstrap samples.

You could try comparing nested models. Run the Cox regression first with the standard predictor, then see whether adding your novel predictor adds significant information with anova() in R or a similar function in other software. Then reverse the order, starting with your novel predictor and seeing whether adding the standard predictor adds anything. But if the 2 predictors are highly correlated, it's unlikely that either will add to what's already provided by the other.

You could also compare the 2 models differing only in which of the 2 predictors is included with the Akaike Information Criterion (AIC). This can show which model is "better" on a particular sample. There are, however, no statistical tests to show how big a difference in AIC is "significant." I suppose you could do this comparison on multiple bootstrap samples to get some measure of "significance," but even then you may be unlikely to find a "significant" difference unless your novel predictor is substantially better than the standard predictor that already measures the same thing. And I would worry about whether any differences you find would necessarily hold in other data samples.

Finally, you might consider proposing a model that includes both measures of the phenomenon in question. For prediction, your model need not be restricted to independent variables that are "significant" by some arbitrary test (unless you have so many predictors that you are in danger of over-fitting). Or you could use ridge regression, which can handle correlated predictors fairly well and minimizes the danger of over-fitting.