Solved – Dummy variables and intercept in Cox regression

cox-model, prediction, survival

I am working with the Cox proportional hazards model, where the covariates include 2 categorical variables. Assume each categorical variable has 3 levels, so I model them with dummy variables.

Levels 1, 2 and 3 of category A correspond to the dummy variables $A_1$, $A_2$ and $A_3$, respectively.

Similarly, category B has the dummy variables $B_1$, $B_2$ and $B_3$.

To avoid linearly dependent covariates, the model is represented as follows:
$\text{Intercept} + A_2 + A_3 + B_2 + B_3$.

However, the intercept is "included" in the baseline hazard, so the final model for the Cox regression is just
$A_2 + A_3 + B_2 + B_3$.
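Written out in full, this is a sketch of the standard Cox form; the coefficient symbols $\beta$ and the baseline hazard $h_0(t)$ are notation introduced here for illustration and do not appear in the original post:

$h(t \mid A, B) = h_0(t)\,\exp(\beta_{A_2} A_2 + \beta_{A_3} A_3 + \beta_{B_2} B_2 + \beta_{B_3} B_3)$

An intercept $\beta_0$ would only multiply the unspecified $h_0(t)$ by the constant $e^{\beta_0}$, so it cannot be estimated separately from the baseline hazard. With all four dummies equal to zero, the exponential term is 1 and the hazard reduces to $h_0(t)$.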

Now, if I wanted to predict the survival curve of a subject in $A_1$ and $B_1$, would this correspond to the baseline survival curve?

Working in R, everything looks fine and most predicted survival curves fit the data well (as estimated by Kaplan-Meier), except for the "$A_1$, $B_1$" cohort (and also "$A_2$", it seems). And it is not just that the fit is poor; it's clearly suboptimal. The curve is shifted upwards or downwards, so increasing or decreasing the corresponding coefficients would clearly result in a better fit.

Best Answer

I'm reluctant to code dummy variables myself, as you seem to have done, when R can do it for me; it's too easy to make an unexpected error. So rule out a coding problem first.

In your case, with categorical variables A and B, set them as factor variables with 3 levels each, and make sure the baseline level for each (A1 or B1) is set as the reference level of the factor (e.g., relevel(A, ref = "A1")). Then your survival analysis takes the general form:

coxph(Surv(daysSurvival, status) ~ A + B + ..., data = yourData, ...)

The baseline model returned will be for the A1 and B1 reference values, and the coefficients for the other levels of A and B are log hazard ratios relative to the A1/B1 cases.
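As a minimal sketch of this setup, the following fits the model on simulated data and requests the predicted curve for the A1/B1 combination. The data, the effect sizes and the object names (d, fit, ref, sf_ref) are assumptions made up for illustration; only the column names daysSurvival, status, A and B follow the formula above. Passing newdata explicitly avoids relying on whatever covariate values survfit uses by default.

```r
library(survival)

# Simulated data, purely for illustration (not the asker's data)
set.seed(1)
n <- 300
d <- data.frame(
  A = sample(c("A1", "A2", "A3"), n, replace = TRUE),
  B = sample(c("B1", "B2", "B3"), n, replace = TRUE)
)

# Assumed "true" log hazard ratios relative to A1 and B1
lp <- 0.5 * (d$A == "A2") + 1.0 * (d$A == "A3") -
  0.4 * (d$B == "B2") + 0.3 * (d$B == "B3")
event_time  <- rexp(n, rate = 0.01 * exp(lp))
censor_time <- rexp(n, rate = 0.005)
d$daysSurvival <- pmin(event_time, censor_time)
d$status <- as.integer(event_time <= censor_time)

# Let R build the dummy variables: factors with explicit reference levels
d$A <- relevel(factor(d$A), ref = "A1")
d$B <- relevel(factor(d$B), ref = "B1")

fit <- coxph(Surv(daysSurvival, status) ~ A + B, data = d)
print(coef(fit))       # log hazard ratios vs. the A1 / B1 reference
print(exp(coef(fit)))  # the same effects as hazard ratios

# Predicted survival curve for a subject in A1 and B1
ref <- data.frame(
  A = factor("A1", levels = levels(d$A)),
  B = factor("B1", levels = levels(d$B))
)
sf_ref <- survfit(fit, newdata = ref)
```

The coefficient names (AA2, AA3, BB2, BB3) confirm which levels R dropped as references, which is a quick way to rule out a coding problem.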

At that point, if your model doesn't fit the data well, you will know that the model, rather than the coding, is inadequate. Most important, use the tools available for coxph models to check the validity of the proportional-hazards assumption.
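One such tool in the survival package is cox.zph, which tests proportional hazards using scaled Schoenfeld residuals. The sketch below again uses simulated data with made-up effect sizes and object names (d, fit, zph, km, mod), and also overlays the Kaplan-Meier estimate for the A1/B1 cohort on the model-based curve for the same covariate values.

```r
library(survival)

# Simulated data, purely for illustration (not the asker's data)
set.seed(2)
n <- 300
d <- data.frame(
  A = factor(sample(c("A1", "A2", "A3"), n, replace = TRUE)),
  B = factor(sample(c("B1", "B2", "B3"), n, replace = TRUE))
)
lp <- 0.5 * (d$A == "A3") + 0.4 * (d$B == "B2")  # assumed effects
event_time  <- rexp(n, rate = 0.01 * exp(lp))
censor_time <- rexp(n, rate = 0.005)
d$daysSurvival <- pmin(event_time, censor_time)
d$status <- as.integer(event_time <= censor_time)

# factor() sorts levels alphabetically, so A1 and B1 are the references
fit <- coxph(Surv(daysSurvival, status) ~ A + B, data = d)

# Proportional-hazards check based on scaled Schoenfeld residuals
zph <- cox.zph(fit)
print(zph)    # small p-values suggest non-proportional hazards
# plot(zph)   # a visible trend over time is a warning sign

# Kaplan-Meier estimate for the A1/B1 cohort vs. the model-based curve
cohort <- subset(d, A == "A1" & B == "B1")
km <- survfit(Surv(daysSurvival, status) ~ 1, data = cohort)
mod <- survfit(fit, newdata = data.frame(
  A = factor("A1", levels = levels(d$A)),
  B = factor("B1", levels = levels(d$B))
))
plot(km, conf.int = FALSE, xlab = "Days", ylab = "Survival probability")
lines(mod, col = "red", conf.int = FALSE)
```

If the two curves cross, or diverge systematically over time, that points toward a violation of proportional hazards or a missing interaction between A and B, rather than toward a coding error.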