Feature Scaling – Scaling Categorical and Ordinal Variables in Cox Regression

categorical-encoding, cox-model, feature-scaling, ordinal-data, survival

I have a dataset with nominal (unorderable categories), ordinal (orderable categories), and continuous/numerical variables. I am performing Cox Proportional Hazard Regression using the scikit-survival package in Python.

I have one-hot encoded the nominal variables (values are 0's and 1's). I have ordinal encoded the ordinal variables (values range from 0 to 9). I have standard scaled my numerical variables (values with mean = 0, standard deviation = 1).
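Roughly, the preprocessing looks like this (the column names are placeholders for my actual features):

```python
# Sketch of the preprocessing described above; column names are placeholders.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler

nominal_cols = ["nominal_a", "nominal_b"]   # unordered categories -> 0/1 dummies
ordinal_cols = ["ordinal_a"]                # ordered categories -> integers 0..9
numeric_cols = ["numeric_a", "numeric_b"]   # continuous -> mean 0, sd 1

preprocess = ColumnTransformer([
    ("nominal", OneHotEncoder(handle_unknown="ignore"), nominal_cols),
    ("ordinal", OrdinalEncoder(), ordinal_cols),
    ("numeric", StandardScaler(), numeric_cols),
])

X = preprocess.fit_transform(df)  # df is the raw feature DataFrame
```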

Do I need to standard scale my nominal and ordinal variables after I have encoded them?

I have found responses to related issues:

In ref 1, it seems that feature scaling is required. However, ref 2 says that linear and logistic regression models do not require feature scaling unless there is regularization, so I expect the same to hold for CoxPH regression.

Best Answer

There can be numerical problems or slow convergence in fitting a Cox model, as there is an exponentiation of the working value of the linear predictor at each step toward the solution.* See this page for the formula. To avoid that problem, the R coxph() function internally centers and scales all predictors except those it infers to be associated with categorical predictors (those restricted to values of -1, 0, or 1). It then reports coefficient values back in their original scales.
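For reference, the partial likelihood being maximized has the form

$$L(\beta) = \prod_{i:\,\delta_i = 1} \frac{\exp(\beta^\top x_i)}{\sum_{j \in R(t_i)} \exp(\beta^\top x_j)},$$

where $\delta_i$ is the event indicator and $R(t_i)$ is the risk set at event time $t_i$; covariates far from zero push those exponentials toward overflow or underflow during the iterative fit.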

I don't know whether the tools in scikit-survival do such scaling automatically. If not, it would be wise to follow the scaling approach used by the coxph() function. Your choice of how to proceed can affect what is returned as the baseline hazard and as new predictions from a Cox model, so pay close attention.
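If scikit-survival does not do this internally, one way to mimic coxph() by hand is to standardize only the continuous columns before fitting and then rescale the reported coefficients. A minimal sketch, assuming a dense feature matrix `X`, a structured survival outcome `y`, and that `cont_idx` marks the continuous columns (all of these names are placeholders):

```python
import numpy as np
from sksurv.linear_model import CoxPHSurvivalAnalysis

# cont_idx: indices of the continuous columns in X (assumed here to be 5 and 6);
# the dummy and ordinal columns are left untouched.
cont_idx = [5, 6]
means = X[:, cont_idx].mean(axis=0)
sds = X[:, cont_idx].std(axis=0)

# Center and scale only the continuous predictors before fitting.
X_scaled = X.copy()
X_scaled[:, cont_idx] = (X_scaled[:, cont_idx] - means) / sds

model = CoxPHSurvivalAnalysis().fit(X_scaled, y)

# Convert the fitted coefficients back to the original scale of the
# continuous predictors (centering only shifts the baseline hazard,
# so only the division by the standard deviation is needed here).
coefs = model.coef_.copy()
coefs[cont_idx] = coefs[cont_idx] / sds
```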

As you note, some type of scaling is important in regularized regressions, including Cox regression. You have to think very carefully about whether or how to do that with categorical predictors, however. See this page for an overview.
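As one illustration, scikit-survival's elastic-net Cox model penalizes every coefficient, so whether or not the dummy and ordinal columns were scaled directly changes how strongly each one is shrunk. A hedged sketch, reusing the `X_scaled` from above in which only the continuous columns were standardized:

```python
from sksurv.linear_model import CoxnetSurvivalAnalysis

# Elastic-net penalized Cox fit. Leaving the 0/1 dummies unscaled is itself a
# modelling choice: it determines how much each categorical contrast is penalized
# relative to the standardized continuous predictors.
penalized = CoxnetSurvivalAnalysis(l1_ratio=0.5, alpha_min_ratio=0.01)
penalized.fit(X_scaled, y)
```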


*The Cross Validated page you cite is an extreme case whose "covariates are in the range [0, d], where d is a possibly unbounded amount." That raises the question of whether such covariates should have been transformed or modeled more flexibly than with what seems to be a simple linear term. It seems that the original model on that page hit numeric limits or failed to converge, as centering and scaling have no theoretical reason to change the final coefficient estimates (consistent with the second page you cite on linear and logistic regression).