Solved – Number of Covariates in Cox PH Model and Overfitting

Tags: cox-model, model selection, overfitting, regression-strategies, survival

I have a small time-to-event dataset (N=20) in which patients were given one of two drugs (drug) at varying doses (dose). Several biomarkers (biomarker1, biomarker2, etc.) were recorded for each patient and are included as covariates.

I'd like to estimate the hazard ratios for the biomarkers to determine whether they are potentially prognostic of survival. Because patients received different treatment regimens, I'm concerned about confounding and want to adjust for treatment when I consider the biomarkers in a Cox regression. If I had a large dataset, I would use a Cox model that adjusts for treatment and dose and includes all the biomarkers of interest:

cph(Surv(time, event) ~ treatment + dose + treatment*dose + biomarker1 + biomarker2 ...)

I'm aware that overfitting is a major concern when developing prediction models: too many variables can cause the model to overfit and then fail to validate. Since I'm interested in effect estimation to understand the data at hand, not in building a prediction model, do I still need to be concerned about overfitting, and should I limit the number of variables included? If so, how should I determine how many covariates to include?

Put differently: must I limit the number of variables in my model, or can I set overfitting aside and fit the full model with all treatment effects and biomarkers, accepting that the small N may mean my estimates do not replicate in future, larger studies?

Best Answer

You did not state the all-important number of events, but clearly it cannot exceed 20.

There is a lower limit on the sample size required to estimate even the simplest quantity, such as an overall event incidence. If there were no censoring, the sample size required to estimate a simple proportion to within a margin of error of ±0.1 with 0.95 confidence would be 96. That is just to estimate the intercept in a binary logistic model. To include covariates you would clearly need more than 96 subjects. Any analysis of 20 subjects is futile.
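
As a rough check on the figure of 96, it can be reproduced from the usual normal-approximation margin of error for a proportion. This sketch assumes the worst case $p = 0.5$, a 0.95 confidence level ($z \approx 1.96$), and a target margin of error $\delta = 0.1$. These inputs are assumptions about how the number was obtained, not values stated in the answer itself.

$$
\delta = z\sqrt{\frac{p(1-p)}{n}}
\quad\Longrightarrow\quad
n = \frac{z^{2}\,p(1-p)}{\delta^{2}}
  = \frac{1.96^{2}\times 0.25}{0.1^{2}}
  \approx 96
$$

Under the same assumptions, $n = 20$ gives $\delta = 1.96\sqrt{0.25/20} \approx 0.22$, so even a single overall proportion would be estimated only to within about ±0.22.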

A recent paper showing how to calculate sample size when covariates are present is here.

For information about where 96 comes from see BBR and RMS in these links.
