Regression – Does Having Too Many Variables in a Regression Model Affect Inference?

bias, feature selection, inference, overfitting, regression

Regression models can be used for inference on the coefficients to describe predictor relationships or for prediction about an outcome. I'm aware of the bias-variance tradeoff and know that including too many variables in the regression will cause the model to overfit, making poor predictions on new data. Do these overfitting problems extend to inferences made on the predictors?

Say I'm working with a cancer dataset (n=200) that includes overall survival and several hundred genomic markers. I'm interested in describing the relationship between each marker and survival, and would like to identify markers that show strong evidence of an association with survival. Is it wrong to fit a model with all the markers and clinical factors (age, sex, treatment, etc.) and then look at hazard ratios, confidence intervals, and p-values to identify "important" predictors? Building a model with hundreds of parameters feels wrong, but I'm not sure if there's an underlying reason why this approach should be avoided. Would this create a multiple comparisons problem? Does sample size play a role in whether this approach is valid?

In my experience some people would use stepwise model selection (based on p-values or AIC) and then report the p-values from the final model, but from what I've read stepwise selection exaggerates significance — the surviving p-values are too small — and provides unreliable inference due to selection bias. I also try to avoid building a separate univariable model for each predictor, because omitted-variable bias can produce misleading effect estimates.

The results from my model would be hypothesis generating to prioritize gene candidates for experimental study.

Best Answer

One problem with dumping all of your predictors into the model is the invitation to extreme collinearity, which will inflate your standard errors and likely make your results uninterpretable.
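Here is a minimal numpy sketch of that standard-error inflation, on simulated data (the near-duplicate predictor `x2` is contrived to make the collinearity obvious; nothing here is specific to your dataset):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)  # almost a copy of x1
y = 1.0 * x1 + rng.normal(size=n)         # only x1 truly matters

def ols_se(X, y):
    """Standard errors of OLS coefficients (intercept added internally)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return np.sqrt(np.diag(cov))

se_alone = ols_se(x1[:, None], y)[1]                    # x1 by itself
se_both = ols_se(np.column_stack([x1, x2]), y)[1]       # x1 next to x2
print(se_alone, se_both)  # the second SE is many times larger
```

With a correlation this high, the variance inflation factor is in the hundreds, so the confidence interval for x1's coefficient widens by an order of magnitude even though the data-generating effect is unchanged.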

Judea Pearl has pointed to a second problem, if your inference is aimed at modeling causal relationships. In trying to "control for everything" by including all available predictors, you may actually open new biasing paths and move farther away from, not closer to, good estimates of causal relationships. In the language of his graphical system, conditioning on a collider, or on a descendant of a collider, opens a spurious (non-causal) path between its parents.

A third problem: with your limited sample size, statistical power with so many predictors will be low, which inflates the likelihood that what seems like a finding now will prove not to be one later, following the reasoning of John Ioannidis (2005).
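This also answers the multiple-comparisons question in the original post. A pure-noise sketch (every "marker" below is simulated with no relationship to the outcome, and the |t| > 2 cutoff is a rough stand-in for a two-sided 5% test):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 150            # sample size and number of null markers
X = rng.normal(size=(n, p))
y = rng.normal(size=n)     # outcome unrelated to every marker

# One big OLS fit; flag coefficients with |t| > 2 as "significant".
Xd = np.column_stack([np.ones(n), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
resid = y - Xd @ beta
sigma2 = resid @ resid / (n - Xd.shape[1])
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(Xd.T @ Xd)))
t = beta / se
n_flagged = int(np.sum(np.abs(t[1:]) > 2))
print(n_flagged, "of", p, "pure-noise markers look 'significant'")
```

Even with nothing to find, a handful of markers clear the threshold by chance alone — roughly 5% of them in expectation — so for hypothesis generation you would at minimum want a multiplicity correction (e.g. controlling the false discovery rate) before prioritizing candidates.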
