Multiple Regression – Why P-Values are Misleading After Performing Stepwise Selection

data-mining, multiple-regression, predictive-models, stepwise-regression

Let's consider, for example, a linear regression model. I heard that, in data mining, after performing a stepwise selection based on the AIC criterion, it is misleading to look at the p-values to test the null hypothesis that each true regression coefficient is zero. I heard that, instead, one should consider all the variables left in the model as having a true regression coefficient different from zero. Can anyone explain to me why? Thank you.

Best Answer

after performing a stepwise selection based on the AIC criterion, it is misleading to look at the p-values to test the null hypothesis that each true regression coefficient is zero.

Indeed, p-values represent the probability of seeing a test statistic at least as extreme as the one you observed, when the null hypothesis is true. If $H_0$ is true, the p-value should have a uniform distribution on $[0, 1]$.
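To illustrate, here is a minimal R sketch (the sample size, number of replications, and variable names are arbitrary choices): fitting a single pre-specified model to pure-noise data gives p-values that are roughly uniform.

```r
set.seed(1)
# p-values for one pre-specified regressor under the null:
# y is unrelated to x, so the p-value should be roughly uniform on [0, 1]
pvals <- replicate(1000, {
  x <- rnorm(100)
  y <- rnorm(100)
  summary(lm(y ~ x))$coefficients["x", "Pr(>|t|)"]
})
hist(pvals, breaks = 20)  # approximately flat
```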

But after stepwise selection (or indeed, after a variety of other approaches to model selection), the p-values of those terms that remain in the model don't have that property, even when we know that the null hypothesis is true.

This happens because we keep the variables that have, or tend to have, small p-values (depending on the precise criterion used). This means that the p-values of the variables left in the model are typically much smaller than they would be if we had fitted a single, pre-specified model. Note that, on average, selection will pick models that seem to fit even better than the true model, provided the class of models includes the true model or is flexible enough to closely approximate it.

[In addition, and for basically the same reason, the coefficients that remain are biased away from zero and their standard errors are biased low; this in turn affects confidence intervals and predictions as well: our prediction intervals will be too narrow, for example.]

To see these effects, we can simulate data from a multiple regression where some true coefficients are zero and some are not, perform a stepwise procedure, and then, for those fitted models that retain variables whose true coefficients are zero, look at the p-values that result.

(In the same simulation, you can also look at the coefficient estimates and their standard errors, and see that those corresponding to the truly non-zero coefficients are affected as well.)
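Here is a minimal sketch of such a simulation, using stepAIC from the MASS package; the effect size, number of candidate variables, and replication count are illustrative choices, not anything canonical:

```r
library(MASS)  # provides stepAIC
set.seed(2)

# x1 has a real effect; x2..x5 have true coefficients of zero
null_pvals <- unlist(replicate(500, {
  n   <- 100
  dat <- data.frame(matrix(rnorm(n * 5), n, 5,
                           dimnames = list(NULL, paste0("x", 1:5))))
  dat$y <- 0.5 * dat$x1 + rnorm(n)
  fit <- stepAIC(lm(y ~ ., data = dat), trace = FALSE)
  pv  <- summary(fit)$coefficients[, "Pr(>|t|)"]
  pv[grep("^x[2-5]$", names(pv))]  # p-values of true-zero terms that survived
}, simplify = FALSE))

hist(null_pvals, breaks = 20)  # piled up near zero, nothing like uniform
```

The histogram of the surviving true-zero terms' p-values is heavily concentrated near zero rather than uniform, which is exactly the distortion described above.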

In short, it's not appropriate to consider the usual p-values as meaningful.

I heard that one should consider all the variables left in the model as having a true regression coefficient different from zero instead.

As to whether all the variables left in the model after stepwise selection should be "regarded as significant", I'm not sure to what extent that's a useful way to look at it. What is "significance" intended to mean then?


Here's the result of running R's stepAIC with default settings on 1000 simulated samples of size n = 100, each with ten candidate variables (none of which are related to the response). In each case the number of terms left in the selected model was counted:

[Figure: bar chart showing, across the 1000 simulations, how many of the ten candidate terms were retained in the selected model]

Only 15.5% of the time was the correct (intercept-only) model chosen; the rest of the time, the model retained terms whose true coefficients were zero. Whenever it's genuinely possible that some candidate variables have zero coefficients, we are likely to end up with several true-zero terms in the selected model. As a result, it's not clear it's a good idea to regard all of the retained variables as having non-zero coefficients.
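A sketch of a simulation in this spirit follows; the seed, and hence the exact percentages, will differ from the run summarized above:

```r
library(MASS)  # provides stepAIC
set.seed(3)

# Ten candidate variables, none related to the response
n_kept <- replicate(1000, {
  n   <- 100
  dat <- data.frame(matrix(rnorm(n * 10), n, 10,
                           dimnames = list(NULL, paste0("x", 1:10))))
  dat$y <- rnorm(n)
  fit <- stepAIC(lm(y ~ ., data = dat), trace = FALSE)
  length(coef(fit)) - 1  # number of terms retained besides the intercept
})

table(n_kept)      # distribution of selected-model sizes
mean(n_kept == 0)  # proportion of runs choosing the correct (empty) model
```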