Solved – Does stepwise regression provide a biased estimate of population r-square?

Tags: bias, model-selection, r-squared, regression, stepwise-regression

In psychology and other fields, a form of stepwise regression is often employed that involves the following steps:

  1. Examine the remaining predictors (initially, none are in the model) and identify the one that would produce the largest r-square change;
  2. If the p-value of that r-square change is less than alpha (typically .05), include the predictor and return to step 1; otherwise stop.

For example, see this procedure in SPSS.
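Outside SPSS, here is a minimal sketch of the same forward-selection loop, assuming only numpy and scipy and using the standard nested-model F-test for the significance of the r-square change (function and variable names are illustrative, not any package's API):

```python
import numpy as np
from scipy import stats

def forward_stepwise(X, y, alpha=0.05):
    """Forward selection: at each step, add the candidate predictor that
    gives the largest r-square change, provided the F-test of that change
    has p < alpha. Returns selected column indices and the final R^2."""
    n, p = X.shape
    selected = []
    r2 = 0.0                              # intercept-only model
    tss = np.sum((y - y.mean()) ** 2)
    while True:
        best = None
        for j in set(range(p)) - set(selected):
            cols = selected + [j]
            Z = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
            rss = np.sum((y - Z @ beta) ** 2)
            r2_new = 1 - rss / tss
            if best is None or r2_new > best[1]:
                best = (j, r2_new)
        if best is None:                  # no candidates left
            break
        j, r2_new = best
        k = len(selected) + 1             # predictors in the candidate model
        df_resid = n - k - 1
        # F-test of the r-square change from adding one predictor
        f = (r2_new - r2) / ((1 - r2_new) / df_resid)
        pval = stats.f.sf(f, 1, df_resid)
        if pval < alpha:
            selected.append(j)
            r2 = r2_new
        else:                             # best remaining change not significant
            break
    return selected, r2
```

The returned `r2` is the sample r-square that such a procedure would report, which is the quantity whose bias is at issue below.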

The procedure is routinely critiqued for a wide range of reasons (see this discussion on the Stata website with references).

In particular, the Stata website summarises several comments by Frank Harrell. I'm interested in the claim:

[stepwise regression] yields R-squared values that are badly biased to be high.

Specifically, some of my current research focuses on estimating population r-square, by which I mean the percentage of variance explained, in the population, by the population data-generating equation. Much of the existing literature I am reviewing has used stepwise regression procedures, and I want to know whether the estimates provided are biased and, if so, by how much. In particular, a typical study would have 30 predictors, n = 200, an alpha-to-enter of .05, and r-square estimates around .50.
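In symbols, one common formalization: if the population data-generating equation is $Y = f(X) + \varepsilon$ with $\operatorname{Var}(\varepsilon) = \sigma^2$, then

$$R^2_{\text{pop}} = \frac{\operatorname{Var}(f(X))}{\operatorname{Var}(Y)} = 1 - \frac{\sigma^2}{\operatorname{Var}(Y)}.$$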

What I do know:

  • Asymptotically, any predictor with a non-zero coefficient would be statistically significant, and r-square would equal adjusted r-square. Thus, asymptotically, stepwise regression should recover the true regression equation and estimate the true population r-square.
  • With smaller sample sizes, the possible omission of some predictors will result in a smaller r-square than if all predictors had been included in the model, while the usual overfitting of r-square to the sample data will inflate it. My naive thought is that these two opposing forces could, under certain conditions, cancel to yield an unbiased r-square, and that more generally the direction of the bias would depend on various features of the data and on the alpha inclusion criterion (see the simulation sketch after this list).
  • Setting a more stringent alpha inclusion criterion (e.g., .01, .001, etc.) should lower the expected estimated r-square, because the probability of including any given predictor in any given sample will be lower.
  • In general, r-square is an upwardly biased estimate of population r-square and the degree of this bias increases with more predictors and smaller sample sizes.
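To make the question concrete, here is a small Monte Carlo sketch under an assumed data-generating model (10 of 30 predictors truly nonzero, population r-square fixed at .50, n = 200, mimicking the typical study above); it reuses the `forward_stepwise` function sketched earlier:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, n_true, pop_r2 = 200, 30, 10, 0.50

# Assumed data-generating equation: 10 unit coefficients, iid N(0,1) predictors
beta = np.zeros(p)
beta[:n_true] = 1.0
signal_var = np.sum(beta ** 2)            # Var(X @ beta) for iid N(0,1) X
noise_sd = np.sqrt(signal_var * (1 - pop_r2) / pop_r2)  # fixes population R^2 at .50

r2_samples = []
for _ in range(200):
    X = rng.standard_normal((n, p))
    y = X @ beta + noise_sd * rng.standard_normal(n)
    _, r2 = forward_stepwise(X, y, alpha=0.05)
    r2_samples.append(r2)

print(f"population R^2: {pop_r2:.2f}")
print(f"mean stepwise sample R^2: {np.mean(r2_samples):.3f}")
```

Comparing the mean of `r2_samples` to the fixed population value of .50 gives a direct estimate of the bias under these particular conditions; the sign and size of any gap will of course change with n, the number of candidates, the sparsity of the true equation, and alpha.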

Question

So, finally, my questions:

  • To what extent is the r-square from stepwise regression a biased estimate of population r-square?
  • To what extent is this bias related to sample size, number of predictors, alpha inclusion criterion, or properties of the data?
  • Are there any references on this topic?

Best Answer

As referenced in my book, there is a literature showing that to get a nearly unbiased estimate of $R^2$ when doing variable selection, one needs to insert into the formula for adjusted $R^2$ the number of candidate predictors, not the number of "selected" predictors. The biases caused by variable selection are therefore substantial. Perhaps more importantly, variable selection results in worse real (predictive) $R^2$ and an inability to actually find the "right" variables.
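For concreteness, the standard shrinkage formula is

$$R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n-1}{n-p-1},$$

and the recommendation above amounts to setting $p$ to the number of candidate predictors (30 in the questioner's typical study), not the number the selection procedure happened to retain.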