Solved – Does stepwise regression provide a biased estimate of population r-square?

Tags: bias, model-selection, r-squared, regression, stepwise-regression

In psychology and other fields, a form of stepwise regression is often employed that involves the following steps:

  1. Examine the remaining predictors (initially, none are in the model) and identify the one that would produce the largest r-square change;
  2. If the p-value of that r-square change is less than alpha (typically .05), include the predictor and return to step 1; otherwise stop.

For example, see this procedure in SPSS.
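Outside SPSS, here is a minimal sketch of the same forward-selection loop, assuming only numpy and scipy and using the standard nested-model F-test for the significance of the r-square change (function and variable names are illustrative, not any package's API):

```python
import numpy as np
from scipy import stats

def forward_stepwise(X, y, alpha=0.05):
    """Forward selection: at each step, add the candidate predictor that
    gives the largest r-square change, provided the F-test of that change
    has p < alpha. Returns selected column indices and the final R^2."""
    n, p = X.shape
    selected = []
    r2 = 0.0                              # intercept-only model
    tss = np.sum((y - y.mean()) ** 2)
    while True:
        best = None
        for j in set(range(p)) - set(selected):
            cols = selected + [j]
            Z = np.column_stack([np.ones(n), X[:, cols]])
            beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
            rss = np.sum((y - Z @ beta) ** 2)
            r2_new = 1 - rss / tss
            if best is None or r2_new > best[1]:
                best = (j, r2_new)
        if best is None:                  # no candidates left
            break
        j, r2_new = best
        k = len(selected) + 1             # predictors in the candidate model
        df_resid = n - k - 1
        # F-test of the r-square change from adding one predictor
        f = (r2_new - r2) / ((1 - r2_new) / df_resid)
        pval = stats.f.sf(f, 1, df_resid)
        if pval < alpha:
            selected.append(j)
            r2 = r2_new
        else:                             # best remaining change not significant
            break
    return selected, r2
```

The returned `r2` is the sample r-square that such a procedure would report, which is the quantity whose bias is at issue below.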

The procedure is routinely critiqued for a wide range of reasons (see this discussion on the Stata website with references).

In particular, the Stata website summarises several comments by Frank Harrell. I'm interested in the claim:

[stepwise regression] yields R-squared values that are badly biased to be high.

Specifically, some of my current research focuses on estimating population r-square, by which I mean the percentage of variance explained, in the population, by the population data-generating equation. Much of the existing literature I am reviewing has used stepwise regression procedures, and I want to know whether the estimates provided are biased and, if so, by how much. In particular, a typical study would have 30 predictors, n = 200, an alpha-to-enter of .05, and r-square estimates around .50.
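In symbols, one common formalization: if the population data-generating equation is $Y = f(X) + \varepsilon$ with $\operatorname{Var}(\varepsilon) = \sigma^2$, then

$$R^2_{\text{pop}} = \frac{\operatorname{Var}(f(X))}{\operatorname{Var}(Y)} = 1 - \frac{\sigma^2}{\operatorname{Var}(Y)}.$$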

What I do know:

  • Asymptotically, any predictor with a non-zero coefficient would be statistically significant, and r-square would equal adjusted r-square. Thus, asymptotically, stepwise regression should recover the true regression equation and estimate the true population r-square.
  • With smaller sample sizes, the possible omission of some predictors will result in a smaller r-square than if all predictors had been included in the model, while the usual overfitting of r-square to the sample data will inflate it. My naive thought is that these two opposing forces could, under certain conditions, cancel to yield an unbiased r-square, and that more generally the direction of the bias would depend on various features of the data and on the alpha inclusion criterion (see the simulation sketch after this list).
  • Setting a more stringent alpha inclusion criterion (e.g., .01, .001, etc.) should lower the expected estimated r-square, because the probability of including any given predictor in any given sample will be lower.
  • In general, r-square is an upwardly biased estimate of population r-square and the degree of this bias increases with more predictors and smaller sample sizes.
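To make the question concrete, here is a small Monte Carlo sketch under an assumed data-generating model (10 of 30 predictors truly nonzero, population r-square fixed at .50, n = 200, mimicking the typical study above); it reuses the `forward_stepwise` function sketched earlier:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, n_true, pop_r2 = 200, 30, 10, 0.50

# Assumed data-generating equation: 10 unit coefficients, iid N(0,1) predictors
beta = np.zeros(p)
beta[:n_true] = 1.0
signal_var = np.sum(beta ** 2)            # Var(X @ beta) for iid N(0,1) X
noise_sd = np.sqrt(signal_var * (1 - pop_r2) / pop_r2)  # fixes population R^2 at .50

r2_samples = []
for _ in range(200):
    X = rng.standard_normal((n, p))
    y = X @ beta + noise_sd * rng.standard_normal(n)
    _, r2 = forward_stepwise(X, y, alpha=0.05)
    r2_samples.append(r2)

print(f"population R^2: {pop_r2:.2f}")
print(f"mean stepwise sample R^2: {np.mean(r2_samples):.3f}")
```

Comparing the mean of `r2_samples` to the fixed population value of .50 gives a direct estimate of the bias under these particular conditions; the sign and size of any gap will of course change with n, the number of candidates, the sparsity of the true equation, and alpha.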

Question

So, finally, my questions:

  • To what extent is the r-square from stepwise regression a biased estimate of population r-square?
  • To what extent is this bias related to sample size, number of predictors, alpha inclusion criterion, or properties of the data?
  • Are there any references on this topic?

Best Answer

As referenced in my book, there is a literature showing that to get a nearly unbiased estimate of $R^2$ when doing variable selection, one needs to insert into the formula for adjusted $R^2$ the number of candidate predictors, not the number of "selected" predictors. The biases caused by variable selection are therefore substantial. Perhaps more importantly, variable selection results in worse real (predictive) $R^2$ and an inability to actually find the "right" variables.
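For concreteness, the standard shrinkage formula is

$$R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n-1}{n-p-1},$$

and the recommendation above amounts to setting $p$ to the number of candidate predictors (30 in the questioner's typical study), not the number the selection procedure happened to retain.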