An unbiased estimate of population R-square

Tags: bias, estimation, multiple regression, r-squared

I am interested in getting an unbiased estimate of $R^2$ in a multiple linear regression.

On reflection, I can think of two different values that an unbiased estimate of $R^2$ might be trying to match.

Out-of-sample $R^2$: the R-square that would be obtained if the regression equation estimated from the sample (i.e., $\hat{\beta}$) were applied to an infinite amount of data external to the sample but from the same data generating process.
Population $R^2$: the R-square that would be obtained if the model were fitted to an infinite sample (i.e., using $\beta$), or alternatively just the R-square implied by the known data generating process.
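To make the distinction between the two targets concrete, here is a minimal simulation sketch. Everything in it is an assumption chosen for illustration: independent standard-normal predictors, the particular `beta`, and the helper name `draw_sample`. The out-of-sample value is approximated by the squared correlation between predictions and outcomes on a large fresh sample, which is only one possible reading of "out-of-sample R-square".

```r
set.seed(1)
p <- 5; n <- 30; sigma <- 1
beta <- rep(0.3, p)                      # assumed true coefficients

# hypothetical helper: draw m observations from the assumed data generating process
draw_sample <- function(m) {
  X <- matrix(rnorm(m * p), m, p)        # iid N(0, 1) predictors
  data.frame(y = as.vector(X %*% beta) + rnorm(m, sd = sigma), X)
}

# population R-square implied by the data generating process
pop_r2 <- sum(beta^2) / (sum(beta^2) + sigma^2)

# fit on a small sample, then apply the fitted equation to a large fresh sample
train <- draw_sample(n)
fit <- lm(y ~ ., data = train)
test <- draw_sample(1e5)
oos_r2 <- cor(predict(fit, newdata = test), test$y)^2

round(c(sample = summary(fit)$r.squared,
        population = pop_r2,
        out_of_sample = oos_r2), 3)
```

Under these assumptions the out-of-sample value cannot exceed the population value (apart from noise in the fresh sample), because $\hat{\beta}$ is at best as good as $\beta$ on new data.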

I understand that adjusted $R^2$ is designed to compensate for the overfitting observed in sample $R^2$. Nonetheless, it's not clear whether adjusted $R^2$ is actually an unbiased estimate of $R^2$, and if it is an unbiased estimate, which of the above two definitions of $R^2$ it is aiming to estimate.

Thus, my questions:

What is an unbiased estimate of what I call above out of sample $R^2$?
What is an unbiased estimate of what I call above population $R^2$?
Are there any references that provide simulation or other proof of the unbiasedness?

Best Answer

Evaluation of analytic adjustments to R-square

@ttnphns referred me to the Yin and Fan (2001) article, which compares different analytic methods of estimating $R^2$. In line with my question, they distinguish between two types of estimators, using the following terminology:

$\rho^2$: Estimator of the squared population multiple correlation coefficient
$\rho_c^2$: Estimator of the squared population cross-validity coefficient

Their results are summarised in the abstract:

The authors conducted a Monte Carlo experiment to investigate the effectiveness of the analytical formulas for estimating $R^2$ shrinkage, with 4 fully crossed factors (squared population multiple correlation coefficient, number of predictors, sample size, and degree of multicollinearity) and 500 replications in each cell. The results indicated that the most widely used Wherry formula (in both SAS and SPSS) is probably not the most effective analytical formula for estimating $\rho^2$. Instead, the Pratt formula and the Browne formula outperformed other analytical formulas in estimating $\rho^2$ and $\rho_c^2$, respectively.

Thus, the article implies that the Pratt formula (p.209) is a good choice for estimating $\rho^2$:

$$\hat{R}^2=1 - \frac{(N-3)(1 - R^2)}{(N-p-1)} \left[ 1 + \frac{2(1-R^2)}{N-p-2.3} \right]$$

where $R^2$ is the sample R-square, $N$ is the sample size, and $p$ is the number of predictors.
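As a rough sketch, the formula above can be transcribed into R as follows. The function names `pratt_r2` and `adjusted_r2` are made up for this illustration. The second function is the usual adjusted $R^2$ that `summary.lm` reports as `adj.r.squared`, included only for comparison, on the assumption that it is the formula the abstract calls the Wherry formula.

```r
# Pratt's analytic adjustment, transcribed from the formula above.
# r2: sample R-squared; n: sample size; p: number of predictors.
pratt_r2 <- function(r2, n, p) {
  stopifnot(n - p - 2.3 > 0)             # the formula needs a positive denominator
  1 - ((n - 3) * (1 - r2) / (n - p - 1)) *
    (1 + 2 * (1 - r2) / (n - p - 2.3))
}

# Usual adjusted R-squared (what summary.lm reports), for comparison.
adjusted_r2 <- function(r2, n, p) {
  1 - (n - 1) * (1 - r2) / (n - p - 1)
}

# Example on a built-in data set
fit <- lm(mpg ~ wt + hp, data = mtcars)
r2 <- summary(fit)$r.squared
c(sample = r2,
  adjusted = adjusted_r2(r2, n = nrow(mtcars), p = 2),
  pratt = pratt_r2(r2, n = nrow(mtcars), p = 2))
```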

Empirical estimates of adjustments to R-square

Kromrey and Hines (1995) review empirical estimates of $R^2$ (e.g., cross-validation approaches). They show that such algorithms are inappropriate for estimating $\rho^2$. This makes sense given that such algorithms seem to be designed to estimate $\rho_c^2$. However, after reading this, I still wasn't sure whether some form of appropriately corrected empirical estimate might still perform better than analytic estimates in estimating $\rho^2$.
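For concreteness, a k-fold cross-validated R-square is one example of such an empirical estimate. The sketch below is a generic version, not necessarily any of the specific procedures Kromrey and Hines evaluated; the function name `cv_r2` and the choice of squared correlation between held-out predictions and observed values are assumptions of this illustration.

```r
# Hypothetical helper: k-fold cross-validated R-squared for a linear model.
cv_r2 <- function(formula, data, k = 5) {
  # randomly assign each row to one of k folds
  folds <- sample(rep(seq_len(k), length.out = nrow(data)))
  pred <- numeric(nrow(data))
  for (j in seq_len(k)) {
    held_out <- folds == j
    # fit on the other folds, predict the held-out fold
    fit <- lm(formula, data = data[!held_out, , drop = FALSE])
    pred[held_out] <- predict(fit, newdata = data[held_out, , drop = FALSE])
  }
  y <- model.response(model.frame(formula, data))
  cor(pred, y)^2
}

set.seed(1)
cv_r2(mpg ~ wt + hp, data = mtcars)
```

Because every prediction comes from coefficients estimated on other rows, this targets predictive performance of the fitted equation, which is why it lines up with $\rho_c^2$ rather than $\rho^2$.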

References

Kromrey, J. D., & Hines, C. V. (1995). Use of empirical estimates of shrinkage in multiple regression: a caution. Educational and Psychological Measurement, 55(6), 901-925.
Yin, P., & Fan, X. (2001). Estimating $R^2$ shrinkage in multiple regression: A comparison of different analytical methods. The Journal of Experimental Education, 69(2), 203-224.

Population $R^2$

First, I'm trying to understand the definition of the population R-squared.

Quoting your comment:

Or you could define it asymptotically as the proportion of variance explained in your sample as your sample size approaches infinity.

I think you mean this is the limit of the sample $R^2$ when one replicates the model infinitely many times (with the same predictors at each replicate).

So what is the formula for the asymptotic value of the sample $R^2$? Write your linear model $\boxed{Y=\mu+\sigma G}$ as in https://stats.stackexchange.com/a/58133/8402, and use the same notation as in that answer.
Then one can check that the sample $R^2$ goes to $\boxed{popR^2:=\dfrac{\lambda}{n+\lambda}}$, with $\lambda=\sum(\mu_i - \bar \mu)^2/\sigma^2$, when one replicates the model $Y=\mu+\sigma G$ infinitely many times.

As an example:

> ## design of the simple regression model lm(y~x0)
> n0 <- 10
> sigma <- 1
> x0 <- rnorm(n0, 1:n0, sigma)
> a <- 1; b <- 2 # intercept and slope
> params <- c(a,b)
> X <- model.matrix(~x0)
> Mu <- (X%*%params)[,1]
> 
> ## replicate this experiment k times 
> k <- 200
> y <- rep(Mu,k) + rnorm(k*n0)
> # the R-squared is:
> summary(lm(y~rep(x0,k)))$r.squared 
[1] 0.971057
> 
> # theoretical asymptotic R-squared:
> lambda0 <- crossprod(Mu-mean(Mu))/sigma^2
> lambda0/(lambda0+n0)
          [,1]
[1,] 0.9722689
> 
> # other approximation of the asymptotic R-squared for simple linear regression:
> 1-sigma^2/var(y)
[1] 0.9721834

Population $R^2$ of a submodel

Now assume the model is $\boxed{Y=\mu+\sigma G}$ with $H_1\colon\mu \in W_1$ and consider the submodel $H_0\colon \mu \in W_0$.

Then I said above that the population $R^2$ of model $H_1$ is $\boxed{popR^2_1:=\dfrac{\lambda_1}{n+\lambda_1}}$ where $\boxed{\lambda_1=\frac{{\Vert P_{Z_1} \mu\Vert}^2}{\sigma^2}}$ and $Z_1=[1]^\perp \cap W_1$ and then one simply has ${\Vert P_{Z_1} \mu\Vert}^2=\sum(\mu_i - \bar \mu)^2$.

Now, do you define the population $R^2$ of the submodel $H_0$ as the asymptotic value of the $R^2$ calculated with respect to model $H_0$ but under the distributional assumption of model $H_1$? That asymptotic value (if there is one) seems more difficult to find.
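One way to probe this numerically is to reuse the replication idea from the earlier example: generate data under $H_1$, fit only the submodel $H_0$, and see where its $R^2$ settles. The sketch below does this for a hypothetical design with two predictors `x1` and `x2`, where $H_0$ drops `x2`. The `candidate` value, $\lambda_0/(n+\lambda_1)$ with $\lambda_0$ computed from the projection of $\mu$ onto $W_0$, is a guess made for this illustration and not a result stated in this answer.

```r
set.seed(42)
n0 <- 10; sigma <- 1
x1 <- rnorm(n0); x2 <- rnorm(n0)          # assumed fixed design
Mu <- 1 + 2 * x1 + 1 * x2                 # mean under H1: mu in W1 = span(1, x1, x2)

## replicate the experiment k times and fit only the submodel H0: y ~ x1
k <- 5000
y <- rep(Mu, k) + rnorm(k * n0, sd = sigma)
r2_sub <- summary(lm(y ~ rep(x1, k)))$r.squared

## candidate limit (an assumption of this sketch)
Mu0 <- fitted(lm(Mu ~ x1))                # projection of Mu onto W0 = span(1, x1)
lambda_H0 <- sum((Mu0 - mean(Mu0))^2) / sigma^2
lambda_H1 <- sum((Mu - mean(Mu))^2) / sigma^2
candidate <- lambda_H0 / (n0 + lambda_H1)

c(simulated = r2_sub, candidate = candidate)
```

If the two numbers agree for several designs and values of `k`, that suggests a limit exists in this fixed-design setting, but a simulation is not a proof.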
