Solved – an unbiased estimate of population R-square

bias, estimation, multiple regression, r-squared

I am interested in getting an unbiased estimate of $R^2$ in a multiple linear regression.

On reflection, I can think of two different values that an unbiased estimate of $R^2$ might be trying to match.

  1. Out-of-sample $R^2$: the R-square that would be obtained if the regression equation estimated from the sample (i.e., $\hat{\beta}$) were applied to an infinite amount of data external to the sample but from the same data generating process.
  2. Population $R^2$: the R-square that would be obtained if the model were fitted to an infinite sample (i.e., using $\beta$), or equivalently the R-square implied by the known data generating process.
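To make the distinction concrete, here is a minimal simulation sketch. It assumes a toy data generating process that is not part of the question: a single predictor $x \sim N(0, 1)$, a true intercept of zero, and normal errors. The function name `simulate` and all parameter values are hypothetical. Under these assumptions both targets have closed forms, so no "infinite" external data set is needed.

```python
import random

def simulate(n=30, beta=0.5, sigma=1.0, reps=2000, seed=0):
    """Compare population, in-sample, and out-of-sample R-square.

    Assumed toy model: y = beta * x + e, x ~ N(0, 1), e ~ N(0, sigma^2).
    Returns (population R2, mean sample R2, mean out-of-sample R2).
    """
    rng = random.Random(seed)
    var_y = beta ** 2 + sigma ** 2      # Var(y) because Var(x) = 1
    pop_r2 = beta ** 2 / var_y          # definition 2: uses the true beta
    sample_sum = 0.0
    oos_sum = 0.0
    for _ in range(reps):
        x = [rng.gauss(0, 1) for _ in range(n)]
        y = [beta * xi + rng.gauss(0, sigma) for xi in x]
        mx, my = sum(x) / n, sum(y) / n
        sxx = sum((xi - mx) ** 2 for xi in x)
        syy = sum((yi - my) ** 2 for yi in y)
        sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
        b = sxy / sxx                   # fitted slope
        a = my - b * mx                 # fitted intercept
        sample_sum += sxy ** 2 / (sxx * syy)
        # Definition 1: expected squared error of the fitted line on new data.
        # y - a - b*x = (beta - b)*x - a + e, with x and e independent.
        mse_new = sigma ** 2 + a ** 2 + (beta - b) ** 2
        oos_sum += 1 - mse_new / var_y
    return pop_r2, sample_sum / reps, oos_sum / reps

pop_r2, mean_sample_r2, mean_oos_r2 = simulate()
print(pop_r2, mean_sample_r2, mean_oos_r2)
```

In this setup the out-of-sample value can never exceed the population value, because the fitted coefficients can only add prediction error relative to the true ones, while the in-sample value tends to sit above it.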

I understand that adjusted $R^2$ is designed to compensate for the overfitting observed in sample $R^2$. Nonetheless, it's not clear whether adjusted $R^2$ is actually an unbiased estimate of $R^2$, and if it is an unbiased estimate, which of the above two definitions of $R^2$ it is aiming to estimate.

Thus, my questions:

  • What is an unbiased estimate of what I call out-of-sample $R^2$ above?
  • What is an unbiased estimate of what I call population $R^2$ above?
  • Are there any references that provide simulation or other proof of the unbiasedness?

Best Answer

Evaluation of analytic adjustments to R-square

@ttnphns referred me to the Yin and Fan (2001) article, which compares different analytic methods of estimating $R^2$. In line with my question, they distinguish between two types of estimators, using the following terminology:

  • Estimators of $\rho^2$, the squared population multiple correlation coefficient (what I called population $R^2$)
  • Estimators of $\rho_c^2$, the squared population cross-validity coefficient (what I called out-of-sample $R^2$)

Their results are summarised in the abstract:

The authors conducted a Monte Carlo experiment to investigate the effectiveness of the analytical formulas for estimating $R^2$ shrinkage, with 4 fully crossed factors (squared population multiple correlation coefficient, number of predictors, sample size, and degree of multicollinearity) and 500 replications in each cell. The results indicated that the most widely used Wherry formula (in both SAS and SPSS) is probably not the most effective analytical formula for estimating $\rho^2$. Instead, the Pratt formula and the Browne formula outperformed other analytical formulas in estimating $\rho^2$ and $\rho_c^2$, respectively.

Thus, the article implies that the Pratt formula (p.209) is a good choice for estimating $\rho^2$:

$$\hat{R}^2=1 - \frac{(N-3)(1 - R^2)}{(N-p-1)} \left[ 1 + \frac{2(1-R^2)}{N-p-2.3} \right]$$

where $N$ is the sample size, $p$ is the number of predictors, and $R^2$ on the right-hand side is the sample R-square.
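As a quick sketch of how the formula above could be computed, the following Python function transcribes it term by term. The function names are my own, and the second function is the familiar adjusted $R^2$, $1 - (1 - R^2)(N-1)/(N-p-1)$, included only for comparison; I am assuming that is what most software reports.

```python
def pratt_adjusted_r2(r2, n, p):
    """Pratt formula as written above.

    r2: sample R-square, n: sample size, p: number of predictors.
    """
    if n - p - 2.3 <= 0:
        raise ValueError("need n > p + 2.3 for the formula to be defined")
    shrink = (n - 3) * (1 - r2) / (n - p - 1)
    correction = 1 + 2 * (1 - r2) / (n - p - 2.3)
    return 1 - shrink * correction

def standard_adjusted_r2(r2, n, p):
    """Usual adjusted R-square (assumed here for comparison only)."""
    return 1 - (n - 1) * (1 - r2) / (n - p - 1)

# Hypothetical inputs: sample R-square of 0.5 from 50 cases and 5 predictors.
print(pratt_adjusted_r2(0.5, 50, 5))
print(standard_adjusted_r2(0.5, 50, 5))
```

Both adjustments leave a perfect fit ($R^2 = 1$) unchanged and shrink smaller values toward zero; they can return negative values when the sample $R^2$ is small relative to $p/N$.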

Empirical estimates of adjustments to R-square

Kromrey and Hines (1995) review empirical estimates of $R^2$ (e.g., cross-validation approaches). They show that such algorithms are inappropriate for estimating $\rho^2$. This makes sense given that such algorithms seem to be designed to estimate $\rho_c^2$. However, after reading this, I still wasn't sure whether some form of appropriately corrected empirical estimate might still perform better than analytic estimates in estimating $\rho^2$.
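For illustration, one common empirical approach is a leave-one-out cross-validated R-square. The sketch below is a generic version for a single predictor, not necessarily any of the specific algorithms Kromrey and Hines evaluated; the function name `loo_cv_r2` and the data are made up for the example.

```python
def loo_cv_r2(x, y):
    """Leave-one-out cross-validated R-square for simple linear regression.

    Each case is predicted from a line fitted to the other n - 1 cases;
    the summed squared prediction errors are compared with the total
    sum of squares around the full-sample mean of y.
    """
    n = len(x)
    press = 0.0
    for i in range(n):
        xt = x[:i] + x[i + 1:]          # training predictors without case i
        yt = y[:i] + y[i + 1:]          # training responses without case i
        mx = sum(xt) / (n - 1)
        my = sum(yt) / (n - 1)
        sxx = sum((v - mx) ** 2 for v in xt)
        sxy = sum((v - mx) * (w - my) for v, w in zip(xt, yt))
        b = sxy / sxx
        a = my - b * mx
        press += (y[i] - (a + b * x[i])) ** 2
    mean_y = sum(y) / n
    sst = sum((w - mean_y) ** 2 for w in y)
    return 1 - press / sst

# Made-up data for demonstration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.4, 3.8, 5.1, 5.7]
print(loo_cv_r2(x, y))
```

Because each prediction uses coefficients estimated without the predicted case, this quantity tracks how well a fitted equation generalises, which is why it is more naturally read as targeting $\rho_c^2$ than $\rho^2$.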

References

  • Kromrey, J. D., & Hines, C. V. (1995). Use of empirical estimates of shrinkage in multiple regression: a caution. Educational and Psychological Measurement, 55(6), 901-925.
  • Yin, P., & Fan, X. (2001). Estimating $R^2$ shrinkage in multiple regression: A comparison of different analytical methods. The Journal of Experimental Education, 69(2), 203-224.