I am interested in getting an unbiased estimate of $R^2$ in a multiple linear regression.
On reflection, I can think of two different values that an unbiased estimate of $R^2$ might be trying to match.
- Out-of-sample $R^2$: the $R^2$ that would be obtained if the regression equation estimated from the sample (i.e., $\hat{\beta}$) were applied to an infinite amount of data external to the sample but from the same data-generating process.
- Population $R^2$: the $R^2$ that would be obtained if the model were fitted to an infinite sample (i.e., $\beta$) or, alternatively, simply the $R^2$ implied by the known data-generating process.
I understand that adjusted $R^2$ is designed to compensate for the overfitting observed in sample $R^2$. Nonetheless, it's not clear whether adjusted $R^2$ is actually an unbiased estimate of $R^2$, and if it is an unbiased estimate, which of the above two definitions of $R^2$ it is aiming to estimate.
Thus, my questions:
- What is an unbiased estimate of what I call above out of sample $R^2$?
- What is an unbiased estimate of what I call above population $R^2$?
- Are there any references that provide simulation or other proof of the unbiasedness?
Best Answer
Evaluation of analytic adjustments to R-square
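@ttnphns referred me to the Yin and Fan (2001) article, which compares different analytic methods of estimating $R^2$. In line with my question, they distinguish between two types of estimators. They use the following terminology: $\rho^2$ for the population squared multiple correlation (what I called population $R^2$) and $\rho_c^2$ for the population cross-validity coefficient (what I called out-of-sample $R^2$).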
Their results are summarised in the abstract:
Thus, the article implies that the Pratt formula (p. 209) is a good choice for estimating $\rho^2$:
$$\hat{R}^2=1 - \frac{(N-3)(1 - R^2)}{(N-p-1)} \left[ 1 + \frac{2(1-R^2)}{N-p-2.3} \right]$$
where $R^2$ is the sample $R^2$, $N$ is the sample size, and $p$ is the number of predictors.
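For anyone who wants to try the formula, here is a minimal Python sketch of it. The function names `pratt_r2` and `adjusted_r2` are my own, not from Yin and Fan, and the conventional adjusted $R^2$, $1 - (N-1)(1-R^2)/(N-p-1)$, is included only as a familiar point of comparison.

```python
def adjusted_r2(r2, n, p):
    """Conventional adjusted R-squared, included here only for comparison."""
    return 1.0 - (n - 1) * (1.0 - r2) / (n - p - 1)


def pratt_r2(r2, n, p):
    """Pratt formula as written above.

    r2: sample R-squared, n: sample size, p: number of predictors.
    """
    shrink = (n - 3) * (1.0 - r2) / (n - p - 1)
    correction = 1.0 + 2.0 * (1.0 - r2) / (n - p - 2.3)
    return 1.0 - shrink * correction


# Hypothetical inputs, chosen only to show the call signature.
print(adjusted_r2(0.3, n=50, p=5))
print(pratt_r2(0.3, n=50, p=5))
```

Both functions take the sample $R^2$ as input, so they can be applied to the output of any regression routine. Note that either estimate can be negative when the sample $R^2$ is small relative to $p/N$.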
Empirical estimates of adjustments to R-square
Kromrey and Hines (1995) review empirical estimates of $R^2$ (e.g., cross-validation approaches). They show that such algorithms are inappropriate for estimating $\rho^2$. This makes sense given that such algorithms seem to be designed to estimate $\rho_c^2$. However, after reading this, I still wasn't sure whether some form of appropriately corrected empirical estimate might still perform better than analytic estimates in estimating $\rho^2$.
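To make the difference between the two targets concrete, here is a rough simulation sketch in plain Python. Everything about the setup is my own assumption rather than something taken from either article: independent standard normal predictors, unit error variance, and arbitrary choices for the coefficients, sample size, and number of replications. Under those assumptions $\rho^2 = \beta'\beta / (\beta'\beta + 1)$, and the squared correlation between $y$ and the fitted equation on fresh data (one common definition of $\rho_c^2$) has the closed form $(\hat{\beta}'\beta)^2 / \{\hat{\beta}'\hat{\beta}\,(\beta'\beta + 1)\}$, so no external data needs to be generated.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_and_score(rng, n, beta):
    """Draw one sample, fit OLS with an intercept, return (sample R^2, rho_c^2)."""
    p = len(beta)
    xs = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
    ys = [sum(b * x for b, x in zip(beta, row)) + rng.gauss(0, 1) for row in xs]
    # Centering the data is equivalent to including an intercept.
    xbar = [sum(row[j] for row in xs) / n for j in range(p)]
    ybar = sum(ys) / n
    xc = [[row[j] - xbar[j] for j in range(p)] for row in xs]
    yc = [y - ybar for y in ys]
    xtx = [[sum(r[j] * r[k] for r in xc) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * y for r, y in zip(xc, yc)) for j in range(p)]
    bhat = solve(xtx, xty)
    sst = sum(y * y for y in yc)
    sse = sum((y - sum(b * x for b, x in zip(bhat, r))) ** 2
              for r, y in zip(xc, yc))
    r2 = 1.0 - sse / sst
    # Squared population correlation between y and the fitted equation,
    # valid only under the assumed design (identity predictor covariance).
    bb = sum(b * b for b in beta)
    hb = sum(h * b for h, b in zip(bhat, beta))
    hh = sum(h * h for h in bhat)
    rho_c2 = hb * hb / (hh * (bb + 1.0))
    return r2, rho_c2


def simulate(reps=500, n=40, beta=(0.3, 0.3, 0.3, 0.0, 0.0), seed=0):
    """Return (mean sample R^2, population rho^2, mean rho_c^2)."""
    rng = random.Random(seed)
    results = [fit_and_score(rng, n, beta) for _ in range(reps)]
    mean_r2 = sum(r for r, _ in results) / reps
    mean_rho_c2 = sum(c for _, c in results) / reps
    bb = sum(b * b for b in beta)
    return mean_r2, bb / (bb + 1.0), mean_rho_c2


mean_r2, rho2, mean_rho_c2 = simulate()
print("mean sample R^2:", mean_r2)
print("population rho^2:", rho2)
print("mean rho_c^2:", mean_rho_c2)
```

In this setup the average sample $R^2$ sits above $\rho^2$, while $\rho_c^2$ can never exceed $\rho^2$ (by the Cauchy-Schwarz inequality), so an estimator tuned to one target will be biased for the other. This is only an illustration of the distinction, not a replication of the comparisons in the articles.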
References