Shrunken $r$ vs unbiased $r$: estimators of $\rho$

correlation, estimators, pearson-r, point-estimation, unbiased-estimator

I have been confused about two types of estimators of the population value of the Pearson correlation coefficient.

A. Fisher (1915) showed that for a bivariate normal population the empirical $r$ is a negatively biased estimator of $\rho$, although the bias is of practically considerable size only for small sample sizes ($n<30$). The sample $r$ underestimates $\rho$ in the sense that it is closer to $0$ than $\rho$. (Except when $\rho$ is $0$ or $\pm 1$, for then $r$ is unbiased.) Several nearly unbiased estimators of $\rho$ have been proposed, the best one probably being Olkin and Pratt's (1958) corrected $r$:

$$r_\text{unbiased} = r \left [1+\frac{1-r^2}{2(n-3)} \right ]$$
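For concreteness, a minimal Python sketch of this correction (the function name `olkin_pratt_r` is mine, not from the paper):

```python
import numpy as np

def olkin_pratt_r(r, n):
    """Approximate Olkin-Pratt correction: inflates sample r toward rho.

    First-order approximation r * [1 + (1 - r^2) / (2 * (n - 3))],
    as in the formula above; requires n > 3.
    """
    r = np.asarray(r, dtype=float)
    return r * (1.0 + (1.0 - r**2) / (2.0 * (n - 3)))

# Example: with n = 20 the correction inflates r = 0.5 only slightly.
print(olkin_pratt_r(0.5, 20))  # ~0.511
```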

B. It is said that in regression the observed $R^2$ overestimates the corresponding population R-square. Or, with simple regression, that $r^2$ overestimates $\rho^2$. Based on that fact, I've seen many texts say that $r$ is positively biased relative to $\rho$ in absolute value: $r$ is farther from $0$ than $\rho$ (is that statement true?). The texts say it is the same problem as the overestimation of the standard deviation parameter by its sample value. There exist many formulas to "adjust" the observed $R^2$ closer to its population parameter, Wherry's (1931) $R_\text{adj}^2$ being the most well known (but not the best). The square root of such an adjusted $r_\text{adj}^2$ is called the shrunken $r$:

$$r_\text{shrunk} = \pm\sqrt{1-(1-r^2)\frac{n-1}{n-2}}$$
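A companion sketch for the shrunken estimator (again, the function name is mine); since the adjusted $r^2$ can go negative for small $r$ and $n$, it is truncated at zero before taking the root:

```python
import numpy as np

def shrunken_r(r, n):
    """Shrunken r for simple regression (one predictor).

    Square root of the adjusted r^2 from the formula above, carrying
    the sign of the observed r; the adjusted r^2 is truncated at 0
    so that the square root is defined.
    """
    r = np.asarray(r, dtype=float)
    r2_adj = 1.0 - (1.0 - r**2) * (n - 1) / (n - 2)
    return np.sign(r) * np.sqrt(np.maximum(r2_adj, 0.0))

# Example: with n = 20 the shrinkage pulls r = 0.5 toward 0.
print(shrunken_r(0.5, 20))  # ~0.456
```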

So here are two different estimators of $\rho$. Very different: the first one inflates $r$, the second deflates $r$. How to reconcile them? Where should one use/report the one, and where the other?

In particular, can it be true that the "shrunken" estimator is (nearly) unbiased too, like the "unbiased" one, but only in a different context: the asymmetrical context of regression? For in OLS regression we consider the values of one side (the predictor) as fixed, with no random error from sample to sample. (And to add here, regression does not need bivariate normality.)

Best Answer

Regarding the bias in the correlation: when sample sizes are small enough for the bias to have any practical significance (e.g., the $n < 30$ you suggested), the bias is likely to be the least of your worries, because the sheer sampling inaccuracy of $r$ is terrible.
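A quick simulation sketch of that point (the parameter values $n = 20$, $\rho = 0.5$ are arbitrary choices of mine): the bias of $r$ comes out roughly an order of magnitude smaller than its sampling standard deviation.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, reps = 20, 0.5, 100_000
cov = [[1.0, rho], [rho, 1.0]]

# reps independent bivariate-normal samples of size n
xy = rng.multivariate_normal([0.0, 0.0], cov, size=(reps, n))
x, y = xy[..., 0], xy[..., 1]
xc = x - x.mean(axis=1, keepdims=True)
yc = y - y.mean(axis=1, keepdims=True)
rs = (xc * yc).sum(axis=1) / np.sqrt((xc**2).sum(axis=1) * (yc**2).sum(axis=1))

print(f"bias of r: {rs.mean() - rho:+.4f}")  # small, around -0.01
print(f"SD of r:   {rs.std():.4f}")          # around 0.17, far larger
```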

Regarding the bias of $R^2$ in multiple regression, there are many different adjustments that pertain to unbiased population estimation vs. unbiased estimation in an independent sample of equal size. See Yin, P., & Fan, X. (2001). Estimating $R^2$ shrinkage in multiple regression: A comparison of analytical methods. The Journal of Experimental Education, 69, 203-224.
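To illustrate the two targets (this simulation is my own sketch, not taken from Yin & Fan): the in-sample $R^2$ sits above the population $R^2$, while the squared correlation between fixed-sample predictions and outcomes in an independent sample of equal size sits below it.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, reps = 30, 5, 20_000
beta = np.full(p, 0.2928)               # population R^2 = b'b/(b'b+1) ~ 0.30
pop_r2 = beta @ beta / (beta @ beta + 1.0)

r2_in = np.empty(reps)
r2_cross = np.empty(reps)
for i in range(reps):
    X = rng.standard_normal((n, p))
    y = X @ beta + rng.standard_normal(n)
    A = np.column_stack([np.ones(n), X])
    b_hat, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ b_hat
    r2_in[i] = 1.0 - (resid**2).sum() / ((y - y.mean())**2).sum()
    # score the fixed training coefficients on an independent sample
    X2 = rng.standard_normal((n, p))
    y2 = X2 @ beta + rng.standard_normal(n)
    pred2 = np.column_stack([np.ones(n), X2]) @ b_hat
    r2_cross[i] = np.corrcoef(pred2, y2)[0, 1] ** 2

print(f"population R^2:        {pop_r2:.3f}")
print(f"mean in-sample R^2:    {r2_in.mean():.3f}")    # above the population value
print(f"mean cross-sample R^2: {r2_cross.mean():.3f}") # below the population value
```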

Modern-day regression methods also address the shrinkage of regression coefficients, and of $R^2$ as a consequence -- e.g., the elastic net with $k$-fold cross-validation; see http://web.stanford.edu/~hastie/Papers/elasticnet.pdf.
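A minimal sketch of that approach using scikit-learn's `ElasticNetCV` on arbitrary simulated data (all parameter choices here are mine):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 10))
y = X[:, 0] - 0.5 * X[:, 1] + rng.standard_normal(100)

# Penalty strength alpha is picked by 5-fold cross-validation; l1_ratio
# mixes the L1 (lasso) and L2 (ridge) components of the elastic net.
model = ElasticNetCV(l1_ratio=0.5, cv=5).fit(X, y)
print(model.alpha_)  # selected penalty strength
print(model.coef_)   # shrunken coefficients; some may be exactly zero
```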