Solved – When is $R^2$ the same as Pearson’s $r$ squared

Tags: correlation, pearson-r, r-squared, regression

I am a bit confused about the relationship between Pearson's $r$ and the Coefficient of Determination $R^2$.

$$
r = \frac{\sum\limits_{i = 1}^n (x_i - \overline x)(y_i - \overline y)}{\sqrt{\sum\limits_{i = 1}^n (x_i - \overline x)^2}\sqrt{\sum\limits_{i = 1}^n (y_i - \overline y)^2}}
$$

$$
R^2 = 1 - \frac{\sum\limits_{i = 1}^n (x_i - y_i)^2}{\sum\limits_{i = 1}^n (x_i - \overline x)^2}
$$

Let $y$ be the prediction and $x$ be the actual values, and $\overline y$ and $\overline x$ their means.

Most explanations I have read say that $R^2$ can be derived by squaring Pearson's $r$, hence the name. However, using the formulas given above, squaring $r$ does not give $R^2$; at least not for the data I have tried.

Are the formulas wrong, or what is happening?

Wikipedia has a vague statement on this: "When an intercept is included, then $r^2$ is simply the square of the sample correlation coefficient (i.e., $r$)."
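For concreteness, here is a small numeric sketch of the mismatch (made-up data, using NumPy), following the question's convention of $x$ as the actual values and $y$ as predictions that did *not* come from an OLS fit:

```python
import numpy as np

# Made-up example: x = actual values, y = predictions that are
# NOT the fitted values of an OLS regression on x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.5, 2.1, 2.6, 4.4, 4.8])

# Pearson's r between x and y
r = np.corrcoef(x, y)[0, 1]

# R^2 as defined in the question: 1 - SS_res / SS_tot
R2 = 1 - np.sum((x - y) ** 2) / np.sum((x - np.mean(x)) ** 2)

print(r ** 2, R2)  # close, but not equal
```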

Best Answer

Your confusion lies in the fact that this only works if $y_i = \hat{x}_i$, where $\hat{x}_i$ are the OLS fitted values from a linear regression with response $x$ (and an intercept) — for example, regressing $x$ on $y$. Try running that regression and using the fitted values $\hat{x}$ in place of your raw predictions $y$, and the two quantities will match.

The short version is that for this to work, the predictions $y$ need to satisfy several criteria, e.g. $\sum_i (x_i - y_i) = 0$ and $\sum_i y_i(x_i - y_i) = 0$ (the residuals sum to zero and are orthogonal to the fitted values), which hold exactly when $y_i = \hat{x}_i$ from an OLS fit with an intercept.
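A sketch of the fix, again with $x$ as the actuals and $y$ as the predictions, assuming a hypothetical predictor `z` and using NumPy's `polyfit` for the OLS line: once $y$ really is the OLS fitted value for $x$, the two quantities coincide.

```python
import numpy as np

# Hypothetical setup: x = actual values, z = some predictor.
rng = np.random.default_rng(0)
z = rng.normal(size=50)
x = 2.0 + 1.5 * z + rng.normal(size=50)  # actual values

# OLS fit (with intercept) of x on z; the fitted values play the role of y
slope, intercept = np.polyfit(z, x, 1)
y = intercept + slope * z

# R^2 from the question's formula
R2 = 1 - np.sum((x - y) ** 2) / np.sum((x - np.mean(x)) ** 2)

# Pearson's r between actuals and OLS fitted values, squared
r2 = np.corrcoef(x, y)[0, 1] ** 2

print(np.isclose(R2, r2))  # True
```

The agreement holds for any OLS fit that includes an intercept, because only then do the residuals sum to zero and stay orthogonal to the fitted values.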
