Regression – Equivalence of Sample Correlation and R Statistic for Simple Linear Regression

Tags: correlation, regression

It is often stated that the square of the sample correlation $r^2$ is equivalent to the $R^2$ coefficient of determination for simple linear regression. I have been unable to demonstrate this myself and would appreciate a full proof of this fact.

Best Answer

There seems to be some variation in notation: in a simple linear regression, I've usually seen the phrase "sample correlation coefficient" with symbol $r$ as a reference to the correlation between observed $x$ and $y$ values. This is the notation I have adopted for this answer. I have also seen the same phrase and symbol used to refer to the correlation between observed $y$ and fitted $\hat y$; in my answer I have referred to this as the "multiple correlation coefficient" and used the symbol $R$. This answer addresses why the coefficient of determination is both the square of $r$ and also the square of $R$, so it shouldn't matter which usage was intended.

The $r^2$ result follows in one line of algebra once some straightforward facts about correlation and the meaning of $R$ are established, so you may prefer to skip down to the boxed equation. I assume we don't have to prove basic properties of covariance and variance, in particular:

$$\text{Cov}(aX+b, Y) = a\text{Cov}(X,Y)$$ $$\text{Var}(aX+b) = a^2\text{Var}(X)$$

Note that the latter can be derived from the former, once we know that covariance is symmetric and that $\text{Var}(X)= \text{Cov}(X,X)$. From here we derive another basic fact, about correlation. For $a \neq 0$, and so long as $X$ and $Y$ have non-zero variances,

$$\begin{align} \text{Cor}(aX+b, Y) &= \frac{\text{Cov}(aX+b, Y)}{\sqrt{\text{Var}(aX+b) \text{Var} (Y)}} \\ &= \frac{a}{\sqrt{a^2}} \times \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X) \text{Var} (Y)}} \\ \text{Cor}(aX+b, Y) &= \text{sgn}(a) \, \text{Cor}(X,Y) \end{align} $$

Here $\text{sgn}(a)$ is the signum or sign function: its value is $\text{sgn}(a) = +1$ if $a>0$ and $\text{sgn}(a) = -1$ if $a<0$. It's also true that $\text{sgn}(a) = 0$ if $a=0$, but that case doesn't concern us: $aX+b$ would be a constant, so $\text{Var}(aX+b) = 0$ in the denominator and we can't calculate the correlation. Symmetry arguments let us generalise this result, for $a, \, c \neq 0$:

$$\text{Cor}(aX+b, \, cY+d) = \text{sgn}(a) \, \text{sgn}(c) \, \text{Cor}(X,Y)$$

We won't need this more general formula to answer the current question, but I include it to emphasise the geometry of the situation: it simply states that correlation is unchanged when either variable is scaled or translated, but reverses in sign when a variable is reflected.
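If you'd rather see these transformation rules in action than take them on trust, here is a minimal NumPy check; the simulated data and the constants $a, b, c, d$ are arbitrary choices of mine, not anything from the question.

```python
# Numerical sanity check of the affine-transformation rules for covariance,
# variance and correlation (arbitrary simulated data and constants).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)   # give x and y some correlation

a, b, c, d = -2.0, 3.0, 4.0, -1.0     # a != 0, c != 0

def cov(u, v):
    return np.cov(u, v)[0, 1]

def cor(u, v):
    return np.corrcoef(u, v)[0, 1]

# Cov(aX + b, Y) = a Cov(X, Y)
print(np.isclose(cov(a * x + b, y), a * cov(x, y)))                       # True

# Var(aX + b) = a^2 Var(X)
print(np.isclose(np.var(a * x + b, ddof=1), a**2 * np.var(x, ddof=1)))    # True

# Cor(aX + b, cY + d) = sgn(a) sgn(c) Cor(X, Y)
print(np.isclose(cor(a * x + b, c * y + d),
                 np.sign(a) * np.sign(c) * cor(x, y)))                    # True
```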

We need one more fact: for a linear model including a constant term, the coefficient of determination $R^2$ is the square of the multiple correlation coefficient $R$, which is the correlation between the observed responses $Y$ and the model's fitted values $\hat Y$. This applies for both multiple and simple regressions, but let us restrict our attention to the simple linear model $\hat Y = \hat \beta_0 + \hat \beta_1 X$. The result follows from the observation that $\hat Y$ is a scaled, possibly reflected, and translated version of $X$:

$$\boxed{R = \text{Cor}(\hat Y, Y) = \text{Cor}(\hat \beta_0 + \hat \beta_1 X, \, Y) = \text{sgn}(\hat \beta_1) \, \text{Cor}(X, Y) = \text{sgn}(\hat \beta_1) \, r}$$

So $R = \pm r$, with the sign matching that of the estimated slope. In simple linear regression the slope estimate is $\hat \beta_1 = r \, s_Y / s_X$, so $\hat \beta_1$ and $r$ always share the same sign and $R = |r| \geq 0$, which guarantees $R$ is not negative. Either way, $R^2 = r^2$.
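As a quick sanity check of the boxed identity, here is a short NumPy sketch on simulated data of my own; the negative true slope is deliberate, so that $\text{sgn}(\hat\beta_1) = -1$ actually matters.

```python
# Fit a simple linear regression and check R = sgn(beta1_hat) * r and R^2 = r^2.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=200)
y = 5.0 - 1.3 * x + rng.normal(scale=2.0, size=200)   # negative true slope

b1, b0 = np.polyfit(x, y, deg=1)       # slope and intercept of the fitted line
y_hat = b0 + b1 * x

r = np.corrcoef(x, y)[0, 1]            # sample correlation of x and y
R = np.corrcoef(y_hat, y)[0, 1]        # multiple correlation Cor(Y_hat, Y)

print(np.isclose(R, np.sign(b1) * r))  # True: R = sgn(beta1_hat) * r
print(np.isclose(R**2, r**2))          # True: the coefficient of determination
```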

The preceding argument was made simpler by not having to consider sums of squares. To achieve this, I skipped over the details of the relationship between $R^2$, which we normally think of in terms of sums of squares, and $R$, for which we think about correlations of fitted and observed responses. The symbols make the relationship $R^2 = (R)^2$ seem tautological but this is not the case, and the relationship breaks down if there is no intercept term in the model! I'll give a brief sketch of a geometric argument about the relationship between $R$ and $R^2$ taken from a different question: the diagram is drawn in $n$-dimensional subject space, so each axis (not shown) represents a single unit of observation, and variables are shown as vectors. The columns of the design matrix $\mathbf{X}$ are the vector $\mathbf{1_n}$ (for the constant term) and the vector of observations of the explanatory variable, so the column space is a two-dimensional flat.

[Figure: vectors in subject space of the regression]
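Since I can't reproduce the diagram here, the following NumPy sketch builds the same objects for a small simulated data set (the two columns of the design matrix, the fitted vector and the residual vector) and checks the orthogonality facts derived in the next two paragraphs; the variable names and data are mine.

```python
# Construct the vectors the diagram depicts, for simulated data.
import numpy as np

rng = np.random.default_rng(2)
n = 6
x = rng.uniform(0, 10, size=n)
y = 1.0 + 0.8 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])           # columns: 1_n and the observed x
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat                           # orthogonal projection of y onto col(X)
e = y - y_hat                                  # residual vector

print(np.allclose(X.T @ e, 0.0))               # True: e is perpendicular to both columns
print(np.isclose(e.sum(), 0.0))                # True: residuals sum to zero
print(np.isclose(y.mean(), y_hat.mean()))      # True: both means equal Y-bar
```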

The fitted $\mathbf{\hat{Y}}$ is the orthogonal projection of the observed $\mathbf{Y}$ onto the column space of $\mathbf{X}$. This means the vector of residuals $\mathbf{e} = \mathbf{Y} - \mathbf{\hat{Y}}$ is perpendicular to the flat, and hence to $\mathbf{1_n}$. Taking the dot product, $0 = \mathbf{1_n} \cdot \mathbf{e} = \sum_{i=1}^n e_i$. Since the residuals sum to zero and $Y_i = \hat{Y}_i + e_i$, we have $\sum_{i=1}^n Y_i = \sum_{i=1}^n \hat{Y}_i$, so both fitted and observed responses have mean $\bar{Y}$. The dashed lines in the diagram, $\mathbf{Y} - \bar{Y}\mathbf{1_n}$ and $\mathbf{\hat{Y}} - \bar{Y}\mathbf{1_n}$, are therefore the centered vectors for the observed and fitted responses, and the cosine of the angle $\theta$ between them is their correlation $R$.

The triangle these vectors form with the vector of residuals is right-angled since $\mathbf{\hat{Y}} - \bar{Y}\mathbf{1_n}$ lies in the flat but $\mathbf{e}$ is orthogonal to it. Applying Pythagoras:

$$\|\mathbf{Y} - \bar{Y}\mathbf{1_n}\|^2 = \|\mathbf{Y} - \mathbf{\hat{Y}}\|^2 + \|\mathbf{\hat{Y}} - \bar{Y}\mathbf{1_n}\|^2 $$

This is just the decomposition of the sums of squares, $SS_{\text{total}} = SS_{\text{residual}} + SS_{\text{regression}}$. The conventional formula for the coefficient of determination is $1 - \frac{SS_{\text{residual}}}{SS_{\text{total}}}$ which in this triangle is $1 - \sin^2 \theta = \cos^2 \theta$ so is indeed the square of $R$. You may be more familiar with the formula $R^2 = \frac{SS_{\text{regression}}}{SS_{\text{total}}}$, which immediately gives $\cos^2 \theta$, but note that $1 - \frac{SS_{\text{residual}}}{SS_{\text{total}}}$ is more general, and will (as we've just seen) reduce to $\frac{SS_{\text{regression}}}{SS_{\text{total}}}$ if a constant term is included in the model.
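To tie the two formulas together numerically, here is a final NumPy sketch on simulated data of my own (the `fit` helper is just a convenience I introduced): it confirms the Pythagorean decomposition, checks that $1 - \frac{SS_{\text{residual}}}{SS_{\text{total}}}$ equals the squared correlation between observed and fitted values when an intercept is present, and shows the identity generally failing for a regression through the origin.

```python
# Check SS_total = SS_residual + SS_regression and R^2 = cos^2(theta) with an
# intercept, then show the identity breaking down without one (simulated data).
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(0, 10, size=100)
y = 4.0 + 1.5 * x + rng.normal(scale=3.0, size=100)

def fit(X, y):
    """Return fitted values from a least-squares fit of y on the columns of X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

# With a constant term:
y_hat = fit(np.column_stack([np.ones_like(x), x]), y)
ss_tot = np.sum((y - y.mean())**2)
ss_res = np.sum((y - y_hat)**2)
ss_reg = np.sum((y_hat - y.mean())**2)

print(np.isclose(ss_tot, ss_res + ss_reg))                 # True: Pythagoras
cos_theta = np.corrcoef(y, y_hat)[0, 1]                    # cos(theta) = R
print(np.isclose(1 - ss_res / ss_tot, cos_theta**2))       # True: R^2 = cos^2(theta)

# Without a constant term the identity no longer holds:
y_hat0 = fit(x[:, None], y)                                # regression through the origin
ss_res0 = np.sum((y - y_hat0)**2)
print(np.isclose(1 - ss_res0 / ss_tot,
                 np.corrcoef(y, y_hat0)[0, 1]**2))         # generally False
```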
