Solved – Relationship between $R^2$ and correlation coefficient

correlation, r-squared

Let's say I have two 1-dimensional arrays, $a_1$ and $a_2$. Each contains 100 data points. $a_1$ is the actual data, and $a_2$ is the model prediction. In this case, the $R^2$ value would be:
$$
R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \quad (1).
$$

Meanwhile, this should be equal to the squared correlation coefficient,
$$
R^2 = (\text{Correlation Coefficient})^2 \quad (2).
$$

Now suppose I swap the two: $a_2$ is the actual data, and $a_1$ is the model prediction. By equation $(2)$, since the correlation coefficient does not care which array comes first, the $R^2$ value stays the same. However, by equation $(1)$, with $SS_{tot}=\sum_i(y_i - \bar y)^2$, the $R^2$ value will change: $SS_{tot}$ changes when we switch $y$ from $a_1$ to $a_2$, while $SS_{res}=\sum_i(y_i - f_i)^2$ does not.

My question is: How can these contradict each other?
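The asymmetry described above is easy to see numerically. Below is a minimal sketch (with synthetic arrays standing in for $a_1$ and $a_2$; the names and data are illustrative, not from the question) that computes $R^2$ from equation $(1)$ in both directions and compares it with the squared correlation from equation $(2)$:

```python
import numpy as np

# Synthetic stand-ins for a_1 (actual) and a_2 (model prediction).
rng = np.random.default_rng(0)
a1 = rng.normal(size=100)
a2 = a1 + 0.5 * rng.normal(size=100)  # imperfect "prediction" of a1

def r2_score(y, f):
    """R^2 from Eq. (1): 1 - SS_res / SS_tot, with y the actual data."""
    ss_res = np.sum((y - f) ** 2)             # SS_res: same under a swap
    ss_tot = np.sum((y - np.mean(y)) ** 2)    # SS_tot: depends on which is y
    return 1 - ss_res / ss_tot

r_squared = np.corrcoef(a1, a2)[0, 1] ** 2    # Eq. (2): symmetric by definition

print(r2_score(a1, a2))  # a_1 treated as the actual data
print(r2_score(a2, a1))  # roles swapped: generally a different value
print(r_squared)         # unchanged under the swap
```

Note that `r2_score` here evaluates arbitrary predictions; it coincides with the squared correlation only when $f$ comes from an ordinary least-squares fit with an intercept, which is exactly the situation the answer below analyzes.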

Edit:

  1. Does the relationship in Eq. (2) still hold if the regression is not a simple linear one, i.e., if the relationship between the IV and DV is not linear (e.g., exponential or logarithmic)?

  2. Does this relationship still hold if the prediction errors do not sum to zero?

Best Answer

It is true that $SS_{tot}$ will change ... but you forgot that the regression sum of squares changes as well. So let's consider the simple regression model and denote the squared correlation coefficient by $r_{xy}^2=\dfrac{S_{xy}^2}{S_{xx}S_{yy}}$, where the subscript $xy$ emphasizes that $x$ is the independent variable and $y$ is the dependent variable. Obviously, $r_{xy}^2$ is unchanged if you swap $x$ with $y$. We can easily show that $SSR_{xy}=S_{yy}R_{xy}^2$, where $SSR_{xy}$ is the regression sum of squares and $S_{yy}$ is the total sum of squares when $x$ is the independent and $y$ the dependent variable. Therefore: $$R_{xy}^2=\dfrac{SSR_{xy}}{S_{yy}}=\dfrac{S_{yy}-SSE_{xy}}{S_{yy}},$$ where $SSE_{xy}$ is the corresponding residual sum of squares. Note that in this case we have $SSR_{xy}=b^2_{xy}S_{xx}$ with $b_{xy}=\dfrac{S_{xy}}{S_{xx}}$ (see e.g. Eq. (34)-(41) here). Therefore: $$R_{xy}^2=\dfrac{SSR_{xy}}{S_{yy}}=\dfrac{\dfrac{S^2_{xy}}{S^2_{xx}}\,S_{xx}}{S_{yy}}=\dfrac{S^2_{xy}}{S_{xx}S_{yy}}.$$ This expression is clearly symmetric with respect to $x$ and $y$. In other words: $$R_{xy}^2=R_{yx}^2.$$ To summarize: when you swap $x$ with $y$ in the simple regression model, both the numerator and denominator of $R_{xy}^2=\dfrac{SSR_{xy}}{S_{yy}}$ change in such a way that $R_{xy}^2=R_{yx}^2$.
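As a sanity check, the symmetry can be verified numerically: fit an ordinary least-squares line in both directions and compare each fit's $R^2$ with the squared correlation. This is a sketch on synthetic data (the variable names and the use of `np.polyfit` are my choices, not part of the answer):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(size=100)

def ols_r2(x, y):
    """R^2 of the simple OLS fit of y on x (x independent, y dependent)."""
    b, a = np.polyfit(x, y, 1)            # slope b = S_xy / S_xx, intercept a
    residuals = y - (a + b * x)
    ss_res = np.sum(residuals ** 2)       # SSE_xy
    ss_tot = np.sum((y - y.mean()) ** 2)  # S_yy
    return 1 - ss_res / ss_tot

r2_xy = ols_r2(x, y)                      # y regressed on x
r2_yx = ols_r2(y, x)                      # roles swapped
r2_corr = np.corrcoef(x, y)[0, 1] ** 2    # squared correlation

# All three agree up to floating-point error.
print(r2_xy, r2_yx, r2_corr)
```

Both regressions report the same $R^2$ because, as shown above, swapping $x$ and $y$ changes $SSE$ and $S_{yy}$ together, leaving the ratio equal to $r_{xy}^2$.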