It is often stated that the square of the sample correlation $r^2$ is equivalent to the $R^2$ coefficient of determination for simple linear regression. I have been unable to demonstrate this myself and would appreciate a full proof of this fact.
Regression – Equivalence of Sample Correlation and R Statistic for Simple Linear Regression
Related Solutions
Re-arrange the problem in terms of new variables, so that $1\leq z_1<z_2<\dots<z_n\leq U$. Then we have $(x_i,y_i)=(x_i,z_{x_i})$, as @whuber pointed out in the comments. Thus you are effectively regressing $z_j$ on $j$, and $r_{xy}=r_{xz}$. So if we can work out the marginal distribution for $z_j$ and show that it is essentially linear in $j$, the problem is done, and we will have $r_{xy}\approx 1$.
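As a quick numerical illustration of the phenomenon being explained (a simulation sketch I am adding; the values of $U$ and $n$ are arbitrary illustrative choices, not part of the argument), sampling $n$ distinct values from $\{1,\dots,U\}$ and correlating the sorted values with their ranks gives a correlation very close to $1$:

```python
import numpy as np

rng = np.random.default_rng(0)
U, n, reps = 10_000, 50, 200     # illustrative values; any U >> n behaves similarly

corrs = []
for _ in range(reps):
    # draw n distinct y-values from {1, ..., U}; sorting them gives z_1 < ... < z_n
    z = np.sort(rng.choice(np.arange(1, U + 1), size=n, replace=False))
    j = np.arange(1, n + 1)
    corrs.append(np.corrcoef(j, z)[0, 1])

print(np.mean(corrs))            # typically very close to 1
```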
We first need the joint distribution for $z_1,\dots,z_n$. This is quite simple once you have the solution, but I did not find it straightforward before doing the maths. A brief lesson in the maths paying off: I will present the maths first, then the easy answer.
Now, the original joint distribution is $p(y_1,\dots,y_n)\propto 1$. Changing variables simply relabels things for discrete probabilities, so the probability is still constant. However, the labelling is not 1-to-1, so we can't simply write $p(z_1,\dots,z_n)=\frac{(U-n)!}{U!}$. Instead, we have
$$p(z_1,\dots,z_n)=\frac{1}{C}, \qquad 1\leq z_1<z_2<\dots<z_n\leq U$$
And we can find $C$ by normalisation $$C=\sum_{z_n=n}^{U}\sum_{z_{n-1}=n-1}^{z_n-1}\dots\sum_{z_2=2}^{z_3-1}\sum_{z_1=1}^{z_2-1}(1)=\sum_{z_n=n}^{U}\sum_{z_{n-1}=n-1}^{z_n-1}\dots\sum_{z_2=2}^{z_3-1}(z_2-1)$$ $$=\sum_{z_n=n}^{U}\sum_{z_{n-1}=n-1}^{z_n-1}\dots\sum_{z_3=3}^{z_4-1}\frac{(z_3-1)(z_3-2)}{2}=\sum_{z_n=n}^{U}\dots\sum_{z_4=4}^{z_5-1}\frac{(z_4-1)(z_4-2)(z_4-3)}{(2)(3)}$$ $$=\sum_{z_n=n}^{U}\sum_{z_{n-1}=n-1}^{z_n-1}\dots\sum_{z_{j}=j}^{z_{j+1}-1}{z_j-1 \choose j-1}={U \choose n}$$
This shows that the relabelling ratio is equal to $\frac{(U-n)!}{U!}{U \choose n}=\frac{1}{n!}$: for each $(z_1,\dots,z_n)$ there are $n!$ values of $(y_1,\dots,y_n)$. This makes sense because any permutation of the labels on the $y_i$ leads to the same set of ranked $z_i$ values. Now, for the marginal distribution of $z_1$, we repeat the above but with the sum over $z_1$ dropped and a different range of summation for the remainder (the minimums change from $(2,\dots,n)$ to $(z_1+1,\dots,z_1+n-1)$), and we get:
$$p(z_1)=\sum_{z_n=z_1+n-1}^{U}\;\;\sum_{z_{n-1}=z_1+n-2}^{z_n-1}\dots\sum_{z_2=z_1+1}^{z_3-1}p(z_1,z_2,\dots,z_n)=\frac{{U-z_1 \choose n-1}}{{U \choose n}}$$
With support $z_1\in\{1,2,\dots,U+1-n\}$. This form, combined with a bit of intuition shows that the marginal distribution of any $z_j$ can be reasoned out by:
- choosing $j-1$ values below $z_j$, which can be done in ${z_j-1\choose j-1}$ ways (if $z_j\geq j$);
- choosing the value $z_j$, which can be done 1 way; and
- choosing $n-j$ values above $z_j$ which can be done in ${U-z_j\choose n-j}$ ways (if $z_j\leq U+j-n$)
This method of reasoning generalises effortlessly to joint distributions, such as $p(z_j,z_k)$ (which can be used to calculate the expected value of the sample covariance if you want). Hence we have:
$$p(z_j)=\frac{{z_j-1\choose j-1}{U-z_j\choose n-j}}{{U \choose n}}, \qquad j\leq z_j\leq U+j-n$$ $$p(z_j,z_k)=\frac{{z_j-1\choose j-1}{z_k-z_j-1 \choose k-j-1}{U-z_k\choose n-k}}{{U \choose n}}, \qquad j\leq z_j\leq z_k+j-k\leq U+j-n$$
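These closed forms are easy to confirm by brute force for small cases (a Python enumeration sketch I am adding; the values of $U$, $n$, $j$, $k$ are arbitrary small choices):

```python
from itertools import combinations
from math import comb
from collections import Counter

U, n = 8, 4                      # small illustrative values so we can enumerate everything
j, k = 2, 3                      # which order statistics to check (1-indexed, j < k)

# every increasing sequence z_1 < ... < z_n is equally likely, with probability 1/comb(U, n)
samples = list(combinations(range(1, U + 1), n))

marg = Counter(s[j - 1] for s in samples)
joint = Counter((s[j - 1], s[k - 1]) for s in samples)

# marginal: count of samples whose j-th smallest value equals z_j
for zj in range(j, U + j - n + 1):
    assert marg[zj] == comb(zj - 1, j - 1) * comb(U - zj, n - j)

# joint: count of samples whose j-th and k-th smallest values equal (z_j, z_k)
for (zj, zk), cnt in joint.items():
    assert cnt == comb(zj - 1, j - 1) * comb(zk - zj - 1, k - j - 1) * comb(U - zk, n - k)

print("closed-form pmfs match enumeration")
```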
Now the marginal is the pmf of a negative hypergeometric distribution with parameters $k=j,r=n,N=U$ (in terms of the paper's notation). This is clearly not exactly linear in $j$, but the marginal expectation of $z_j$ is
$$E(z_j)=j\frac{U+1}{n+1}$$
This is indeed linear in $j$, and you would expect a slope coefficient of $\frac{U+1}{n+1}$ from the regression, and an intercept of zero.
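The linear expectation can be verified exactly from the marginal pmf (again a small check I am adding, using exact rational arithmetic):

```python
from fractions import Fraction
from math import comb

U, n = 12, 5                     # arbitrary small values

for j in range(1, n + 1):
    # expectation of z_j under the marginal pmf given above
    mean = sum(Fraction(zj * comb(zj - 1, j - 1) * comb(U - zj, n - j), comb(U, n))
               for zj in range(j, U + j - n + 1))
    assert mean == Fraction(j * (U + 1), n + 1)

print("E(z_j) = j(U+1)/(n+1) confirmed for U=12, n=5")
```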
UPDATE
I stopped my answer a bit short before; hopefully this is now a more complete answer.
Letting $\overline{j}=\frac{n+1}{2}$, and $\overline{z}=\frac{1}{n}\sum_{j=1}^{n}z_j$, the expected square of the sample covariance between $j$ and $z_j$ is given by:
$$E[s_{xz}^2]=E\left[\left(\frac{1}{n}\sum_{j=1}^{n}(j-\overline{j})(z_j-\overline{z})\right)^2\right]=E\left[\left(\frac{1}{n}\sum_{j=1}^{n}(j-\overline{j})z_j\right)^2\right]$$ (the $\overline{z}$ term drops out because $\sum_{j=1}^{n}(j-\overline{j})=0$), so $$E[s_{xz}^2]=\frac{1}{n^2}\left[\sum_{j=1}^{n}(j-\overline{j})^2E(z_j^2)+2\sum_{k=2}^{n}\sum_{j=1}^{k-1}(j-\overline{j})(k-\overline{j})E(z_jz_k)\right]$$
So we need $E(z_j^2)=V(z_j)+E(z_j)^2=Aj^2+Bj$, where $A=\frac{(U+1)(U+2)}{(n+1)(n+2)}$ and $B=\frac{(U+1)(U-n)}{(n+1)(n+2)}$ (using the formula in the pdf file). So the first sum becomes
$$\sum_{j=1}^{n}(j-\overline{j})^2E(z_j^2)=\sum_{j=1}^{n}(j^2-2j\overline{j}+\overline{j}^2)(Aj^2+Bj)$$ $$=\frac{n(n-1)(U+1)}{120}\bigg( 2U(2n+1)+(3n-1)\bigg)$$
We also need $E(z_jz_k)=E[z_j(z_k-z_j)]+E(z_j^2)$.
$$E[z_j(z_k-z_j)]=\sum_{z_k=k}^{U+k-n}\sum_{z_j=j}^{z_k+j-k}z_j(z_k-z_j) p(z_j,z_k)$$ $$=j(k-j)\sum_{z_k=k}^{U+k-n}\sum_{z_j=j}^{z_k+j-k}\frac{{z_j\choose j}{z_k-z_j \choose k-j}{U-z_k\choose n-k}}{{U \choose n}}=j(k-j)\sum_{z_k=k}^{U+k-n}\frac{{z_k+1 \choose k+1}{U+1-(z_k+1)\choose n-k}}{{U \choose n}}$$ $$=j(k-j)\frac{{U+2\choose n+2}}{{U \choose n}}=j(k-j)\frac{(U+1)(U+2)}{(n+1)(n+2)}=Aj(k-j)$$ $$\implies E(z_jz_k)=Aj(k-j)+Aj^2+Bj=Ajk+Bj \qquad (j<k)$$
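This product-moment formula can also be checked exactly by enumerating all ${U \choose n}$ samples for a small case (a sketch I am adding; the values $U=9$, $n=4$ are arbitrary):

```python
from fractions import Fraction
from itertools import combinations
from math import comb

U, n = 9, 4
A = Fraction((U + 1) * (U + 2), (n + 1) * (n + 2))
B = Fraction((U + 1) * (U - n), (n + 1) * (n + 2))

samples = list(combinations(range(1, U + 1), n))   # each equally likely

for j in range(1, n + 1):
    for k in range(j, n + 1):
        # exact expectation of z_j * z_k; the j = k case reduces to E(z_j^2) = A j^2 + B j
        mean = Fraction(sum(s[j - 1] * s[k - 1] for s in samples), len(samples))
        assert mean == A * j * k + B * j

print("E(z_j z_k) = A j k + B j confirmed for U=9, n=4")
```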
And the second sum is:
$$2\sum_{k=2}^{n}\sum_{j=1}^{k-1}(j-\overline{j})(k-\overline{j})E(z_jz_k)$$ $$=\frac{n(n-1)(U+1)}{720(n+2)}\bigg((U+2)(5 n^3- 24 n^2- 35 n +6) + 6(U-n)(n^2-5n-4)\bigg)$$
And so, after some rather tedious manipulations, the expected value of the squared covariance is:
$$E[s_{xz}^2]=\frac{(U+1)(U+2)(n-1)^2(n+1)}{144(n+2)}+\frac{(U+1)(U-n)(n-1)(n^2+1)}{120\,n(n+2)}$$
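Again, the closed form can be confirmed exactly by brute force for a small case (another added check, with arbitrary small $U$ and $n$):

```python
from fractions import Fraction
from itertools import combinations

U, n = 9, 4
jbar = Fraction(n + 1, 2)

# brute-force E[s_xz^2] over all equally likely ordered samples z_1 < ... < z_n
samples = list(combinations(range(1, U + 1), n))
total = Fraction(0)
for s in samples:
    zbar = Fraction(sum(s), n)
    s_xz = sum((j - jbar) * (s[j - 1] - zbar) for j in range(1, n + 1)) / n
    total += s_xz ** 2
lhs = total / len(samples)

rhs = (Fraction((U + 1) * (U + 2) * (n - 1) ** 2 * (n + 1), 144 * (n + 2))
       + Fraction((U + 1) * (U - n) * (n - 1) * (n ** 2 + 1), 120 * n * (n + 2)))
assert lhs == rhs
print("E[s_xz^2] formula confirmed exactly for U=9, n=4")
```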
The first term dominates for large $n$, as it is $O(U^2n^2)$ whereas the second term is $O(U^2n)$. We can show that the dominant term is well approximated by $E[s_{x}^2s_{z}^2]$ when $U$ and $n$ are both large, and we have another theoretical reason why the Pearson correlation is very close to $1$ (beyond the fact that $E(z_j)\propto j$).
Now the expected sample variance of $j$ is just the sample variance, which is $s_x^2=\frac{1}{n}\sum_{j=1}^{n}(j-\overline{j})^2=\frac{(n+1)(n-1)}{12}$. The expected sample variance for $z_j$ is given by:
$$E[s_z^2]=E\left[\frac{1}{n}\sum_{j=1}^{n}(z_j-\overline{z})^2\right]=\frac{1}{n}\sum_{j=1}^{n}E(z_j^2)-E\left(\overline{z}^2\right)$$ where $E(\overline{z}^2)=V(\overline{z})+E(\overline{z})^2=\frac{(U+1)(U-n)}{12n}+\frac{(U+1)^2}{4}$, the first piece being the variance of the mean of a sample drawn without replacement from $\{1,\dots,U\}$. Hence $$E[s_z^2]=\frac{A(n+1)(2n+1)}{6}+\frac{B(n+1)}{2}-\frac{(U+1)^2}{4}-\frac{(U+1)(U-n)}{12n}$$ $$=\frac{(n-1)U(U+1)}{12n}$$
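This expectation can be verified in the same way (an added exact check for a small case):

```python
from fractions import Fraction
from itertools import combinations

U, n = 9, 4

samples = list(combinations(range(1, U + 1), n))
total = Fraction(0)
for s in samples:
    zbar = Fraction(sum(s), n)
    total += sum((z - zbar) ** 2 for z in s) / n     # s_z^2 with the 1/n convention
lhs = total / len(samples)

assert lhs == Fraction((n - 1) * U * (U + 1), 12 * n)
print("E[s_z^2] = (n-1)U(U+1)/(12n) confirmed for U=9, n=4")
```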
Combining everything together, and noting that $E[s_x^2s_z^2]=s_x^2E[s_z^2]$, we have:
$$E[s_x^2s_z^2]=\frac{(n+1)(n-1)^2U(U+1)}{144\,n}\approx \frac{(U+1)(U+2)(n-1)^2(n+1)}{144(n+2)}\approx E[s_{xz}^2]$$ with the approximations holding when $U$ and $n$ are both large.
This is, heuristically, the statement that $E[r_{xz}^2]\approx 1$.
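To see how quickly the approximation kicks in, we can compute the exact ratio $E[s_{xz}^2]/(s_x^2E[s_z^2])$ from the closed forms above for a few values of $U$ and $n$ (an illustrative sketch I am adding; the particular $(U,n)$ pairs are arbitrary):

```python
from fractions import Fraction

def ratio(U, n):
    # E[s_xz^2] / (s_x^2 * E[s_z^2]), all with the 1/n convention used above
    e_sxz2 = (Fraction((U + 1) * (U + 2) * (n - 1) ** 2 * (n + 1), 144 * (n + 2))
              + Fraction((U + 1) * (U - n) * (n - 1) * (n ** 2 + 1), 120 * n * (n + 2)))
    sx2 = Fraction((n + 1) * (n - 1), 12)
    e_sz2 = Fraction((n - 1) * U * (U + 1), 12 * n)
    return e_sxz2 / (sx2 * e_sz2)

for U, n in [(100, 10), (10_000, 100), (1_000_000, 1_000)]:
    print(U, n, float(ratio(U, n)))   # climbs towards 1 as U and n both grow
```

The printed ratios approach $1$ as both $U$ and $n$ grow, which is the heuristic sense in which $E[r_{xz}^2]\approx 1$.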
The coefficient-of-determination can be determined from the correlations: Consider a multiple linear regression with $m$ explanatory vectors and an intercept term. First we define the correlation values for all the variables in the problem $r_i = \mathbb{Corr}(\mathbf{y},\mathbf{x}_i)$ and $r_{i,j} = \mathbb{Corr}(\mathbf{x}_i,\mathbf{x}_j)$. Now define the goodness of fit vector and design correlation matrix respectively by:
$$\boldsymbol{r}_{\mathbf{y},\mathbf{x}} = \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_m \end{bmatrix} \quad \quad \quad \boldsymbol{r}_{\mathbf{x},\mathbf{x}} = \begin{bmatrix} r_{1,1} & r_{1,2} & \cdots & r_{1,m} \\ r_{2,1} & r_{2,2} & \cdots & r_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ r_{m,1} & r_{m,2} & \cdots & r_{m,m} \\ \end{bmatrix}.$$
The goodness-of-fit vector contains the correlations between the response vector and each of the explanatory vectors. The design correlation matrix contains the correlations between each pair of explanatory vectors. (Please note that these names are something I have made up, since neither matrix has a standard name that I am aware of. The first vector measures the goodness-of-fit of simple regressions on each of the individual explanatory vectors, which is why I use this name.) Now, with a bit of linear algebra it can be shown that the coefficient-of-determination for the multiple linear regression is given by the following quadratic form:
$$R^2 = \boldsymbol{r}_{\mathbf{y},\mathbf{x}}^\text{T} \boldsymbol{r}_{\mathbf{x},\mathbf{x}}^{-1} \boldsymbol{r}_{\mathbf{y},\mathbf{x}}.$$
This form for the coefficient-of-determination is not all that well-known to statistical practitioners, but it is a very useful result, and assists in framing the goodness-of-fit of the multiple linear regression in its most fundamental terms. The square-root of the coefficient of determination gives us the multiple correlation coefficient, which is a multivariate extension of the absolute correlation. In the special case where $m=1$ you get $R^2 = r_1^2$ so that the coefficient-of-determination is the square of the correlation between the response vector and the (single) explanatory variable.
As you can see, this form for the coefficient-of-determination for the multiple linear regression is framed fully in terms of correlations between the pairs of vectors going into the regression. This means that if you have a matrix of the pairwise correlations between all the vectors in the multiple regression (the response vector and each of the explanatory vectors) then you can directly determine the coefficient-of-determination without fitting the regression model. This result is more commonly presented in multivariate analysis (see e.g., Mardia, Kent and Bibby 1979, p. 168).
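As a concrete illustration (a sketch I am adding; the simulated data and variable names are mine, not from the answer), the quadratic form reproduces the $R^2$ of a fitted multiple regression using only the pairwise correlations:

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, m = 200, 3
X = rng.normal(size=(n_obs, m))
X[:, 2] += 0.5 * X[:, 0]                      # make the explanatory vectors correlated
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n_obs)

# R^2 from the fitted regression (with intercept)
Xd = np.column_stack([np.ones(n_obs), X])
beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
resid = y - Xd @ beta
r2_fit = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)

# R^2 from pairwise correlations only
r_yx = np.array([np.corrcoef(y, X[:, i])[0, 1] for i in range(m)])
r_xx = np.corrcoef(X, rowvar=False)
r2_corr = r_yx @ np.linalg.solve(r_xx, r_yx)

print(r2_fit, r2_corr)                         # the two values agree
```

Because the explanatory vectors here are correlated, `np.sum(r_yx ** 2)` would not reproduce `r2_corr`; only the quadratic form does.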
The coefficient-of-determination is not generally equal to the sum of individual coefficients: In the case where all the explanatory vectors are uncorrelated with each other you get $\boldsymbol{r}_{\mathbf{x},\mathbf{x}} = \boldsymbol{I}$, which means that the above quadratic form reduces to $R^2 = \sum r_i^2$. However, this special case generally arises in practice only when the explanatory variables are set by the researcher, as in a designed experiment. The explanatory variables are not generally uncorrelated, and so the coefficient-of-determination must be computed from the full quadratic form above.
It is also useful to note that the coefficient-of-determination in a multiple linear regression can be above or below the sum of the individual coefficients-of-determination for corresponding simple linear regressions. Usually it is below this sum (since the total explanatory power is usually less than the sum of its parts) but sometimes it is above this sum.
Best Answer
There seems to be some variation in notation: in a simple linear regression, I've usually seen the phrase "sample correlation coefficient" with symbol $r$ as a reference to the correlation between observed $x$ and $y$ values. This is the notation I have adopted for this answer. I have also seen the same phrase and symbol used to refer to the correlation between observed $y$ and fitted $\hat y$; in my answer I have referred to this as the "multiple correlation coefficient" and used the symbol $R$. This answer addresses why the coefficient of determination is both the square of $r$ and also the square of $R$, so it shouldn't matter which usage was intended.
The $r^2$ result follows in one line of algebra once some straightforward facts about correlation and the meaning of $R$ are established, so you may prefer to skip down to the boxed equation. I assume we don't have to prove basic properties of covariance and variance, in particular:
$$\text{Cov}(aX+b, Y) = a\text{Cov}(X,Y)$$ $$\text{Var}(aX+b) = a^2\text{Var}(X)$$
Note that the latter can be derived from the former, once we know that covariance is symmetric and that $\text{Var}(X)= \text{Cov}(X,X)$. From here we derive another basic fact, about correlation. For $a \neq 0$, and so long as $X$ and $Y$ have non-zero variances,
$$\begin{align} \text{Cor}(aX+b, Y) &= \frac{\text{Cov}(aX+b, Y)}{\sqrt{\text{Var}(aX+b) \text{Var} (Y)}} \\ &= \frac{a}{\sqrt{a^2}} \times \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X) \text{Var} (Y)}} \\ \text{Cor}(aX+b, Y) &= \text{sgn}(a) \, \text{Cor}(X,Y) \end{align} $$
Here $\text{sgn}(a)$ is the signum or sign function: its value is $\text{sgn}(a) = +1$ if $a>0$ and $\text{sgn}(a) = -1$ if $a<0$. It's also true that $\text{sgn}(a) = 0$ if $a=0$, but that case doesn't concern us: $aX+b$ would be a constant, so $\text{Var}(aX+b) = 0$ in the denominator and we can't calculate the correlation. Symmetry arguments let us generalise this result, for $a, \, c \neq 0$:
$$\text{Cor}(aX+b, \, cY+d) = \text{sgn}(a) \, \text{sgn}(c) \, \text{Cor}(X,Y)$$
We won't need this more general formula to answer the current question, but I include it to emphasise the geometry of the situation: it simply states that correlation is unchanged when either variable is scaled or translated, but reverses in sign when a variable is reflected.
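A short numerical check of this invariance (my own sketch with arbitrary simulated data, not part of the original answer):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=1000)
y = 0.6 * x + rng.normal(size=1000)

r = np.corrcoef(x, y)[0, 1]
r_transformed = np.corrcoef(-3.0 * x + 7.0, 2.0 * y - 1.0)[0, 1]

print(r, r_transformed)       # r_transformed equals sgn(-3) * sgn(2) * r, i.e. -r
```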
We need one more fact: for a linear model including a constant term, the coefficient of determination $R^2$ is the square of the multiple correlation coefficient $R$, which is the correlation between the observed responses $Y$ and the model's fitted values $\hat Y$. This applies for both multiple and simple regressions, but let us restrict our attention to the simple linear model $\hat Y = \hat \beta_0 + \hat \beta_1 X$. The result follows from the observation that $\hat Y$ is a scaled, possibly reflected, and translated version of $X$:
$$\boxed{R = \text{Cor}(\hat Y, Y) = \text{Cor}(\hat \beta_0 + \hat \beta_1 X, \, Y) = \text{sgn}(\hat \beta_1) \, \text{Cor}(X, Y) = \text{sgn}(\hat \beta_1) \, r}$$
So $R = \pm r$, where the sign matches the sign of the estimated slope; since the estimated slope and $r$ always have the same sign, this guarantees $R$ is not negative. Clearly $R^2 = r^2$.
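Numerically (an added sketch with simulated data; I deliberately use a negative slope so that $r<0$ while $R>0$):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 2.0 - 1.5 * x + rng.normal(size=100)      # negative slope, so r < 0 but R > 0

b1, b0 = np.polyfit(x, y, 1)                  # slope, intercept of the least-squares line
y_hat = b0 + b1 * x

r = np.corrcoef(x, y)[0, 1]                   # sample correlation of x and y
R = np.corrcoef(y_hat, y)[0, 1]               # multiple correlation: Cor(fitted, observed)

print(R, np.sign(b1) * r)                     # equal
print(R ** 2, r ** 2)                         # both equal the coefficient of determination
```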
The preceding argument was made simpler by not having to consider sums of squares. To achieve this, I skipped over the details of the relationship between $R^2$, which we normally think of in terms of sums of squares, and $R$, for which we think about correlations of fitted and observed responses. The symbols make the relationship $R^2 = (R)^2$ seem tautological but this is not the case, and the relationship breaks down if there is no intercept term in the model! I'll give a brief sketch of a geometric argument about the relationship between $R$ and $R^2$ taken from a different question: the diagram is drawn in $n$-dimensional subject space, so each axis (not shown) represents a single unit of observation, and variables are shown as vectors. The columns of the design matrix $\mathbf{X}$ are the vector $\mathbf{1_n}$ (for the constant term) and the vector of observations of the explanatory variable, so the column space is a two-dimensional flat.
The fitted $\mathbf{\hat{Y}}$ is the orthogonal projection of the observed $\mathbf{Y}$ onto the column space of $\mathbf{X}$. This means the vector of residuals $\mathbf{e} = \mathbf{Y} - \mathbf{\hat{Y}}$ is perpendicular to the flat, and hence to $\mathbf{1_n}$. The dot product is $0 = \mathbf{1_n} \cdot \mathbf{e} = \sum_{i=1}^n e_i$. As the residuals sum to zero and $Y_i = \hat{Y_i} + e_i$, then $\sum_{i=1}^n Y_i = \sum_{i=1}^n \hat{Y_i}$ so that both fitted and observed responses have mean $\bar{Y}$. The dashed lines in the diagram, $\mathbf{Y} - \bar{Y}\mathbf{1_n}$ and $\mathbf{\hat{Y}} - \bar{Y}\mathbf{1_n}$, are therefore the centered vectors for the observed and fitted responses, and the cosine of the angle $\theta$ between them is their correlation $R$.
The triangle these vectors form with the vector of residuals is right-angled since $\mathbf{\hat{Y}} - \bar{Y}\mathbf{1_n}$ lies in the flat but $\mathbf{e}$ is orthogonal to it. Applying Pythagoras:
$$\|\mathbf{Y} - \bar{Y}\mathbf{1_n}\|^2 = \|\mathbf{Y} - \mathbf{\hat{Y}}\|^2 + \|\mathbf{\hat{Y}} - \bar{Y}\mathbf{1_n}\|^2 $$
This is just the decomposition of the sums of squares, $SS_{\text{total}} = SS_{\text{residual}} + SS_{\text{regression}}$. The conventional formula for the coefficient of determination is $1 - \frac{SS_{\text{residual}}}{SS_{\text{total}}}$ which in this triangle is $1 - \sin^2 \theta = \cos^2 \theta$ so is indeed the square of $R$. You may be more familiar with the formula $R^2 = \frac{SS_{\text{regression}}}{SS_{\text{total}}}$, which immediately gives $\cos^2 \theta$, but note that $1 - \frac{SS_{\text{residual}}}{SS_{\text{total}}}$ is more general, and will (as we've just seen) reduce to $\frac{SS_{\text{regression}}}{SS_{\text{total}}}$ if a constant term is included in the model.
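The sum-of-squares decomposition and the identity $R^2=\cos^2\theta$ can likewise be checked numerically (a final added sketch; simulated data, least squares via `numpy.linalg.lstsq`):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=50)
y = 1.0 + 0.8 * x + rng.normal(size=50)

X = np.column_stack([np.ones_like(x), x])     # design matrix with a constant column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
e = y - y_hat

ss_total = np.sum((y - y.mean()) ** 2)
ss_resid = np.sum(e ** 2)
ss_reg = np.sum((y_hat - y.mean()) ** 2)

print(ss_total, ss_resid + ss_reg)              # Pythagoras: the two agree

cos_theta = np.corrcoef(y, y_hat)[0, 1]         # cosine of the angle between centred vectors
print(1 - ss_resid / ss_total, cos_theta ** 2)  # both give R^2
```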