[Math] Expected Value of R squared

probability-theory, statistics

Let $n$ be a fixed positive integer. Generate $n$ numbers $x_1, x_2, \dots, x_n$ from the set $[0,1]$, drawn uniformly and independently of each other. Now repeat this process to generate $y_1, \dots, y_n$. Let $X$ be a random variable which takes on $x_1, \dots, x_n$ with probability $\frac{1}{n}$ each, and let $Y$ be a random variable which takes on $y_i$ whenever $X$ takes on the value $x_i$. We can then compute the square of the correlation $R^2$ between $X$ and $Y$. What is the expected value of this $R^2$?

Another less rigorous phrasing of the problem is this: suppose we throw $n$ points at random on a graph spanning $[0,1] \times [0,1]$. What is the expected value of the $R^2$ of the line of best fit?

For instance, for $n=2$ the expected value is $1$, since the $R^2$ value is always $1$. For $n=3$ one can numerically compute the expected value to be $\frac{1}{2}$. In general, it seems that the answer is $\frac{1}{n-1}$. I don't really have any idea how to do this problem in general, and even specific cases look nontrivial. Does anyone have any ideas? This looks like it should be a well-known result, but my searching didn't turn up anything useful.
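The small cases above can be checked numerically. Here is a minimal Monte Carlo sketch in Python with NumPy (the helper name `mean_r_squared` and the trial count are just illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

def mean_r_squared(n, trials=200_000, rng=rng):
    """Estimate E[R^2] for n i.i.d. uniform (x, y) pairs by Monte Carlo."""
    # Each row of x, y is one experiment with n uniform points.
    x = rng.random((trials, n))
    y = rng.random((trials, n))
    # Center each row, then compute the sample correlation per experiment.
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    r = (xc * yc).sum(axis=1) / np.sqrt(
        (xc**2).sum(axis=1) * (yc**2).sum(axis=1)
    )
    return np.mean(r**2)

for n in (3, 4, 5, 10):
    print(n, mean_r_squared(n), 1 / (n - 1))
```

The estimates land close to $\frac{1}{n-1}$ for each $n$ tried, consistent with the conjecture.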

This has applications in that when one is working with variables which are not expected to be very highly correlated, it is often difficult to tell when an $R^2$ value is significant. This result gives an idea of how big the $R^2$ needs to be for one to deduce there is some nontrivial correlation between two variables.

Best Answer

This problem seems simple...but it's not. For example, see here for a rather complex analysis of the prima facie simple cases of ratios of normal random variables and ratios of sums of uniforms.

In general, if your pairs are not drawn from a bivariate Gaussian, there is no nice formula for $E[R^2]$.

Note:

$$R_n=\frac{n\sum x_iy_i-\sum x_i\sum y_i}{n^2s_Xs_Y}$$

where $s_X$ and $s_Y$ are the population standard deviations of the $x_i$ and $y_i$.

This mess will have some distribution $f_{R_n}(r)$ that will be very sensitive to $n$.
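As a sanity check of the formula above, one can compare it against NumPy's built-in correlation on random data (here $s_X$ and $s_Y$ are taken as the population, i.e. `ddof=0`, standard deviations; the variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
x, y = rng.random(n), rng.random(n)

# The displayed formula; np.std defaults to the population (ddof=0) form.
s_x, s_y = x.std(), y.std()
r_formula = (n * (x * y).sum() - x.sum() * y.sum()) / (n**2 * s_x * s_y)

# Reference value from NumPy's correlation matrix.
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_formula, r_numpy)  # the two agree up to floating-point error
```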

I think your best bet is to simulate this (Monte Carlo) for $n \in \{2, \dots, N\}$ using a large number of trials (you can check convergence by running each simulation twice with randomly chosen seeds, then comparing these results to each other and to the results for $n-1$).
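That two-seed convergence check can be sketched as follows (illustrative Python; the trial count and seeds are arbitrary):

```python
import numpy as np

def estimate(n, trials, seed):
    """One Monte Carlo estimate of E[R^2] for n uniform points."""
    rng = np.random.default_rng(seed)
    x = rng.random((trials, n))
    y = rng.random((trials, n))
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    r2 = (xc * yc).sum(axis=1) ** 2 / (
        (xc**2).sum(axis=1) * (yc**2).sum(axis=1)
    )
    return r2.mean()

# Run each n twice with different seeds; close agreement between the two
# estimates suggests the trial count is large enough.
for n in range(3, 8):
    a = estimate(n, 100_000, seed=1)
    b = estimate(n, 100_000, seed=2)
    print(f"n={n}: {a:.4f} vs {b:.4f}")
```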

Once you have this data, you can fit a curve to it, or to some transformation thereof. Your general equation looks reasonable in terms of how the curve will look, since:

$$E[R^2_n] \to 0 \text{ as } n \to \infty$$ for correlations between independent variables (indeed $R_n \xrightarrow{p} 0$).

Possible Solution

Since your variables are independent, I realized that we are really looking for the variance of the sample correlation, i.e., the square of the standard error of the correlation coefficient (see p. 6):

$$se_{R_n}=\sqrt{\frac{1-R^2}{n-2}}$$

However, you already know the true value of $R^2$ (it is not being estimated), so you can increase the degrees of freedom in the denominator from $n-2$ to $n-1$. Since $R^2=0$ for independent variables, this reduces to:

$$(se_{R_n})^2=\sigma^2_{R_n}=E[R^2_n]=\frac{1}{n-1}$$

There you have it: it matches your empirical results. As per Wolfies, I should note that this is an asymptotic result, but sums of uniform RVs generally exhibit good convergence properties in the spirit of the CLT, which may explain the good fit.

For further reading, see @soakley's nice reference. I was able to pull the relevant page from JSTOR:

[image: scan of the relevant page from JSTOR]

or, if you're really motivated, you can get this recent article (2005) about your exact problem.