Solved – Pearson correlation between a variable and its square

correlation, linear, pearson-r

Here is my R code to get familiarised with Pearson's correlation. I generate values of $X$ from 1 to 100, then find the correlation between $X$ and $X^2$:

x <- 1:100
y <- x^2  # square each element (vectorised; no loop needed)
cor.test(x, y, method = "pearson")  # the argument is "method", not "type"

I get this result:

Pearson's product-moment correlation

data:  x and y
t = 38.668, df = 98, p-value < 2.2e-16
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.9538354 0.9789069
sample estimates:
      cor 
0.9687564 

$r$ seems high to me.

My question is: what exactly does the $r$ coefficient quantify? Does it only quantify how close the relationship between the variables $X$ and $Y$ is to a linear relationship?

Or is it also suited to quantifying the strength of a relationship between $X$ and $Y$ broadly speaking (whether or not that relationship is close to linear)?

My last question is: are there other correlation tests better suited than Pearson's to quantify the strength of the relationship between two given variables when the kind of relationship (linear, quadratic, exponential, etc.) is not known a priori, or is Pearson's test sufficient for this kind of job?

Best Answer

You are curious about whether your value of $r$ is "too high": it seems you think that, because $X$ and $X^2$ do not have an exactly linear relationship, Pearson's $r$ should be rather low. The high $r$ is not telling you that the relationship is linear; it is telling you that the relationship is rather close to being linear.

If you are specifically interested in the case where $X$ is uniform, you might want to look at this thread on Math SE on the covariance between a uniform distribution and its square. You are using a discrete uniform distribution on $1,2,\dots,n$, but if you rescaled $X$ by a factor of $1/n$, and hence $X^2$ by a factor of $1/n^2$, the correlation would be unchanged, since correlation is not affected by rescaling by a positive factor. You would then have a discrete uniform distribution with equal probability masses on $\frac{1}{n}, \frac{2}{n}, \dots, \frac{n-1}{n}, 1$. For large $n$, this approximates a continuous uniform distribution (also called a "rectangular distribution") on $[0,1]$.
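As a quick numerical sketch of that rescaling invariance (base R only):

x <- 1:100
# rescaling both variables by positive constants leaves the correlation unchanged
all.equal(cor(x, x^2), cor(x / 100, (x / 100)^2))  # TRUE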

By an argument analogous to that on the Math SE thread, we have:

$$\operatorname{Cov}(X,X^2) = \mathbb{E}(X^3)-\mathbb{E}(X)\mathbb{E}(X^2) = \int_0^1 x^3 dx - \int_0^1 x dx \cdot \int_0^1 x^2 dx$$

Evaluating these integrals gives $\frac{1}{4} - \frac{1}{2} \cdot \frac{1}{3} = \frac{1}{12}$.

We also have $\operatorname{Var}(X) = \mathbb{E}(X^2)-\mathbb{E}(X)^2 = \frac{1}{3} - \left(\frac{1}{2}\right)^2 = \frac{1}{12}$.

Similarly we find $\operatorname{Var}(X^2) = \mathbb{E}(X^4)-\mathbb{E}(X^2)^2 = \frac{1}{5} - \left(\frac{1}{3}\right)^2 = \frac{4}{45}$.
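If you prefer to double-check these moments numerically rather than by hand, here is a minimal sketch using base R's integrate():

integrate(function(x) x,   0, 1)$value  # E(X)   = 1/2
integrate(function(x) x^2, 0, 1)$value  # E(X^2) = 1/3
integrate(function(x) x^3, 0, 1)$value  # E(X^3) = 1/4
integrate(function(x) x^4, 0, 1)$value  # E(X^4) = 1/5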

Hence, if $X \sim U(0,1)$, then:

$$\operatorname{Corr}(X,X^2) = \frac{\operatorname{Cov}(X,X^2)}{\sqrt{\operatorname{Var}(X) \cdot \operatorname{Var}(X^2)}} = \frac{\frac{1}{12}}{\sqrt{{\frac{1}{12}}\cdot{\frac{4}{45}}}} = \frac{\sqrt{15}}{4}$$

To seven decimal places, this is $r = 0.9682458$, even though the relationship is quadratic rather than linear. Now you have taken a discrete uniform distribution on $1, 2, \dots, n$ rather than a continuous one, but for the reasons explained above, increasing $n$ produces a correlation ever closer to the continuous case, with $\sqrt{15}/4$ as the limiting value. Let us confirm this in R:

# Pearson correlation between 1, ..., n and its squares
corn <- function(n){
  x <- 1:n
  cor(x, x^2)
}

> corn(2)
[1] 1
> corn(3)
[1] 0.9897433
> corn(4)
[1] 0.984374
> corn(5)
[1] 0.9811049
> corn(10)
[1] 0.9745586
> corn(100)
[1] 0.9688545
> corn(1e3)
[1] 0.9683064
> corn(1e6)
[1] 0.9682459
> corn(1e7)
[1] 0.9682458
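As a further check, we can simulate the continuous case directly; a minimal sketch (the seed and sample size are arbitrary choices):

set.seed(1)      # arbitrary seed, for reproducibility
u <- runif(1e6)  # large sample approximating a continuous U(0,1)
cor(u, u^2)      # close to sqrt(15)/4 = 0.9682458...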

That correlation of $r=0.9682458$ may sound surprisingly high, but if we inspected a graph of the relationship between $X$ and $X^2$ it would indeed appear approximately linear, and this is all that the correlation coefficient is telling you. Moreover, the table of output from the corn function shows that increasing $n$ makes the linear correlation smaller (note that with two points we had a perfect linear fit and a correlation equal to one!), but although $r$ falls, it is bounded below by $\sqrt{15}/4$. In other words, increasing the length of your sequence of integers makes the linear fit somewhat worse, but even as $n$ tends to infinity your $r$ never drops below $0.9682\dots$.

x=1:100; y=x^2
plot(x,y)
abline(lm(y~x))

[Figure: scatter plot of uniform x against its square, with the fitted regression line]

Perhaps you are still not visually convinced that the correlation is as strong as the calculated coefficient suggests: clearly the points lie below the line of best fit for low and high values of $X$, and above it for intermediate values. If the line cannot capture this quadratic curvature, is it really such a good fit to the points?

You may find it helpful to compare the overall variation of the $Y$ coordinates about their own mean (the "total variation") with how much the points vary above and below the regression line (the "residual variation" that the regression line was unable to explain). The ratio of the residual variation to the total variation tells you what proportion of the variation was not explained by the regression line; one minus this ratio is the proportion of variation that is explained, and is called the $R^2$. In this case, the variation of points above and below the line is small compared to the variation in their $Y$ coordinates, so the proportion unexplained by the regression is small and the $R^2$ is large. It turns out that for a simple linear regression, $R^2$ equals the square of the Pearson correlation: $r=\sqrt{R^2}$ if the regression slope is positive (an increasing relationship) and $r=-\sqrt{R^2}$ if the slope is negative (decreasing).
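Here is a minimal sketch verifying that identity on our data (base R only):

x <- 1:100
y <- x^2
fit <- lm(y ~ x)            # simple linear regression of y on x
summary(fit)$r.squared      # R^2 of the linear fit
cor(x, y)^2                 # squared Pearson correlation: the same value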

We had a large $R^2$, so our correlation is large also. This is the sense we mean when we state that "a Pearson correlation near $\pm 1$ indicates the linear fit is good": not that our straight regression line captures the true nature of the relationship between $X$ and $Y$, with no curvature and no discernible pattern left in the residual variation, but that the line provides a good approximation to the true relationship, so that the proportion of residual variation (the part left unexplained by the linear model) is small.
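The variance decomposition described above, and the sign rule, can be checked the same way (a sketch, reusing x, y, and fit from the previous snippet):

res_var <- sum(residuals(fit)^2)    # residual variation about the line
tot_var <- sum((y - mean(y))^2)     # total variation about the mean
1 - res_var / tot_var               # proportion explained, i.e. R^2
sign(coef(fit)[["x"]]) * sqrt(summary(fit)$r.squared)  # recovers cor(x, y)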

Note that had you instead chosen a discrete uniform on e.g. $-100, -99, \dots, 99, 100$ (or rescaled it to lie in $[-1,1]$), you would have found a covariance and correlation of zero, as happens in the linked Math SE thread. There is neither an increasing nor a decreasing relationship.

x=-100:100; y=x^2
plot(x,y)
abline(lm(y~x))

[Figure: scatter plot of uniform x taking negative and positive values against its square, with the fitted regression line]
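You can confirm this numerically; by symmetry the covariance vanishes exactly, so the sample correlation differs from zero only by floating-point rounding:

x <- -100:100
cor(x, x^2)  # zero, up to floating-point rounding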

As an exercise to think through, what would be the correlation between $-1, -2, -3, \dots, -n$ and their squares? You can easily write some R code to confirm your guess.

If all you care about is the existence of an increasing or decreasing relationship, rather than the extent to which it is linear, you can use a rank-based measure such as Kendall's tau or Spearman's rho, as mentioned in Glen_b's answer. For my first graph, which had a perfectly monotonic increasing relationship, both methods would have given the highest possible correlation (one). For the second graph, which is neither increasing nor decreasing, both would give a correlation of zero.
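A minimal sketch of those rank-based measures on the two examples above (cor() in base R computes both):

x1 <- 1:100                         # first graph: perfectly monotonic increasing
cor(x1, x1^2, method = "spearman")  # 1
cor(x1, x1^2, method = "kendall")   # 1

x2 <- -100:100                      # second graph: neither increasing nor decreasing
cor(x2, x2^2, method = "spearman")  # 0
cor(x2, x2^2, method = "kendall")   # 0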