Solved – Pearson correlation coefficient is a measure of linear correlation – proof

correlation, pearson-r

Is it possible to prove that the Pearson coefficient is a measure of linear correlation between two variables?

I think I've seen proofs somewhere that if the relationship between the variables $X$ and $Y$ is given by a linear equation $Y = aX + b$, then the absolute value of the Pearson coefficient equals $1$.

How can one show that $0$ means they are not correlated at all, and that an intermediate value like $0.4$ indicates some stronger degree of correlation, whatever that means? I think we must first define what is meant by stronger and weaker linear correlation between variables (and my question only makes sense if the Pearson coefficient itself is not that definition).

For instance, the expected value is defined by a certain formula, and the law of large numbers in a sense 'proves' its intuitive meaning. I'm not aware of an analogous proof that the variance measures the spread of data, or that the Pearson coefficient measures linear correlation.

Best Answer

It is indeed possible to show that the Pearson correlation is essentially the way to measure linearity of association when you elect to use standard deviations to measure the dispersion of random variables.


Let's begin by noting that the Pearson correlation $\rho$ has to be considered as a (rather large) equivalence class of properties of bivariate random variables, because any invertible monotonic re-expression of it (such as its exponential) will carry identical information.

The question concerns what the term "linear relationship" might mean for random variables $(X,Y)$ that are not collinear. An important special case of this occurs when $(X,Y)$ is the empirical distribution of any bivariate dataset (which we may think of in terms of its scatterplot): are there natural ways to measure the departure of such a dataset (or scatterplot) from "linearity"?

Note that any invertible linear re-expression of the variables

$$(\xi,\eta) = (aX + b, cY + d)$$

(for constants $a, b, c, d$ where $a$ and $c$ are positive) will not change the linearity of their relationship (or lack thereof). We may therefore adopt some measure of the "center" of a variable (such as its mean or median) and a measure of its "dispersion" around that center (such as its standard deviation or interquartile range) and stipulate that $a, b, c, d$ be chosen to place the centers of $\xi$ and $\eta$ at $0$ and to scale their dispersions to unity.

If $(X,Y)$ are linearly related, this initial standardization will cause the support of $(\xi,\eta)$ to lie either on the line $\xi=\eta$ (for a positive relationship) or on the line $\xi=-\eta$ (for a negative relationship). In the former case, the dispersion of $\xi-\eta$ is a natural measure of deviation from the line, while in the latter case the dispersion of $\xi+\eta$ measures the deviation from the line. As a quantitative measure of linearity, we may therefore compare one of these two quantities to the other: the more lopsided that comparison (one dispersion much larger than the other), the more linear the original relationship between $X$ and $Y$.
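To make this concrete, here is a minimal numerical sketch (Python with NumPy; the simulated data and helper name `standardized_dispersions` are purely illustrative, not part of the argument above). It shows that after standardization a strongly positive relationship leaves $\xi-\eta$ with small dispersion, a strongly negative one leaves $\xi+\eta$ with small dispersion, and a weak relationship leaves the two comparable:

```python
import numpy as np

rng = np.random.default_rng(0)

def standardized_dispersions(x, y):
    """Center and scale each variable, then return (SD(xi - eta), SD(xi + eta))."""
    xi = (x - x.mean()) / x.std()
    eta = (y - y.mean()) / y.std()
    return (xi - eta).std(), (xi + eta).std()

x = rng.normal(size=10_000)
noise = rng.normal(size=x.size)

# Strongly positive relationship: SD(xi - eta) is small.
print(standardized_dispersions(x, 2 * x + 0.1 * noise))
# Strongly negative relationship: SD(xi + eta) is small.
print(standardized_dispersions(x, -2 * x + 0.1 * noise))
# Weak relationship: the two dispersions are comparable.
print(standardized_dispersions(x, noise))
```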

Because two positive quantities are to be compared and a universal (unitless) measure is sought, the simplest way to make the comparison is the ratio. While the distributions of ratios tend to be skewed, the distributions of their logarithms tend not to be. Moreover, although a ratio cannot be negative, its logarithm potentially could be any real number.

Accordingly, linearity of relationship ought to be measured as the log ratio of dispersions of the sum and difference of the standardized variables.


As an example, consider using the first two moments to measure the center (the mean) and the dispersion (the standard deviation). The associated measure of linearity of $(X,Y)$ is

$$Z = \log \frac{\operatorname{SD}(\xi + \eta)}{\operatorname{SD} (\xi-\eta)} = \frac{1}{2}\log\frac{\operatorname{Var}(\xi+\eta)}{\operatorname{Var}(\xi-\eta)} = \frac{1}{2}\log \frac{2 + 2\rho}{2 - 2\rho} = \frac{1}{2}\log \frac{1 + \rho}{1 - \rho}$$

where $\rho$ is the Pearson correlation of $(X,Y)$. (The third equality holds because $\xi$ and $\eta$ each have unit variance, so $\operatorname{Var}(\xi \pm \eta) = \operatorname{Var}(\xi) + \operatorname{Var}(\eta) \pm 2\operatorname{Cov}(\xi,\eta) = 2 \pm 2\rho$.) This expression for $Z$ is recognizable as the Fisher transformation of $\rho$, and is therefore equivalent to $\rho$ for assessing linearity. It is pleasing to see it drop out automatically from such basic principles.
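As a quick numerical check (a sketch with simulated data, not part of the derivation itself), the log ratio of dispersions computed directly from standardized data coincides, up to floating-point error, with the Fisher transformation $\operatorname{arctanh}(\rho)$ of the sample correlation; the correlation value and sample size below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate bivariate normal data with a chosen correlation.
rho_true = 0.6
cov = [[1.0, rho_true], [rho_true, 1.0]]
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=100_000).T

# Standardize to zero mean and unit standard deviation.
xi = (x - x.mean()) / x.std()
eta = (y - y.mean()) / y.std()

# Z from the dispersion comparison ...
z = np.log((xi + eta).std() / (xi - eta).std())

# ... agrees with the Fisher transformation of the sample Pearson correlation.
rho_hat = np.corrcoef(x, y)[0, 1]
print(z, np.arctanh(rho_hat))  # identical up to rounding error
```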


This derivation has shown that the Pearson correlation coefficient is the natural way to measure the linearity of any bivariate distribution $(X,Y)$ when the first two moments are employed to evaluate central tendency and dispersion of variables.

One can go further and demonstrate that, among all the possible invertible monotonic transformations of $Z$, $\rho = \tanh(Z)$ enjoys a special relationship to measures of linearity in simple ordinary least squares (OLS) regression: $\rho^2$ is identical to the coefficient of determination, $R^2$, both in the regression of $Y$ against $X$ and in the regression of $X$ against $Y$. This is why $\rho$, rather than $Z$, is most often used, even in non-regression settings.
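A brief numerical illustration of that last claim (again a sketch with simulated data; the helper `ols_r_squared` is a hypothetical name introduced here): the squared sample correlation matches the $R^2$ of the simple OLS regression in either direction.

```python
import numpy as np

rng = np.random.default_rng(2)

x = rng.normal(size=50_000)
y = 1.5 * x + rng.normal(size=x.size)  # noisy linear relationship

def ols_r_squared(predictor, response):
    """R^2 of a simple OLS regression of `response` on `predictor`."""
    slope, intercept = np.polyfit(predictor, response, deg=1)
    residuals = response - (slope * predictor + intercept)
    return 1.0 - residuals.var() / response.var()

rho = np.corrcoef(x, y)[0, 1]
print(rho**2)               # squared Pearson correlation
print(ols_r_squared(x, y))  # R^2 regressing y on x
print(ols_r_squared(y, x))  # R^2 regressing x on y -- all three agree
```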
