Solved – Basis of Pearson correlation coefficient

correlation, pearson-r

The Pearson correlation coefficient is calculated using the formula $r = \frac{cov(X,Y)}{\sqrt{var(X)} \sqrt{var(Y)}}$. How does this formula capture whether the two variates $X$ and $Y$ are correlated or not? Or, how do we arrive at this formula for the correlation coefficient?

Best Answer

What matters is $cov(X,Y)$. The denominator $\sqrt{var(X)var(Y)}$ serves two purposes. It removes the units of measurement (if, say, $X$ is measured in meters and $Y$ in kilograms, then $cov(X,Y)$ is measured in meter-kilograms, which is hard to interpret), and it standardizes the result ($cor(X,Y)$ lies between -1 and 1 whatever values the variables take).
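To see the role of the denominator concretely, here is a minimal sketch in plain Python. The height and weight numbers are made up for illustration, and the helper names `mean`, `cov` and `cor` are my own, not from any library. Changing the unit of $X$ from meters to centimeters multiplies the covariance by 100 but leaves the correlation unchanged.

```python
import math

def mean(values):
    return sum(values) / len(values)

def cov(xs, ys):
    """Sample covariance with the n-1 denominator."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def cor(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    # cov(xs, xs) is the sample variance of xs.
    return cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))

# Made-up illustrative data (an assumption, not from the original post).
height_m = [1.60, 1.70, 1.75, 1.80, 1.90]
weight_kg = [55.0, 65.0, 72.0, 74.0, 90.0]

# Same heights expressed in centimeters.
height_cm = [100 * h for h in height_m]

cov_m = cov(height_m, weight_kg)    # units: meter-kilograms
cov_cm = cov(height_cm, weight_kg)  # units: centimeter-kilograms
r_m = cor(height_m, weight_kg)      # unit-free
r_cm = cor(height_cm, weight_kg)    # unit-free

print("covariance (m, kg): ", cov_m)
print("covariance (cm, kg):", cov_cm)
print("correlation (m):    ", r_m)
print("correlation (cm):   ", r_cm)
```

The covariance depends on the units chosen, while the two correlation values agree up to floating-point rounding.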

Now back to $cov(X,Y)$. It shows how the variables vary together about their means, hence co-variance. Let us take an example.

[Figure: scatter plot of the sample points $(X_i, Y_i)$]

Lines are drawn at the sample means $\bar X$ and $\bar Y$. The points in the upper right corner are where both $X_i$ and $Y_i$ are above their means, so both $(X_i-\bar X)$ and $(Y_i-\bar Y)$ are positive. The points in the lower left corner are where both are below their means, so both differences are negative. In both cases the product $(X_i-\bar X)(Y_i-\bar Y)$ is positive. In contrast, the upper left and lower right corners are the areas where this product is negative, because one difference is positive and the other is negative.
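The quadrant argument can be checked point by point. The following is a small sketch on made-up data (the values and the helper `quadrant` are assumptions for illustration, not the data behind the original figure). It labels each point by its position relative to the mean lines and prints the sign of the product.

```python
# Made-up sample (an assumption, not the data from the original figure).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 1.5, 3.5, 3.0, 5.5, 5.0]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

def quadrant(x, y):
    """Name the corner of the plot, relative to the lines at the means."""
    dx, dy = x - x_bar, y - y_bar
    if dx > 0 and dy > 0:
        return "upper right"
    if dx < 0 and dy < 0:
        return "lower left"
    if dx < 0 and dy > 0:
        return "upper left"
    if dx > 0 and dy < 0:
        return "lower right"
    return "on a mean line"

# One (quadrant, product) pair per point.
rows = [(quadrant(x, y), (x - x_bar) * (y - y_bar)) for x, y in zip(xs, ys)]

for (x, y), (name, product) in zip(zip(xs, ys), rows):
    sign = "+" if product > 0 else "-" if product < 0 else "0"
    print(f"({x}, {y}): {name:12s} product sign {sign}")
```

Points in the upper right and lower left print `+`, and points in the upper left and lower right print `-`.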

Now, when computing the covariance $cov(X,Y)=\frac1{n-1}\sum_{i=1}^n(X_i-\bar X)(Y_i-\bar Y)$ in this example, the points that give positive products $(X_i-\bar X)(Y_i-\bar Y)$ dominate, resulting in a positive covariance. For a given spread of $X$ and $Y$, this covariance is larger the closer the points lie to an imaginary straight line passing through the point $(\bar X,\bar Y)$.
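As a rough illustration of positive products dominating, the sketch below evaluates the covariance formula on a small made-up sample (the numbers are an assumption, not the data from the original figure) and separates the positive from the negative contributions to the sum.

```python
# Made-up sample with a mostly increasing trend (an assumption).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 1.5, 3.5, 3.0, 5.5, 5.0]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# One product (X_i - X_bar)(Y_i - Y_bar) per point.
products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]

positive_part = sum(p for p in products if p > 0)
negative_part = sum(p for p in products if p < 0)

# Sample covariance: sum of all products divided by n - 1.
cov_xy = sum(products) / (n - 1)

print("sum of positive products:", positive_part)
print("sum of negative products:", negative_part)
print("sample covariance:       ", cov_xy)
```

Here the positive products outweigh the negative ones, so the covariance comes out positive.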

As a last note, covariance only reflects the linear part of a relationship. If the relationship is nonlinear, covariance may fail to detect it, even when $Y$ is completely determined by $X$.
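A standard illustration of this limitation, sketched here with made-up values: take $X$ symmetric about zero and $Y = X^2$. Then $Y$ is a deterministic function of $X$, yet the positive and negative products cancel and the sample covariance is zero.

```python
# X values symmetric about zero (an illustrative choice), Y = X squared.
xs = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
ys = [x ** 2 for x in xs]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Products on the left half of the parabola mirror those on the right
# half with opposite sign, so they cancel in the sum.
products = [(x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)]
cov_xy = sum(products) / (n - 1)

print("products:         ", products)
print("sample covariance:", cov_xy)
```

Since the covariance is zero, the correlation coefficient is zero as well, although the relationship between the two variables is perfect.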