Assumptions of correlation coefficient

Tags: assumptions, correlation, pearson-r

What are the assumptions for the proper use and interpretation of Pearson's correlation coefficient?

How is $\DeclareMathOperator{\cov}{cov}\cov(X,Y)$ affected by (mild) deviations from linearity?

How is $\cov(X,Y)$ influenced by the presence of heteroscedasticity?

Best Answer

The only real assumption of Pearson's correlation is that the variables are measured at the interval level. Tests of whether the correlation is 0 carry additional assumptions, but the coefficient itself is simply a descriptive statistic: the correlation is the correlation.
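For reference, the usual textbook definition of the sample coefficient is given below. The notation ($x_i$, $y_i$, $\bar{x}$, $\bar{y}$, $n$) is introduced here for illustration and does not come from the question. Nothing in the formula requires normality, linearity, or constant variance in order to be computed; those conditions only affect how the resulting number should be interpreted or tested.

$$
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
$$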

However, the correlation only measures the linear relationship between X and Y. So, while the correlation doesn't assume anything about the variables, it can be misleading in some cases and for some purposes. See Anscombe's quartet for some extreme examples.
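A minimal sketch of this point, using made-up data rather than anything from the question: `pearson_r` is a hypothetical helper written out by hand, and the two `ys_*` series are illustrative. The quadratic series is a perfect deterministic function of `xs`, yet its linear correlation is zero because the relationship is symmetric around the mean of `xs`.

```python
import math


def pearson_r(xs, ys):
    """Sample Pearson correlation, computed directly from the definition."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Illustrative data: integers from -5 to 5.
xs = list(range(-5, 6))
ys_linear = [2 * x + 1 for x in xs]    # exactly linear in x
ys_quadratic = [x ** 2 for x in xs]    # exactly determined by x, but not linear

r_linear = pearson_r(xs, ys_linear)
r_quadratic = pearson_r(xs, ys_quadratic)

print(f"linear:    r = {r_linear:.3f}")
print(f"quadratic: r = {r_quadratic:.3f}")
```

Both series are perfectly predictable from `xs`, but only the linear one is picked up by the coefficient, which is why a scatter plot is worth checking before interpreting `r`.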

It is similar to the case with the mean: The arithmetic mean doesn't assume anything about the variables except, again, that they are interval. But the mean can be a misleading choice of measure for central location in some cases.
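As a small illustration of the same idea for the mean, the sketch below uses invented numbers (the `values` list is hypothetical) in which a single extreme observation pulls the arithmetic mean far away from where most of the data sit, while the median stays put.

```python
import statistics

# Illustrative data: five similar values and one extreme observation.
values = [30, 32, 35, 38, 40, 1000]

mean_value = statistics.mean(values)
median_value = statistics.median(values)

print(f"mean   = {mean_value:.2f}")
print(f"median = {median_value:.2f}")
```

The mean is computed correctly in both senses: nothing was violated. Whether it is the right summary depends on whether you care about the total (where the extreme value matters) or about a typical observation (where it does not).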

However, the mean or the correlation is not necessarily a poor choice even with oddly distributed data: it depends on what you are trying to measure.

The central lesson is that it is always good to graph your data first. As Yogi Berra said, "You can see a lot by looking."