Pearson Correlation – Why It Equals 1 with Only Two Data Values

Tags: correlation, matrix, pearson-r, r

I am trying to obtain a Pearson correlation between 6 different variables (represented by columns in the matrix below) with two datapoints each (rows).
This is the matrix:

     scer       bay      par       mik      glab       lac
var1 2.2273444 2.0923416 2.044007 1.7664921 1.3832924 2.4294228
var2 0.3000878 0.2792936 0.286928 0.3246768 0.4946222 0.3083171 

When I apply the standard R code for correlation:

cor(mat)

I obtain the following result:

     scer bay par mik glab lac
scer    1   1   1   1    1   1
bay     1   1   1   1    1   1
par     1   1   1   1    1   1
mik     1   1   1   1    1   1
glab    1   1   1   1    1   1
lac     1   1   1   1    1   1

If I add another two rows to the original matrix:

                scer       bay       par       mik      glab       lac
var1    2.2273444 2.0923416 2.0440068 1.7664921 1.3832924 2.4294228
var2    0.3000878 0.2792936 0.2869280 0.3246768 0.4946222 0.3083171
var3    1.1399738 1.2899311 1.1071462 1.0180361 1.4507592 2.4078977
var4    0.7107440 0.6415944 0.7197905 0.7357125 0.4571745 0.3173547

and re-execute the above code with the new matrix, I obtain a more familiar result:

          scer       bay       par       mik      glab       lac
scer 1.0000000 0.9895959 0.9991065 0.9967358 0.7860344 0.8246286
bay  0.9895959 1.0000000 0.9916464 0.9890492 0.8647974 0.8958393
par  0.9991065 0.9916464 1.0000000 0.9991332 0.7928330 0.8310776
mik  0.9967358 0.9890492 0.9991332 1.0000000 0.7845007 0.8235245
glab 0.7860344 0.8647974 0.7928330 0.7845007 1.0000000 0.9978420
lac  0.8246286 0.8958393 0.8310776 0.8235245 0.9978420 1.0000000

Why does the correlation function return a matrix of 1s if I use 2 values?

Best Answer

Correlation, meaning Pearson correlation, can be thought of as a numerical answer to the question: How strong is the linear relationship between two variables, and in which direction does it run?

If you have two distinct data points, the only possible correlation result is $+1$ or $-1$, because two such points define a perfect linear relationship.

This matches the observation that a straight line can be found to interpolate two distinct points exactly.

The only choice is between a rising and a falling straight line, which give $+1$ or $-1$ respectively.
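The same conclusion can be checked algebraically. The following is a sketch of the standard Pearson formula specialised to $n = 2$; the labels $(x_1, y_1)$ and $(x_2, y_2)$ for the two observations are notation introduced here, not taken from the question. With two points, $\bar{x} = (x_1 + x_2)/2$, so the deviations are $x_1 - \bar{x} = (x_1 - x_2)/2$ and $x_2 - \bar{x} = -(x_1 - x_2)/2$, and likewise for $y$. Substituting:

$$
r = \frac{\sum_{i=1}^{2} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{2} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{2} (y_i - \bar{y})^2}}
  = \frac{\tfrac{1}{2}(x_1 - x_2)(y_1 - y_2)}{\tfrac{1}{\sqrt{2}}\,|x_1 - x_2| \cdot \tfrac{1}{\sqrt{2}}\,|y_1 - y_2|}
  = \frac{(x_1 - x_2)(y_1 - y_2)}{|x_1 - x_2|\,|y_1 - y_2|}
$$

The last expression is just the sign of $(x_1 - x_2)(y_1 - y_2)$, so $r = +1$ when both variables move in the same direction between the two observations and $r = -1$ when they move in opposite directions.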

(If your two points have the same value on either of the two variables, that variable has zero variance and the correlation is undefined; R's `cor()` returns `NA` with a warning in that case.)
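As a rough illustration, the three cases can be reproduced in R. The matrix is the two-row data from the question; the vectors `x`, and the names `r_all`, `r_rising`, `r_falling` and `r_flat`, are made up here purely for demonstration. This is a minimal sketch, not part of the original answer.

```r
# Data from the question: 2 observations (rows) of 6 variables (columns).
mat <- matrix(
  c(2.2273444, 2.0923416, 2.044007, 1.7664921, 1.3832924, 2.4294228,
    0.3000878, 0.2792936, 0.286928, 0.3246768, 0.4946222, 0.3083171),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("var1", "var2"),
                  c("scer", "bay", "par", "mik", "glab", "lac"))
)

# Every column falls from var1 to var2, so each pair of columns
# moves in the same direction and every correlation is +1.
r_all <- cor(mat)
print(r_all)

# Hypothetical two-point vectors showing the other possible outcomes.
x <- c(1, 2)
r_rising  <- cor(x, c(3, 5))                    # both rise: +1
r_falling <- cor(x, c(5, 3))                    # one rises, one falls: -1
r_flat    <- suppressWarnings(cor(x, c(4, 4)))  # zero variance: NA

print(c(rising = r_rising, falling = r_falling, flat = r_flat))
```

In the question's data the matrix contains only $+1$ rather than a mix of $+1$ and $-1$ because all six variables happen to decrease from `var1` to `var2`.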

In scientific terms, a correlation involving just two points is useless by itself.
