Point-Biserial vs Pearson’s Correlation – Comparing Correlation Methods in R

correlation r

I know that when looking at the correlation between a binary variable and a continuous variable, we should use the point-biserial correlation.

Today I was looking at some data and mistakenly used Pearson's. When I ran the point-biserial correlation instead, the coefficient had the same magnitude as Pearson's but the opposite sign, which seemed very strange to me.

mydata <- structure(list(x1 = c(1L, 4L, 1L, 2L, 5L, 6L, 3L, 1L, 5L, 5L, 
                                6L, 6L, 1L, 5L, 5L, 1L, 6L, 5L, 5L, 6L, 3L, 6L, 2L, 2L, 6L, 4L, 
                                1L, 6L, 4L, 1L, 6L, 6L, 6L, 2L, 5L, 2L, 6L, 6L, 6L, 6L, 6L, 5L, 
                                1L, 1L, 6L, 4L, 5L, 5L, 4L, 6L, 5L, 4L, 5L, 5L, 6L, 6L, 2L, 3L, 
                                6L, 5L, 2L, 2L, 3L), 
                         x2 = c(FALSE, TRUE, FALSE, FALSE, TRUE, 
                              TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, FALSE, 
                              TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, FALSE, 
                              FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE, FALSE, 
                              FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, 
                              TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, TRUE, 
                              FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, FALSE)), class = "data.frame", row.names = c(NA, -63L))

Then:

> cor(mydata$x1, mydata$x2)
[1] 0.07888117

> ltm::biserial.cor(mydata$x1, mydata$x2)
[1] -0.07888117

Is this expected, or am I missing something?

Best Answer

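Yes, this is expected. Pearson's product-moment correlation coefficient and the point-biserial correlation coefficient are identical when both calculations use the same reference level (category) of the binary variable. Here cor() treats TRUE as 1, whereas ltm::biserial.cor() defaults to level = 1, the first level of the binary variable (FALSE), so the two calls use opposite reference levels and the coefficients differ only in sign.
For your data we get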

> cor(mydata$x1, mydata$x2)
[1] 0.07888117
> ltm::biserial.cor(mydata$x1, mydata$x2, level = 2)
[1] 0.07888117

or

> cor(mydata$x1, !mydata$x2)
[1] -0.07888117
> ltm::biserial.cor(mydata$x1, mydata$x2, level = 1)
[1] -0.07888117

depending on the reference level.
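
To see why the two coefficients coincide, it may help to compute the point-biserial coefficient by hand from the usual textbook formula (difference in group means, divided by the population standard deviation of the continuous variable, times sqrt(p * (1 - p))). The sketch below is only an illustration: point_biserial() is a hypothetical helper written for this answer, not a function from ltm, and the small data set is made up.

```r
# Hand-rolled point-biserial correlation (illustrative helper, not part of ltm).
# `ref` is the value of the binary variable treated as the "1" group.
point_biserial <- function(x, g, ref) {
  in_ref <- g == ref
  n      <- length(x)
  p      <- mean(in_ref)                       # proportion in the reference group
  m1     <- mean(x[in_ref])                    # mean of x in the reference group
  m0     <- mean(x[!in_ref])                   # mean of x in the other group
  s_pop  <- sqrt(sum((x - mean(x))^2) / n)     # population SD (divides by n)
  (m1 - m0) / s_pop * sqrt(p * (1 - p))
}

# Small made-up data set, used only for illustration
x <- c(2, 5, 3, 8, 7, 4, 6, 1)
g <- c(FALSE, TRUE, FALSE, TRUE, TRUE, FALSE, TRUE, FALSE)

r_true  <- point_biserial(x, g, ref = TRUE)
r_false <- point_biserial(x, g, ref = FALSE)

# Pearson's r on the 0/1 coding matches the TRUE-referenced coefficient
all.equal(r_true, cor(x, as.numeric(g)))

# Swapping the reference level only flips the sign
all.equal(r_false, -r_true)
```

Both all.equal() comparisons should evaluate to TRUE: swapping the reference level swaps the two group means in the numerator and leaves everything else unchanged, so only the sign changes.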