Solved – Why does the phi coefficient approximate Pearson's correlation?

correlation, r

Going through the Wiki article on the Phi coefficient, I've noticed that for paired binary data "a Pearson correlation coefficient estimated for two binary variables will return the phi coefficient".

Upon running a quick simulation, I found this not to be the case. However, it appears that the phi coefficient does approximate Pearson's correlation coefficient.

x <- c(1,   1,  0,  0,  1,  0,  1,  1,  1)
y <- c(1,   1,  0,  0,  0,  0,  1,  1,  1)
cor(x,y)
sqrt(chisq.test(table(x,y))$statistic/length(x)) # phi

x <- rep(x, 1000)
y <- rep(y, 1000)
sqrt(chisq.test(table(x,y))$statistic/length(x)) # phi
# it now DOES approximate Pearson's correlation
cor(x,y)

But it is not apparent to me why (mathematically) this is the case.

Best Answer

By default, chisq.test() applies a continuity correction when computing the test statistic for 2x2 tables. If you switch off this behavior, then:

x = c(1,  1,  0,  0,  1,  0,  1,  1,  1)
y = c(1,  1,  0,  0,  0,  0,  1,  1,  1)
cor(x,y)
sqrt(chisq.test(table(x,y), correct=FALSE)$statistic/length(x)) # phi

will give you exactly the same answer (up to sign, since the square root discards the sign of the correlation). This also explains why $\sqrt{\chi^2/n}$ with the continuity correction approximates cor(x,y): as $n$ increases, the continuity correction has less and less influence on the result.
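The claim above can be checked numerically. The following is a minimal sketch using the data from the question; `phi_from_counts` and `phi_chisq` are hypothetical helper names introduced here for illustration, not functions from base R. It computes phi directly from the 2x2 cell counts, then compares cor(x,y) with $\sqrt{\chi^2/n}$ with and without the correction as the data are replicated.

```r
# Data from the question
x <- c(1, 1, 0, 0, 1, 0, 1, 1, 1)
y <- c(1, 1, 0, 0, 0, 0, 1, 1, 1)

# Hypothetical helper: signed phi computed from the 2x2 cell counts
phi_from_counts <- function(x, y) {
  tab <- table(x, y)
  # Convert to double to avoid integer overflow for large counts
  n00 <- as.numeric(tab[1, 1]); n01 <- as.numeric(tab[1, 2])
  n10 <- as.numeric(tab[2, 1]); n11 <- as.numeric(tab[2, 2])
  (n11 * n00 - n10 * n01) /
    sqrt(prod(rowSums(tab)) * prod(colSums(tab)))
}

# Hypothetical helper: sqrt(chi^2 / n), with or without Yates's correction
phi_chisq <- function(x, y, correct) {
  # Small expected counts trigger a warning that is irrelevant here
  stat <- suppressWarnings(
    chisq.test(table(x, y), correct = correct)$statistic
  )
  sqrt(unname(stat) / length(x))
}

# Replicate the data k times and compare the three quantities
for (k in c(1L, 10L, 100L, 1000L)) {
  xk <- rep(x, k)
  yk <- rep(y, k)
  cat(sprintf("k = %4d  cor = %.5f  uncorrected = %.5f  corrected = %.5f\n",
              k, cor(xk, yk),
              phi_chisq(xk, yk, correct = FALSE),
              phi_chisq(xk, yk, correct = TRUE)))
}
```

The first two columns should agree for every k, while the corrected value should move toward them as k grows.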

The continuity correction is known as Yates's correction for continuity.
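To make the size of the correction explicit, here is a short derivation sketch. The cell labels are an assumption introduced for illustration: $a = n_{11}$, $b = n_{10}$, $c = n_{01}$, $d = n_{00}$, with $n = a + b + c + d$. For two binary variables, Pearson's correlation reduces to

$$r = \phi = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}}.$$

In a 2x2 table every $|O - E|$ equals $|ad - bc|/n$, so subtracting one half from each (assuming $|ad - bc| \ge n/2$, so the correction is not truncated) gives

$$\chi^2_{\text{Yates}} = \frac{n\,\bigl(|ad - bc| - n/2\bigr)^2}{(a+b)(c+d)(a+c)(b+d)},$$

and therefore

$$\sqrt{\chi^2_{\text{Yates}}/n} = |\phi| - \frac{n}{2\sqrt{(a+b)(c+d)(a+c)(b+d)}}.$$

If the cell proportions stay fixed while $n$ grows, the square root in the denominator grows like $n^2$, so the second term shrinks like $1/n$. For the data in the question ($a = 5$, $b = 1$, $c = 0$, $d = 3$, $n = 9$), $|\phi| = 15/\sqrt{360} \approx 0.7906$ and the correction term is $9/(2\sqrt{360}) \approx 0.2372$, giving about $0.5534$; replicating the data 1000 times divides the correction term by 1000.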