Going through the Wikipedia article on the phi coefficient, I noticed the claim that, for paired binary data, "a Pearson correlation coefficient estimated for two binary variables will return the phi coefficient".
Upon running a quick simulation, I found this not to be the case. However, it appears that the phi coefficient does approximate Pearson's correlation coefficient when the sample is large.
x <- c(1, 1, 0, 0, 1, 0, 1, 1, 1)
y <- c(1, 1, 0, 0, 0, 0, 1, 1, 1)
cor(x,y)
sqrt(chisq.test(table(x,y))$statistic/length(x)) # phi
x <- rep(x, 1000)
y <- rep(y, 1000)
sqrt(chisq.test(table(x,y))$statistic/length(x)) # phi
# it now DOES approximate Pearson's correlation.
cor(x,y)
But it is not apparent to me why (mathematically) this is the case.
Best Answer
By default, chisq.test() applies a continuity correction when computing the test statistic for 2x2 tables. If you switch off this behavior with correct=FALSE, then:
sqrt(chisq.test(table(x,y), correct=FALSE)$statistic/length(x))
will give you exactly the same answer as cor(x,y). This essentially also answers why $\sqrt{\chi^2/n}$ with the continuity correction approximates cor(x,y): as $n$ increases, the continuity correction has less and less influence on the result.
The continuity correction is described here: Yates's correction for continuity
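As a rough check of the point above, here is a minimal sketch that reuses the data from the question. The helper name phi_from_chisq and the replication counts in reps are made up for illustration and are not part of the original post. It compares the uncorrected statistic with cor(x,y), then tracks how the gap left by the continuity correction shrinks as the data are replicated.

```r
# Data from the question
x <- c(1, 1, 0, 0, 1, 0, 1, 1, 1)
y <- c(1, 1, 0, 0, 0, 0, 1, 1, 1)

# Hypothetical helper: phi computed as sqrt(chi^2 / n).
# suppressWarnings() hides the small-expected-count warning from chisq.test().
phi_from_chisq <- function(x, y, correct) {
  stat <- suppressWarnings(chisq.test(table(x, y), correct = correct)$statistic)
  sqrt(unname(stat) / length(x))
}

# Without the continuity correction, phi matches Pearson's r for these data
phi_uncorrected <- phi_from_chisq(x, y, correct = FALSE)
print(all.equal(phi_uncorrected, cor(x, y)))

# With the correction, the gap to cor(x, y) shrinks as the data are replicated
reps <- c(1, 10, 100, 1000)
gap <- sapply(reps, function(k) {
  xk <- rep(x, k)
  yk <- rep(y, k)
  cor(xk, yk) - phi_from_chisq(xk, yk, correct = TRUE)
})
print(data.frame(reps, gap))
```

Note that sqrt() always returns a non-negative value, so for negatively associated variables this quantity would match the absolute value of cor(x,y) rather than cor(x,y) itself.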