Solved – Relationship between the phi, Matthews and Pearson correlation coefficients

Tags: bernoulli-distribution, confusion-matrix, contingency-tables, correlation, model-evaluation

Are the phi and Matthews correlation coefficients the same concept? How are they related or equivalent to the Pearson correlation coefficient for two binary variables? I assume the binary values are 0 and 1.


The Pearson correlation between two Bernoulli random variables $x$ and $y$ is:

$$ \rho = \frac{\mathbb{E} [(x - \mathbb{E}[x])(y - \mathbb{E}[y])]} {\sqrt{\text{Var}[x] \, \text{Var}[y]}}
= \frac{\mathbb{E} [xy] - \mathbb{E}[x] \, \mathbb{E}[y]}{\sqrt{\text{Var}[x] \, \text{Var}[y]}}
= \frac{n_{11} n - n_{1\bullet} n_{\bullet 1}}{\sqrt{n_{0\bullet}n_{1\bullet} n_{\bullet 0}n_{\bullet 1}}} $$

where

$$ \mathbb{E}[x] = \frac{n_{1\bullet}}{n} \quad
\text{Var}[x] = \frac{n_{0\bullet}n_{1\bullet}}{n^2} \quad
\mathbb{E}[y] = \frac{n_{\bullet 1}}{n} \quad
\text{Var}[y] = \frac{n_{\bullet 0}n_{\bullet 1}}{n^2} \quad
\mathbb{E}[xy] = \frac{n_{11}}{n} $$

Substituting these expressions and multiplying the numerator and denominator by $n^2$ gives the count form on the right-hand side above.
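As a quick numerical check, here is a minimal R sketch (the cell counts are made up purely for illustration) that evaluates the count form and compares it with cor() applied to the reconstructed 0/1 observations:

# A made-up 2 x 2 table of counts: first index is x, second is y
n00 <- 20; n01 <- 5; n10 <- 8; n11 <- 30
n   <- n00 + n01 + n10 + n11
n0. <- n00 + n01; n1. <- n10 + n11      # row margins (x)
n.0 <- n00 + n10; n.1 <- n01 + n11      # column margins (y)

# Pearson correlation written in terms of the counts, as in the formula above
rho.counts <- (n11 * n - n1. * n.1) / sqrt(n0. * n1. * n.0 * n.1)

# The same quantity from the raw 0/1 observations
x <- rep(c(0, 0, 1, 1), c(n00, n01, n10, n11))
y <- rep(c(0, 1, 0, 1), c(n00, n01, n10, n11))
all.equal(rho.counts, cor(x, y))        # TRUE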


Phi coefficient from Wikipedia:

In statistics, the phi coefficient (also referred to as the "mean square contingency coefficient" and denoted by $\phi$ or $r_\phi$) is a measure of association for two binary variables introduced by Karl Pearson. This measure is similar to the Pearson correlation coefficient in its interpretation. In fact, a Pearson correlation coefficient estimated for two binary variables will return the phi coefficient.

If we have a 2 × 2 table for two random variables $x$ and $y$

$$ \begin{array}{c|cc|c}
 & y = 0 & y = 1 & \text{total} \\
\hline
x = 0 & n_{00} & n_{01} & n_{0\bullet} \\
x = 1 & n_{10} & n_{11} & n_{1\bullet} \\
\hline
\text{total} & n_{\bullet 0} & n_{\bullet 1} & n
\end{array} $$

The phi coefficient that describes the association of $x$ and $y$ is
$$ \phi = \frac{n_{11}n_{00} - n_{10}n_{01}}{\sqrt{n_{1\bullet}n_{0\bullet}n_{\bullet0}n_{\bullet1}}} $$
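The numerator is the same as in the count form of the Pearson correlation above, since expanding the margins gives $n_{11}n - n_{1\bullet}n_{\bullet 1} = n_{11}n_{00} - n_{10}n_{01}$. Reusing the made-up counts from the sketch above (same variable names), a short R check:

phi <- (n11 * n00 - n10 * n01) / sqrt(n1. * n0. * n.0 * n.1)
all.equal(phi, rho.counts)              # TRUE: phi equals the Pearson correlation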

Matthews correlation coefficient from Wikipedia:

The Matthews correlation coefficient (MCC) can be calculated directly from the confusion matrix using the formula:
$$ \text{MCC} = \frac{ TP \times TN - FP \times FN } {\sqrt{ (TP + FP) (TP + FN) (TN + FP) (TN + FN) } } $$

In this equation, TP is the number of true positives, TN the number of true negatives, FP the number of false positives and FN the number of false negatives. If any of the four sums in the denominator is zero, the denominator can be arbitrarily set to one; this results in a Matthews correlation coefficient of zero, which can be shown to be the correct limiting value.
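A minimal R sketch of this formula (the function name mcc and its arguments are illustrative, not from any package), including the zero-denominator convention described in the quote:

# Matthews correlation coefficient from the four confusion-matrix counts.
# Returns 0 when the product in the denominator is zero, per the convention above.
mcc <- function(TP, TN, FP, FN) {
  num <- TP * TN - FP * FN
  den <- (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)
  if (den == 0) return(0)
  num / sqrt(den)
}

mcc(TP = 5, TN = 17, FP = 2, FN = 3)    # 0.5415534, the example used below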

Best Answer

Yes, they are the same. The Matthews correlation coefficient is just a particular application of the Pearson correlation coefficient to a confusion table.

A contingency table is just a summary of underlying data. You can convert the counts shown in the contingency table back into one row per observation.

Consider the example confusion matrix used in the Wikipedia article, with 5 true positives, 17 true negatives, 2 false positives and 3 false negatives:

> matrix(c(5,3,2,17), nrow=2, byrow=TRUE)
     [,1] [,2]
[1,]    5    3
[2,]    2   17
> 
> # Matthews correlation coefficient directly from the Wikipedia formula
> (5*17-3*2) / sqrt((5+3)*(5+2)*(17+3)*(17+2))
[1] 0.5415534
> 
> 
> # Convert this into a long form binary variable and find the correlation coefficient
> conf.m <- data.frame(
+ X1=rep(c(0,1,0,1), c(5,3,2,17)),
+ X2=rep(c(0,0,1,1), c(5,3,2,17)))
> conf.m # what does that look like?
   X1 X2
1   0  0
2   0  0
3   0  0
4   0  0
5   0  0
6   1  0
7   1  0
8   1  0
9   0  1
10  0  1
11  1  1
12  1  1
13  1  1
14  1  1
15  1  1
16  1  1
17  1  1
18  1  1
19  1  1
20  1  1
21  1  1
22  1  1
23  1  1
24  1  1
25  1  1
26  1  1
27  1  1
> cor(conf.m)
          X1        X2
X1 1.0000000 0.5415534
X2 0.5415534 1.0000000
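
For completeness, plugging the same 2 × 2 table into the phi formula reproduces the value once more, so all three coefficients coincide on this example:

# The phi formula applied directly to the same 2 x 2 table
m <- matrix(c(5, 3, 2, 17), nrow = 2, byrow = TRUE)
(m[1, 1] * m[2, 2] - m[1, 2] * m[2, 1]) /
  sqrt(sum(m[1, ]) * sum(m[2, ]) * sum(m[, 1]) * sum(m[, 2]))
# [1] 0.5415534 again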