Correlation – Identifying the Name of This Correlation/Association Measure Between Binary Variables

Tags: association-measure, binary data, correlation

There are several measures of association (or contingency, or correlation) between two binary random variables $X$ and $Y$, the phi coefficient among others.

I wonder how the following number $\kappa$ relates to known measures, whether it is statistically interesting, and under which name it is (possibly) discussed:

$$\kappa = 1 - \frac{2}{N}\,|X \triangle Y|$$

with $|X \triangle Y|$ the number of samples having property $X$ or property $Y$ but not both (exclusive OR, symmetric difference), and $N$ the total number of samples. Like the phi coefficient, $\kappa = \pm 1$ indicates perfect agreement or disagreement, and $\kappa = 0$ indicates no relationship.
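As a quick sketch (not part of the original question), $\kappa$ can be computed directly from two binary vectors; the helper name `kappa` is mine:

```python
import numpy as np

def kappa(x, y):
    # kappa = 1 - (2/N) * |X xor Y|: one minus twice the
    # proportion of samples where exactly one property holds.
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    return 1.0 - 2.0 * np.sum(x ^ y) / x.size

print(kappa([1, 0, 1, 0], [1, 0, 1, 0]))  # perfect agreement -> 1.0
print(kappa([1, 0, 1, 0], [0, 1, 0, 1]))  # perfect disagreement -> -1.0
```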

Best Answer

Using the a, b, c, d convention of the fourfold table, as here:

                Y
              1   0
            ---------
       1    | a | b |
  X         ---------
       0    | c | d |
            ---------
a = number of cases where both X and Y are 1
b = number of cases where X is 1 and Y is 0
c = number of cases where X is 0 and Y is 1
d = number of cases where X and Y are both 0
a+b+c+d = n, the total number of cases.
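The four counts can be obtained from raw binary vectors; a minimal sketch (the helper name `fourfold` is mine, not the answer's):

```python
import numpy as np

def fourfold(x, y):
    # Return (a, b, c, d): the counts of the 2x2 table for two binary vectors.
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    a = int(np.sum(x & y))    # both 1
    b = int(np.sum(x & ~y))   # X is 1, Y is 0
    c = int(np.sum(~x & y))   # X is 0, Y is 1
    d = int(np.sum(~x & ~y))  # both 0
    return a, b, c, d
```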

Substitute into the formula for $\kappa$ and get:

$1-\frac{2(b+c)}{n} = \frac{n-2b-2c}{n} = \frac{(a+d)-(b+c)}{a+b+c+d}$, which is the Hamann similarity coefficient. You can meet it, e.g., here. To cite:

Hamann similarity measure. This measure gives the probability that a characteristic has the same state in both items (present in both or absent from both) minus the probability that a characteristic has different states in the two items (present in one and absent from the other). HAMANN has a range of −1 to +1 and is monotonically related to Simple Matching similarity (SM), Sokal & Sneath similarity 1 (SS1), and Rogers & Tanimoto similarity (RT).

You might want to compare the Hamann formula with that of the phi correlation (which you mention), written in a, b, c, d terms. Both are "correlation" measures ranging from -1 to 1. But note that Phi's numerator, $ad-bc$, drives the coefficient toward 1 only when both a and d are large (or toward -1 when both b and c are large): it is a product. In other words, Pearson correlation, and especially its dichotomous-data incarnation, Phi, is sensitive to the symmetry of the marginal distributions in the data. Hamann's numerator, $(a+d)-(b+c)$, has sums in place of products and is not sensitive to it: either of the two summands in a pair being large is enough for the coefficient to come close to 1 (or -1). Thus, if you want a "correlation" (or quasi-correlation) measure that is insensitive to the shape of the marginal distributions, choose Hamann over Phi.
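To make the comparison concrete, here is a sketch of both coefficients in a, b, c, d terms (the function names are mine). Phi's denominator is the square root of the product of the four marginal totals:

```python
import math

def hamann(a, b, c, d):
    # Matches minus mismatches, as a proportion of all cases.
    return ((a + d) - (b + c)) / (a + b + c + d)

def phi(a, b, c, d):
    # (ad - bc) over sqrt of the product of the four marginal totals.
    return (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
```

With a = d and small b = c both measures agree; skewing the a/d split changes Phi but leaves Hamann untouched, as the illustration below the formulas shows.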

Illustration:

Crosstabulations:
        Y
X    7     1
     1     7
Phi = .75; Hamann = .75

        Y
X    4     1
     1    10
Phi = .71; Hamann = .75
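The figures above can be checked directly; a small script reading (a, b, c, d) row-wise from each crosstabulation:

```python
import math

# (a, b, c, d) read row-wise from the two crosstabulations above
for a, b, c, d in [(7, 1, 1, 7), (4, 1, 1, 10)]:
    phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    ham = ((a + d) - (b + c)) / (a + b + c + d)
    print(f"Phi = {phi:.2f}; Hamann = {ham:.2f}")
```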