Solved – Correlation measure between binary variables

Tags: binary data, correlation, matching

I perform an experiment in which the measured results are coded into two variables:
$y_1 \in \{-1,0,1\}$ and $y_2 \in \{-1,0,1\}$.

While $-1$ and $1$ are opposite results, $0$ codes a bad (uncertain) measurement and is very unlikely, so in fact both $y_1$ and $y_2$ are almost binary.

How do I check the correlation (or some other measure of matching) between $y_1$ and $y_2$?

For example, the Pearson correlation is quite bad at describing this phenomenon, since with the following data:

$y_1=[-1,-1,-1,-1,-1,-1,-1,-1,0,-1,1,1]$

$y_2=[-1,-1,1,-1,-1,-1,-1,-1,1,-1,-1,1]$

it shows a poor correlation ($R^2 = 0.196$), when in fact there is a good match (9/12) between the two vectors.

On the other hand, if I use a binary similarity measure such as Hamming similarity, I get a similarity of $0.75$ (only 3/12 mismatches), but with this measure I'm missing the significance (what is the p-value of the match?).
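For reference, a minimal sketch (assuming Python with numpy and scipy, which are not mentioned above) that reproduces both numbers:

```python
import numpy as np
from scipy.stats import pearsonr

y1 = np.array([-1, -1, -1, -1, -1, -1, -1, -1,  0, -1,  1, 1])
y2 = np.array([-1, -1,  1, -1, -1, -1, -1, -1,  1, -1, -1, 1])

r, p = pearsonr(y1, y2)
print(r**2)               # ~0.196, the R^2 quoted above
print(np.mean(y1 == y2))  # 0.75, the Hamming similarity (9/12 positions agree)
```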

Any help in choosing the right measure of significant matching would be appreciated.

Best Answer

You can actually compute a p-value for any similarity measure you like, in a similar fashion to how the p-value for the Pearson correlation is computed.

The simplest way is to do a permutation test or a bootstrap. A good description can be found here.

Permutation test

In the permutation test, you can:

1) randomly shuffle your two vectors and compute the Hamming similarity.

2) repeat (1) many times (say $N = 10^6$) and count the number of times $X$ that the random similarity is larger than the value you observed ($0.75$).

Your one-sided p-value is simply $X/N$ (the proportion of times the shuffled vectors were more similar than the original ones).
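A minimal sketch of this procedure, assuming Python with numpy (`hamming_similarity` is just an illustrative helper, not a library function):

```python
import numpy as np

rng = np.random.default_rng(0)

y1 = np.array([-1, -1, -1, -1, -1, -1, -1, -1,  0, -1,  1, 1])
y2 = np.array([-1, -1,  1, -1, -1, -1, -1, -1,  1, -1, -1, 1])

def hamming_similarity(a, b):
    """Fraction of positions where the two vectors agree."""
    return np.mean(a == b)

observed = hamming_similarity(y1, y2)  # 0.75 for the data above

N = 100_000  # number of random permutations
X = 0
for _ in range(N):
    shuffled = rng.permutation(y2)     # shuffling one vector already breaks any pairing
    # Counting ties (>=) is the usual, slightly conservative convention.
    if hamming_similarity(y1, shuffled) >= observed:
        X += 1

p_value = X / N  # one-sided p-value
print(observed, p_value)
```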

Bootstrapping

As an alternative, you can use bootstrapping. The difference is in step (1): instead of shuffling the vectors, you sample their values with replacement. For example, to randomise this vector:

$y_1=[-1,-1,-1,-1,-1,-1,-1,-1,0,-1,1,1]$

you produce a vector of 12 entries sampled with probabilities:

$p(0) = 1/12$

$p(1) = 2/12$

$p(-1) = 9/12$

This will preserve the proportions of $1$, $0$ and $-1$ only on average, while the permutation test preserves them exactly (which is why it is also called an exact test).

If you want to test a different null hypothesis, you can also sample the values with uniform probability. That will likely give you smaller p-values.
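A corresponding sketch of the bootstrap version, again assuming Python with numpy; the uniform-probability null is shown as a commented-out alternative:

```python
import numpy as np

rng = np.random.default_rng(0)

y1 = np.array([-1, -1, -1, -1, -1, -1, -1, -1,  0, -1,  1, 1])
y2 = np.array([-1, -1,  1, -1, -1, -1, -1, -1,  1, -1, -1, 1])

observed = np.mean(y1 == y2)  # 0.75

N = 100_000
X = 0
for _ in range(N):
    # Resample each vector i.i.d. from its own empirical distribution,
    # e.g. for y1: p(-1)=9/12, p(0)=1/12, p(1)=2/12 (preserved only on average).
    b1 = rng.choice(y1, size=y1.size)
    b2 = rng.choice(y2, size=y2.size)
    # For a uniform null instead, use e.g.:
    # b1 = rng.choice([-1, 0, 1], size=y1.size)
    # b2 = rng.choice([-1, 0, 1], size=y2.size)
    if np.mean(b1 == b2) >= observed:
        X += 1

print(X / N)  # one-sided p-value under this resampling null
```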
