Solved – Correlation measure between binary variables

Tags: binary data, correlation, matching

I perform an experiment in which the measured results are coded into two variables:
$y_1 \in \{-1,0,1\}$ and $y_2 \in \{-1,0,1\}$.

While $-1$ and $1$ are opposite results, $0$ codes a bad (uncertain) measurement and is very unlikely, so in fact both $y_1$ and $y_2$ are almost binary.

How do I check the correlation (or some other measure of matching) between $y_1$ and $y_2$?

For example, the Pearson correlation is quite bad at describing this phenomenon, since with the following data:

$y_1=[-1,-1,-1,-1,-1,-1,-1,-1,0,-1,1,1]$

$y_2=[-1,-1,1,-1,-1,-1,-1,-1,1,-1,-1,1]$

it shows a poor correlation ($R^2 = 0.196$), when in fact there is a good match (9/12) between the two vectors.

On the other hand, if I use a binary similarity measure such as Hamming similarity, I get a similarity of $0.75$ (only 3/12 mismatches), but with this measure I'm missing the significance (what is the p-value of the match?).
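For reference, a minimal sketch (assuming Python with numpy and scipy, which are not mentioned above) that reproduces both numbers:

```python
import numpy as np
from scipy.stats import pearsonr

y1 = np.array([-1, -1, -1, -1, -1, -1, -1, -1,  0, -1,  1, 1])
y2 = np.array([-1, -1,  1, -1, -1, -1, -1, -1,  1, -1, -1, 1])

r, p = pearsonr(y1, y2)
print(r**2)               # ~0.196, the R^2 quoted above
print(np.mean(y1 == y2))  # 0.75, the Hamming similarity (9/12 positions agree)
```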

Any help in choosing the right measure of significant matching would be appreciated.

Best Answer

You can actually compute a p-value for any similarity measure you like, in a similar fashion to how the p-value for the Pearson correlation is computed.

The simplest way is to do a permutation test or a bootstrap. A good description can be found here.

Permutation test

In the permutation test, you can:

1) randomly shuffle your two vectors and compute the Hamming similarity.

2) repeat (1) many times (say $N = 10^6$) and count the number of times $X$ that the random similarity is larger than the value you observed ($0.75$).

Your one-sided p-value is simply $X/N$ (the proportion of times the shuffled vectors were more similar than the original ones).
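A minimal sketch of this procedure, assuming Python with numpy (`hamming_similarity` is just an illustrative helper, not a library function):

```python
import numpy as np

rng = np.random.default_rng(0)

y1 = np.array([-1, -1, -1, -1, -1, -1, -1, -1,  0, -1,  1, 1])
y2 = np.array([-1, -1,  1, -1, -1, -1, -1, -1,  1, -1, -1, 1])

def hamming_similarity(a, b):
    """Fraction of positions where the two vectors agree."""
    return np.mean(a == b)

observed = hamming_similarity(y1, y2)  # 0.75 for the data above

N = 100_000  # number of random permutations
X = 0
for _ in range(N):
    shuffled = rng.permutation(y2)     # shuffling one vector already breaks any pairing
    # Counting ties (>=) is the usual, slightly conservative convention.
    if hamming_similarity(y1, shuffled) >= observed:
        X += 1

p_value = X / N  # one-sided p-value
print(observed, p_value)
```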

Bootstrapping

As an alternative, you can use bootstrapping. The difference is in step (1): instead of shuffling the vectors, you sample their values with replacement. For example, to randomise this vector:

$y_1=[-1,-1,-1,-1,-1,-1,-1,-1,0,-1,1,1]$

you produce a vector of 12 entries sampled with probabilities:

$p(0) = 1/12$

$p(1) = 2/12$

$p(-1) = 9/12$

This will preserve the proportions of $1$, $0$ and $-1$ only on average, while the permutation test preserves them exactly (which is why it is also called an exact test).

If you want to test a different null hypothesis, you can also sample the values with uniform probability. That will likely give you smaller p-values.
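A corresponding sketch of the bootstrap version, again assuming Python with numpy; the uniform-probability null is shown as a commented-out alternative:

```python
import numpy as np

rng = np.random.default_rng(0)

y1 = np.array([-1, -1, -1, -1, -1, -1, -1, -1,  0, -1,  1, 1])
y2 = np.array([-1, -1,  1, -1, -1, -1, -1, -1,  1, -1, -1, 1])

observed = np.mean(y1 == y2)  # 0.75

N = 100_000
X = 0
for _ in range(N):
    # Resample each vector i.i.d. from its own empirical distribution,
    # e.g. for y1: p(-1)=9/12, p(0)=1/12, p(1)=2/12 (preserved only on average).
    b1 = rng.choice(y1, size=y1.size)
    b2 = rng.choice(y2, size=y2.size)
    # For a uniform null instead, use e.g.:
    # b1 = rng.choice([-1, 0, 1], size=y1.size)
    # b2 = rng.choice([-1, 0, 1], size=y2.size)
    if np.mean(b1 == b2) >= observed:
        X += 1

print(X / N)  # one-sided p-value under this resampling null
```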
