Paired test for comparing boolean data

Tags: binary data, hypothesis testing, paired-comparisons, statistical significance

I have $n$ individuals, and for each individual, I have two measurements using two devices (device X and device Y). I know the ground truth for the correct measurement, and I can classify each measurement as accurate or inaccurate. Thus, for each individual I effectively have a boolean value that indicates whether device X was correct or not (say $x_i$) and a boolean value that indicates whether device Y was correct or not (say $y_i$).

Is there a good statistical test to use to compare the accuracy rate of the two devices?

In particular, suppose I notice that device X's accuracy rate appears to be higher than device Y's, based on the $n$ observations (i.e., $(x_1+\dots+x_n)/n > (y_1+\dots+y_n)/n$, where $x_i, y_i = 1$ means the measurement was correct and $0$ means it was incorrect). Now I'd like to test whether the observed difference in accuracy rates is statistically significant. Can I compute a $p$-value for the null hypothesis that the two devices' underlying accuracy rates are actually the same?

Should I use the Wilcoxon signed-rank test? A paired Student's t-test? Some sort of paired Welch t-test (does such a thing even exist)? None of those seems like an obvious fit to me: I know the data isn't normally distributed (it presumably has a Bernoulli distribution), so a t-test isn't perfect (on the other hand I've read that in practice the t-test is fairly robust to deviations from normality so maybe it is OK?); and I can't tell whether a Wilcoxon signed-rank test takes into account the prior knowledge that the data is Bernoulli distributed. Anyway, what would be the most appropriate methodology?

Best Answer

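McNemar's test solves this problem. (Thanks to Glen_b for mentioning this!) It is intended for paired data where each observation is boolean, which is exactly this setting. The test looks only at the discordant pairs, i.e., the individuals for whom exactly one of the two devices was correct: under the null hypothesis of equal accuracy, X-correct/Y-wrong should be as likely as Y-correct/X-wrong. It is also easy to compute, which is convenient.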
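To make the computation concrete, here is a minimal sketch of the exact (binomial) form of McNemar's test in plain Python. The function name `mcnemar_exact` and the example data are made up for illustration and do not come from the question. The sketch counts the discordant pairs $b$ (X correct, Y wrong) and $c$ (Y correct, X wrong) and, assuming that under the null each discordant pair falls either way with probability $1/2$, computes a two-sided binomial $p$-value.

```python
from math import comb


def mcnemar_exact(x, y):
    """Two-sided exact McNemar test for paired boolean outcomes.

    x, y: equal-length sequences of 0/1 (or False/True), where x[i] and
    y[i] say whether device X and device Y were correct on individual i.
    Returns (b, c, p_value).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")

    # Discordant pairs: exactly one of the two devices was correct.
    b = sum(1 for xi, yi in zip(x, y) if xi and not yi)  # X correct, Y wrong
    c = sum(1 for xi, yi in zip(x, y) if yi and not xi)  # Y correct, X wrong
    n = b + c

    if n == 0:
        # No discordant pairs: the data carry no evidence of a difference.
        return b, c, 1.0

    # Under the null, b ~ Binomial(n, 1/2). Double the smaller tail
    # and cap at 1 to get a two-sided p-value.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical data: 20 individuals, 8 where only X was correct,
# 2 where only Y was correct, 10 where the devices agreed.
x = [1] * 8 + [0] * 2 + [1] * 5 + [0] * 5
y = [0] * 8 + [1] * 2 + [1] * 5 + [0] * 5
b, c, p = mcnemar_exact(x, y)
print(b, c, p)
```

Note that the concordant pairs (both right or both wrong) do not affect the result. For larger samples the usual chi-squared approximation $(b-c)^2/(b+c)$ with one degree of freedom is an alternative; if you would rather not hand-roll it, statsmodels provides `statsmodels.stats.contingency_tables.mcnemar`, which takes the 2×2 table of paired outcomes.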

See also "Paired t-test for binary data" for another instance of a closely related hypothesis testing problem.