Solved – One-sided McNemar’s test

exact-testmcnemar-testr

In R, the function mcnemar.test has the following example:

## Agresti (1990), p. 350.
## Presidential Approval Ratings.
##  Approval of the President's performance in office in two surveys,
##  one month apart, for a random sample of 1600 voting-age Americans.
Performance <- matrix(c(794, 86, 150, 570),
                      nrow = 2,
                      dimnames = list("1st Survey" = c("Approve", "Disapprove"),
                                      "2nd Survey" = c("Approve", "Disapprove")))
Performance
mcnemar.test(Performance)
## => significant change (in fact, drop) in approval ratings

I would like to perform a one-sided version, e.g, in the above example, to test whether approval went down. I found a package (exact2x2) that performs two-sided and one-sided test.

exact2x2(Performance, alternative="two.sided", conf.level=0.95, paired=T)
## vs. 
exact2x2(Performance, alternative="greater",   conf.level=0.95, paired=T)

My statistical question is:

How to run a one-sided McNemar's test? What are the mathematical differences between the two?

In addition, I'd like to know:

Why does the base R function not provide a one-sided test option, whereas most other statistical tests provide that option?
Why are the results for the two sided tests different in the package function?

Best Answer

I have described the gist of McNemar's test rather extensively here and here, it may help you to read those. Briefly, McNemar's test assesses the balance of the off-diagonal counts. If people were as likely to transition from approval to disapproval as from disapproval to approval, then the off-diagonal values should be approximately the same. The question then is how to test that they are. Assuming a 2x2 table with the cells labeled "a", "b", "c", "d" (from left to right, from top to bottom), the actual test McNemar came up with is:
$$ Q_{\chi^2} = \frac{(b-c)^2}{(b+c)} $$ The test statistic, which I've called $Q_{\chi^2}$ here, is approximately distributed as $\chi^2_1$, but not quite, especially with smaller counts. The approximation can be improved using a 'continuity correction':
$$ Q_{\chi^2c} = \frac{(|b-c|-1)^2}{(b+c)} $$ This will work better, and realistically, it should be considered fine, but it can't be quite right. That's because the test statistic will necessarily have a discrete sampling distribution, as counts are necessarily discrete, but the chi-squared distribution is continuous (cf., Comparing and contrasting, p-values, significance levels and type I error).

Presumably, McNemar went with the above version due to the computational limitations of his time. Tables of critical chi-squared values were to be had, but computers weren't. Nonetheless, the actual relationship at issue can be perfectly modeled as a binomial:
$$ Q_b = \frac{b}{b+c} $$ This can be tested via a two-tailed test, a one-tailed 'greater than' version, or a one-tailed 'less than' version in a very straightforward way. Each of those will be an exact test.

With smaller counts, the two-tailed binomial version and McNemar's version that compares the quotient to a chi-squared distribution, will differ slightly. 'At infinity', they should be the same.

The reason R cannot really offer a one-tailed version of the standard implementation of McNemar's test is that by its nature, chi-squared is essentially always a one-tailed test (cf., Is chi-squared always a one-sided test?).

If you really want the one-tailed version, you don't need any special package, it's straightforward to code from scratch:

Performance
#             2nd Survey
# 1st Survey   Approve Disapprove
#   Approve        794        150
#   Disapprove      86        570
pbinom(q=(150-1), size=(86+150), prob=.5, lower.tail=FALSE)
# [1] 1.857968e-05
## or:
binom.test(x=150, n=(86+150), p=0.5, alternative="greater")
#   Exact binomial test
# 
# data:  150 and (86 + 150)
# number of successes = 150, number of trials = 236, p-value = 1.858e-05
# alternative hypothesis: true probability of success is greater than 0.5
# 95 percent confidence interval:
#  0.5808727 1.0000000
# sample estimates:
# probability of success 
#              0.6355932

Edit:
@mkla25 pointed out (now deleted) that the original pbinom() call above was incorrect. (It has now been corrected; see revision history for original.) The binomial CDF is defined as the proportion $≤$ the specified value, so the complement is strictly $>$. To use the binomial CDF directly for a "greater than" test, you need to use $(x−1)$ to include the specified value. (To be explicit: this is not necessary to do for a "less than" test.) A simpler approach that wouldn't require you to remember this nuance would be to use binom.test(), which does that for you.

Related Solutions

R – Two-Sample One-Sided Kolmogorov-Smirnov Test vs. One-Sided Wilcoxon-Mann-Whitney Test

Both are testing for displacement of the x variable with respect to the y variable, but the 2 tests have opposite meanings for the term "greater" (and therefor also or "less").

In the ks.test "greater" means that the CDF of 'x' is higher than the CDF of 'y' which means that things like the mean and the median will be smaller values in 'x' than in 'y' if the CDF of 'x' is "greater" than the CDF of 'y'. In 'wicox.test' and 't.test' the mean, median, etc. will be greater in 'x' than in 'y' if you believe that the alternative of "greater" is true.

An example from R:

> x <- rnorm(25)
> y <- rnorm(25, 1)
> 
> ks.test(x,y, alt='greater')

        Two-sample Kolmogorov-Smirnov test

data:  x and y 
D = 0.6, p-value = 0.0001625
alternative hypothesis: two-sided 

> wilcox.test( x, y, alt='greater' )

        Wilcoxon rank sum test

data:  x and y 
W = 127, p-value = 0.9999
alternative hypothesis: true location shift is greater than 0 

> wilcox.test( x, y, alt='less' )

        Wilcoxon rank sum test

data:  x and y 
W = 127, p-value = 0.000101
alternative hypothesis: true location shift is less than 0

Here I generated 2 samples from a normal distribution, both with sample size 25 and standard deviation of 1. The x variable comes from a distribution of mean 0 and the y variable from a distribution of mean 1. You can see the results of ks.test give a very significant result testing in the "greater" direction even though x has the smaller mean, this is because the CDF of x is above that of y. The wilcox.test function shows lack of significance in the "greater" direction, but similar level of significance in the "less" direction.

Both tests are different approaches to testing the same idea, but what "greater" and "less" mean to the 2 tests are different (and conceptually opposite).

R Wilcoxon Signed Rank Test – Understanding One-Tailed Wilcoxon Signed Rank Test Output in R

What does having infinity as the upper bound of a confidence interval mean? Is this because I'm using the one-tailed version of the test?

Yes, it's because you're doing a one-tailed version of the test; no matter how far the sample location is in the 'wrong' direction (i.e. the direction inconsistent with the alternative), it's still consistent with the null - so you're only considering one-sided bounds.

would that mean I would be justified in saying "with a 95% confidence x[,5]'s mean will be within -72 of x[,6]'s?"

No it wouldn't justify that statement. For starters you're not testing means at all unless you make some additional assumptions that would make difference in means coincide with the population equivalent of the location-shift estimate for the test.

In the second place, the location-difference could be in the 'wrong' direction, so 'within' doesn't quite work either.

In the third place, two locations aren't normally considered to be 'within' a negative distance of each other.

You could say something like "the estimated improvement from the first to the second algorithm was 21" (and then give the units!). Note that I said 21 and not 72. If you explain to the reader what the pseudo-median of the differences is, you can give more detail about what this difference is measuring.

What does the V value mean with regard to my data?

It's the value of the Signed Rank statistic. Check the references mentioned below for how it's calculated (particularly Hollander & Wolfe if you can find it since that's the references given in the R help, so the statistic is sure to correspond).

Specifically, the two main definitions that I've seen are either that all signed ranks are added (this is the version on the Wikipedia page), OR that only the positive-signed ranks are added. It looks like R uses the second one. That is, if $x$ and $y$ are the two paired samples, so the differences $x-y$ are tested, then

 sum(rank(abs(x-y))[x>y])

should give the same statistic as R. Like so:

> sum(rank(abs(x[,5]-x[,6]))[x[,5]>x[,6]])
[1] 22

From what I can see it is the difference between median(x[,5]) and median(x[,6]

It isn't. Well, they might coincide occasionally (as with your sample) but that's not what is going on. You should probably start by reading up about how the statistic works. I'd suggest something like Conover's Practical Nonparametric Statistics. Or, ideally, you could check the Signed Rank Test reference in the R help on wilcox.test (Hollander & Wolfe).

The actual value of the statistic isn't usually of interest. The estimate of the size of the location-shift would be relevant (and doesn't depend on which definition of the statistic is used). That is, the fact that 0 is inside the interval matters a lot, the "-21" matters somewhat, the "-72" might matter, the "22" probably doesn't (though there's little harm in quoting it if the definition of the statistic is clear to the reader).

Best Answer

Related Solutions

R – Two-Sample One-Sided Kolmogorov-Smirnov Test vs. One-Sided Wilcoxon-Mann-Whitney Test

R Wilcoxon Signed Rank Test – Understanding One-Tailed Wilcoxon Signed Rank Test Output in R

Related Question