Let's step back and look at what the data would look like. From what you describe, you have 3 algorithms (i.e. groups or treatments) and 10 datasets (i.e. subjects). In this case, you have a within-subjects design (i.e. repeated measures) with one factor. One way to represent this is:
set.seed(123)
df <- data.frame(dataset   = rep(seq(10), 3),
                 algorithm = rep(c("ML1", "ML2", "ML3"), each = 10),
                 Accuracy  = runif(30))
> df
dataset algorithm Accuracy
1 1 ML1 0.28757752
2 2 ML1 0.78830514
3 3 ML1 0.40897692
4 4 ML1 0.88301740
5 5 ML1 0.94046728
6 6 ML1 0.04555650
7 7 ML1 0.52810549
8 8 ML1 0.89241904
9 9 ML1 0.55143501
10 10 ML1 0.45661474
11 1 ML2 0.95683335
12 2 ML2 0.45333416
13 3 ML2 0.67757064
14 4 ML2 0.57263340
15 5 ML2 0.10292468
16 6 ML2 0.89982497
17 7 ML2 0.24608773
18 8 ML2 0.04205953
19 9 ML2 0.32792072
20 10 ML2 0.95450365
21 1 ML3 0.88953932
22 2 ML3 0.69280341
23 3 ML3 0.64050681
24 4 ML3 0.99426978
25 5 ML3 0.65570580
26 6 ML3 0.70853047
27 7 ML3 0.54406602
28 8 ML3 0.59414202
29 9 ML3 0.28915974
30 10 ML3 0.14711365
You will typically see examples that use 'subject' as a label. In your case, your 'subjects' are your datasets. If you could assume normality, you would do a repeated-measures ANOVA. However, you state that you know the accuracies are not normally distributed, so you naturally want a non-parametric method. Your data are also balanced (10 samples per group), so you can use the Friedman test (which is essentially a non-parametric repeated-measures ANOVA).
If you get a significant p-value from the test, you would do a post-hoc analysis with pairwise paired Wilcoxon signed-rank tests, with some sort of correction (e.g. Bonferroni, Holm, etc.). You would not use Mann-Whitney because you have paired/repeated-measures data.
Lastly, you probably want the effect size of any significant differences. This also uses the Wilcoxon test. I cannot recall an R function for it right now, but the equation is very simple:
$$r=\frac{Z}{\sqrt{N}}$$
where Z is the Z-score and N is the total sample size across the two groups being compared. You can get this Z-score using the wilcoxsign_test function from the coin package.
Using the above data, this can be done in R as follows. Please note that the above data were just randomly generated, so there is no significance; this is only for demonstrating the code:
# Friedman test
friedman.test(Accuracy ~ algorithm | dataset, data = df)

# Post-hoc tests with Bonferroni correction
with(df, pairwise.wilcox.test(Accuracy, algorithm,
                              p.adjust.method = "bonferroni", paired = TRUE))

# Get the Z-score for calculating the effect size (here ML1 vs ML2)
library(coin)
wilcoxsign_test(Accuracy ~ factor(algorithm) | factor(dataset),
                data = subset(df, algorithm %in% c("ML1", "ML2")))

# Calculate the effect size; in this case Z = -0.2548,
# and the two groups together contain N = 20 observations
0.2548 / sqrt(20)
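If you find yourself computing this repeatedly, the formula is easy to wrap in a small helper. Note this function is my own convenience sketch, not part of any package:

```r
# Effect size r = |Z| / sqrt(N), where Z is the Z-score from the
# Wilcoxon signed-rank test and N is the total number of observations
# in the two groups being compared
wilcox_r <- function(z, n) abs(z) / sqrt(n)

wilcox_r(-0.2548, 20)  # ~0.057, well below Cohen's 0.1 threshold for a small effect
```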
From what I understood of the OP's question:
1) He ran an omnibus Kruskal-Wallis test with significant results.
2) He wants to run a pairwise test on all groups and is in doubt whether to use Mann-Whitney or Dunn's test.
3) He wants to run his own multiple-comparison adjustment procedure, so he needs the uncorrected p-values of each pairwise comparison.
The source of confusion is that the Dunn test implemented in GraphPad seems to already include a multiple-comparison adjustment (which looks like a Bonferroni adjustment - see http://www.graphpad.com/guides/prism/6/statistics/index.htm?stat_nonparametric_multiple_compari.htm).
Answering:
2) You should use the Dunn test. Both the CV answer by @Alexis to Post-hoc tests after Kruskal-Wallis: Dunn's test or Bonferroni corrected Mann-Whitney tests? and this page from XLSTAT http://www.xlstat.com/en/products-solutions/feature/kruskal-wallis-test.html agree that Dunn (or Conover-Iman or Steel-Dwass-Critchlow-Fligner) is the appropriate post-hoc test after a KW test (disclosure - I did not know that until today; I had been using Mann-Whitney as the post-hoc to KW until now).
3) I did not understand the GraphPad page, but let me point you to the dunn.test package in R, which does what the OP wants. In particular, it distinguishes the Dunn test from the multiple-comparison adjustments, and one can set the adjustment method to "none", which will return the unadjusted p-values.
Also notice that among the adjustment procedures are the Benjamini-Hochberg (1995) and the Benjamini-Yekutieli (2001) adjustments, which are FDR procedures (maybe one of them is the one the OP is thinking of using).
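To illustrate, here is a small sketch using the dunn.test package on made-up data (the group labels and values are hypothetical, just to show the method argument):

```r
library(dunn.test)

# Hypothetical data: 30 observations in 3 groups
set.seed(1)
x <- c(rnorm(10, 0), rnorm(10, 0.5), rnorm(10, 1))
g <- rep(c("A", "B", "C"), each = 10)

dunn.test(x, g, method = "none")  # unadjusted p-values for your own procedure
dunn.test(x, g, method = "bh")    # Benjamini-Hochberg FDR adjustment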
Let me stress what many of the commenters have been saying: there is no good reason to use the unadjusted p-values EXCEPT to implement your own adjustment procedure - no decision should be made based on the unadjusted p-values.
Best Answer
It is certainly fine to do pairwise chi-square tests, but that isn't the only possibility. Another is to fit a generalized linear model and follow it up with pairwise comparisons of its predictions. In R, it goes something like this:
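The original code block does not appear to have survived here, so the following is only a sketch. The counts (successes and failures for three hypothetical groups) are made up; I am assuming the model was fit with glm:

```r
# Hypothetical counts for three groups (the original data were not shown)
counts <- data.frame(group   = factor(c("A", "B", "C")),
                     success = c(34, 41, 52),
                     failure = c(66, 59, 48))

# Logistic regression: models log-odds of success for each group
fit <- glm(cbind(success, failure) ~ group, family = binomial, data = counts)
summary(fit)
```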
This fits a logistic regression model for predicting $\log\{p_i/(1-p_i)\}, i=1,2,3$. A chi-squared test (not the same as the Pearson chi-square, but similar) for $H_0:p_1=p_2=p_3$ is obtained via
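The code for this test is also missing from the answer as preserved; one standard way to obtain such a likelihood-ratio chi-squared test is to compare against the intercept-only model. This sketch reuses the same hypothetical counts, so its statistic will not match the value quoted from the original data:

```r
# Hypothetical counts (the original data were not shown)
counts <- data.frame(group   = factor(c("A", "B", "C")),
                     success = c(34, 41, 52),
                     failure = c(66, 59, 48))
fit  <- glm(cbind(success, failure) ~ group, family = binomial, data = counts)
null <- update(fit, . ~ 1)  # intercept-only model: p1 = p2 = p3

# Likelihood-ratio chi-squared test of H0: p1 = p2 = p3, with 2 d.f.
lrt <- anova(null, fit, test = "Chisq")
lrt
```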
so that the test statistic is $\chi^2 = 4.55$ with 2 d.f.
The post-hoc estimates and comparisons are done in a manner similar to that for ordinary ANOVA models:
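The code for this step was not preserved either. With the emmeans package (the successor to lsmeans, which this answer's terminology suggests was used), a sketch might look like this, again with hypothetical counts:

```r
library(emmeans)

# Hypothetical counts and model (the original data were not shown)
counts <- data.frame(group   = factor(c("A", "B", "C")),
                     success = c(34, 41, 52),
                     failure = c(66, 59, 48))
fit <- glm(cbind(success, failure) ~ group, family = binomial, data = counts)

# Estimated marginal means on the log-odds scale, plus pairwise
# contrasts with the default Tukey adjustment
emmeans(fit, pairwise ~ group)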
The least-squares means (first table) are predictions from the model for $\log\{p_i/(1-p_i)\}$ and the contrasts are pairwise comparisons of these quantities. Alternatively, you could back-transform these results and obtain estimates of the $p_i$ themselves, and of the odds ratios $\frac{p_i}{1-p_i}/\frac{p_j}{1-p_j}$:
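Continuing the hedged emmeans sketch above (same hypothetical model), the back-transformation is a one-argument change:

```r
# type = "response" back-transforms: the means become estimated
# probabilities p_i, and the pairwise contrasts become odds ratios
emmeans(fit, pairwise ~ group, type = "response")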
The advantage of this approach is that you obtain comparisons of meaningful quantities, rather than just chi-squares and $P$ values. The Tukey adjustment on the comparisons is only approximate; but then, so are the results of pairwise chi-squared tests, and the Bonferroni correction is more conservative.