Solved – Comparing multiple distinct proportions between two groups

chi-squared-test, fishers-exact-test, hypothesis testing, multiple-comparisons, p-value

I have two groups (group 1: $n=100$, and group 2: $n=200$), and multiple proportions for each group (where the number represents the proportion of individuals in the group with each disease). Example:

                           group1    group2
high cholesterol           0.20      0.28
high blood pressure        0.18      0.16
cardiovascular disease     0.13      0.20
diabetes                   0.25      0.20
vitamin d deficiency       0.05      0.15

I want to test whether there is a significant overall difference between the two groups across the disease categories. Since these data do not form a contingency table, I cannot use a chi-squared test or Fisher's exact test directly. I know how to compare a single proportion across two groups, but is there a way to compare multiple proportions across the two groups simultaneously and get a single p-value? Of course, I could test each disease individually with a two-proportion z-test and then adjust for multiple comparisons, but can I test everything at once (in the flavor of Fisher's exact test)?

UPDATE: The categories are not disjoint as an individual can have 0 diseases or $>1$ diseases (seen by the fact that the proportions for either group do not add up to 1), which is why we cannot use the usual strategies that are used for a contingency table. In essence, instead of comparing the proportions $A$ and $B$ across two groups, I am trying to compare the vectors of proportions, $A = [a_1, a_2, a_3, a_4, a_5]$ and $B = [b_1, b_2, b_3, b_4, b_5]$ across two groups.

Best Answer

Continuing from my comment: On the assumption that disease categories are mutually exclusive, and using an additional category None so that groups total $n_1 = 100, n_2 = 200,$ as stated, here is a chi-squared test of homogeneity (in R) of disease category across groups.

# Counts per category (five diseases, then None); each row sums to the group size
G1 = c(20, 17, 13, 25,  5, 20)
G2 = c(56, 32, 40, 40, 20, 12)
TBL = rbind(G1, G2)
out = chisq.test(TBL);  out

        Pearson's Chi-squared test

data:  TBL
X-squared = 18.593, df = 5, p-value = 0.002288

The null hypothesis of homogeneity is rejected (P-value $0.0023).$

Observed counts $X_{ij}$ echo the input, expected counts $E_{ij}$ are based on row and column totals of the table (assuming homogeneity). For example, $E_{11} = 100(76/300) = 25.33333.$
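
As a minimal sketch of how the expected counts arise, the block below rebuilds them from the row and column totals, using the same counts as the R session above. The name `E` is chosen here for illustration and is not part of the original session.

```r
# Counts copied from the session above (rows: groups; columns: 5 diseases + None)
G1 <- c(20, 17, 13, 25,  5, 20)
G2 <- c(56, 32, 40, 40, 20, 12)
TBL <- rbind(G1, G2)

# Expected count under homogeneity: (row total * column total) / grand total
E <- outer(rowSums(TBL), colSums(TBL)) / sum(TBL)
round(E, 5)

# First cell: 100 * 76 / 300
E[1, 1]
```

The result should agree with `chisq.test(TBL)$expected`.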

The chi-squared statistic (X-squared in output) is $$ Q = \sum_{i=1}^2\sum_{j=1}^6 \frac{(X_{ij}-E_{ij})^2}{E_{ij}}=18.593,$$ which is distributed approximately as $\mathsf{Chisq}(\nu),$ where the number of degrees of freedom is $\nu = (2-1)(6-1) = 5.$ The P-value is the probability $0.0023$ under the density curve of $\mathsf{Chisq}(5)$ to the right of $18.593.$
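
The statistic and P-value can also be reproduced by hand. The following is a sketch under the same counts as above; `E`, `Q`, `nu`, and `p` are names picked here to mirror the notation in the text.

```r
# Counts copied from the session above
G1 <- c(20, 17, 13, 25,  5, 20)
G2 <- c(56, 32, 40, 40, 20, 12)
TBL <- rbind(G1, G2)

# Expected counts under homogeneity
E <- outer(rowSums(TBL), colSums(TBL)) / sum(TBL)

# Pearson chi-squared statistic: sum of (observed - expected)^2 / expected
Q <- sum((TBL - E)^2 / E)

# Degrees of freedom: (rows - 1) * (columns - 1)
nu <- (nrow(TBL) - 1) * (ncol(TBL) - 1)

# P-value: area under the Chisq(nu) density to the right of Q
p <- pchisq(Q, df = nu, lower.tail = FALSE)

c(Q = Q, df = nu, p.value = p)
```

These values should match the `X-squared`, `df`, and `p-value` fields printed by `chisq.test`.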

For $Q$ to be approximately chi-squared distributed, the expected counts $E_{ij}$ should all exceed $5,$ which is true for your data.

out$obs
   [,1] [,2] [,3] [,4] [,5] [,6]
G1   20   17   13   25    5   20
G2   56   32   40   40   20   12
out$exp
       [,1]     [,2]     [,3]     [,4]      [,5]     [,6]
G1 25.33333 16.33333 17.66667 21.66667  8.333333 10.66667
G2 50.66667 32.66667 35.33333 43.33333 16.666667 21.33333
out$res
         [,1]       [,2]      [,3]       [,4]       [,5]      [,6]
G1 -1.0596259  0.1649572 -1.110272  0.7161149 -1.1547005  2.857738
G2  0.7492686 -0.1166424  0.785081 -0.5063697  0.8164966 -2.020726

The Pearson residuals are the square roots of the $rc = 12$ contributions $C_{ij} = \frac{(X_{ij}-E_{ij})^2}{E_{ij}},$ each given the sign of the corresponding difference $D_{ij} = X_{ij}-E_{ij}.$
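
To make the link between residuals and contributions concrete, here is a short sketch that computes the residuals directly as $(X_{ij}-E_{ij})/\sqrt{E_{ij}}$ from the same counts. The names `E` and `R` are illustrative choices, not part of the original session.

```r
# Counts copied from the session above
G1 <- c(20, 17, 13, 25,  5, 20)
G2 <- c(56, 32, 40, 40, 20, 12)
TBL <- rbind(G1, G2)

# Expected counts under homogeneity
E <- outer(rowSums(TBL), colSums(TBL)) / sum(TBL)

# Pearson residuals: signed square roots of the individual contributions
R <- (TBL - E) / sqrt(E)
round(R, 4)

# Squaring and summing the residuals recovers the chi-squared statistic
sum(R^2)
```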

Residuals with the largest absolute values point to the contributions most responsible for making $Q$ large enough to reject the null hypothesis. Here the key residuals are for the category None: the number of G1 subjects having none of the five diseases is larger than expected if categories were homogeneous across groups. Beyond that, disease categories 1 and 5 seem to differ between the groups.

Separate ad hoc tests (perhaps at the 1% level, to avoid 'false discovery' according to the Bonferroni method) would show which differences are significant.
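
One possible way to run such follow-up tests is sketched below. It assumes a two-proportion test (`prop.test`) for each category against all others, with a Bonferroni adjustment via `p.adjust`; this is one reasonable choice, not necessarily the procedure intended above. The category labels are taken from the question, with `none` added for the extra category, and the names `n`, `p_raw`, and `p_bonf` are illustrative.

```r
# Counts copied from the session above (five diseases, then None)
G1 <- c(20, 17, 13, 25,  5, 20)
G2 <- c(56, 32, 40, 40, 20, 12)
n  <- c(sum(G1), sum(G2))   # group sizes: 100 and 200

labels <- c("high cholesterol", "high blood pressure",
            "cardiovascular disease", "diabetes",
            "vitamin d deficiency", "none")

# One two-proportion test per category: group 1 vs group 2
p_raw <- sapply(seq_along(G1), function(j) {
  prop.test(x = c(G1[j], G2[j]), n = n)$p.value
})

# Bonferroni adjustment: multiply each p-value by the number of tests, cap at 1
p_bonf <- p.adjust(p_raw, method = "bonferroni")

data.frame(category = labels,
           p_raw    = signif(p_raw, 3),
           p_bonf   = signif(p_bonf, 3))
```

Comparing `p_bonf` with 0.05 is equivalent to comparing each raw p-value with 0.05/6, which is close to the 1% level suggested above.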
