Solved – chi-squared test when variables are dependent

chi-squared-test, non-independent, r

For a square matrix, is it appropriate to use a chi-squared distribution when each level of the variables is assumed to have the same overall frequency?

Specifically, I'm analyzing a dataset of the number of genes that have increased expression in an experimental treatment in two related species. My data look like this, with species 1 on the columns and species 2 on the rows:

                  Low       Intermediate      High
 Low              2594          163            405
 Intermediate     1350          558            155
 High              467           65            322

A priori, I expect each gene's class to have been the same in the common ancestor of the two species, so under no divergence the off-diagonals would be 0. That is:

                  Low       Intermediate      High
 Low              3786           0              0
 Intermediate       0          1425             0
 High               0            0             868

My question is which of the off-diagonal cells have diverged more than expected by chance.

As a simple first pass, I've modified the standard chisq.test (in R) to use the overall total for each class (Low, Intermediate, High) rather than the marginal total for each class, since the standard test assumes the species are independent, which they are not.

# data: species 1 on columns, species 2 on rows
d <- matrix(c(2594L, 1350L, 467L, 163L, 558L, 65L, 405L, 155L, 322L), nrow = 3, ncol = 3)

# row and column sums
rs <- rowSums(d)
cs <- colSums(d)

# overall ("grand") proportion for each class, averaging the row and column margins
gm <- (rs + cs) / (2 * sum(d))

# expected count for each cell, assuming equal class frequencies in both species
Ec <- outer(gm, gm, "*") * sum(d)

where Ec is the expected count for each cell, computed from the grand proportion of each class.

Is it reasonable to use a chi-squared distribution to determine whether the observed values deviate from these expected values by more than chance?

Ec.chistat <- sum((d - Ec)^2 / Ec)
pchisq(Ec.chistat, df = (nrow(d) - 1) * (ncol(d) - 1), lower.tail = FALSE)
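To see which cells are driving any discrepancy, one option (a rough sketch, using the d and Ec objects defined above) is to look at the per-cell Pearson residuals, (O - E) / sqrt(E):

# per-cell Pearson residuals against the expected counts Ec;
# large positive off-diagonal values flag cells with more genes than expected
round((d - Ec) / sqrt(Ec), 2)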

I realize I could probably use a GLM for this, but it's convenient to keep the data in table format so I can directly ask which of the off-diagonal cells have increased more than expected by chance.

Note: for comparison, the standard chi-squared test assuming independence of the variables would be:

# standard test of independence: expected counts from the marginal totals
rs <- rowSums(d)
cs <- colSums(d)
n <- sum(d)
(E <- outer(rs, cs, "*") / n)
chistat <- sum((d - E)^2 / E)
pchisq(chistat, df = (nrow(d) - 1) * (ncol(d) - 1), lower.tail = FALSE)
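For what it's worth, chisq.test(d) should reproduce this statistic and p-value for a 3x3 table (the continuity correction only applies to 2x2 tables), so it can serve as a sanity check:

# built-in test of independence; should match chistat and the p-value above
chisq.test(d)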

Best Answer

When you say the variables are dependent, I think you mean that the observations are dependent. Since your contingency table shows the same response levels for the row and column variables, I am guessing that you have paired polytomous data. Examples of this would be self-assessment of socio-economic status in husband-wife dyads, or presence/absence of a congenital disease in twin pairs.

With paired categorical data, there are two types of hypotheses one can test: rater agreement versus standard row/column dependence.

With agreement, the outcome of interest can be thought of as a binary indicator of Yes (the two measurements agree) or No (they disagree), so the frequencies on the diagonal of the table are of interest. Testing for significance of agreement can be based on percentage agreement or on Cohen's kappa, which corrects for the agreement expected by chance given the marginal frequencies. For polytomous responses, Cohen's kappa has a weighted analogue that gives partial credit to disagreements in adjacent cells.
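As a minimal sketch of the chance-corrected agreement idea, unweighted Cohen's kappa can be computed directly from the 3x3 table d above (packaged implementations, e.g. in the psych package, also provide the weighted version):

# unweighted Cohen's kappa from the contingency table d
n  <- sum(d)
po <- sum(diag(d)) / n                    # observed proportion of agreement
pe <- sum(rowSums(d) * colSums(d)) / n^2  # agreement expected by chance from the margins
(kappa <- (po - pe) / (1 - pe))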

With dependence and binary outcomes, McNemar's paired odds ratio can be used to test for a difference in the frequency of the paired responses by conditioning on the first response and comparing it with the second. This has the effect of discarding the concordant pairs, since they tell us nothing about the difference.
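In R, stats::mcnemar.test() covers this; given a square table with more than two levels it computes the corresponding test of symmetry (Bowker's test), comparing each off-diagonal cell with its mirror image, which is close in spirit to asking which transitions between species are asymmetric:

# test of symmetry for the 3x3 table: compares d[i, j] against d[j, i]
# (for a 2x2 table this reduces to the classic McNemar test)
mcnemar.test(d)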

With paired ordinal outcomes, several methods have been developed, and none of them has been received particularly favorably. It turns out that the paired t-test has very good power to detect a difference in "mean ordinal response" with paired ordinal outcomes. In your case of low / intermediate / high, the ordinal coding of these values would be 0 / 1 / 2.
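A sketch of that approach for your table, assuming the scoring Low = 0, Intermediate = 1, High = 2 (the table is expanded back into one score difference per gene, so the paired t-test becomes a one-sample t-test on the differences):

# score the classes 0/1/2 and expand the table into per-gene paired differences
scores <- 0:2
grid   <- expand.grid(species2 = scores, species1 = scores)  # matches column-major order of d
diffs  <- rep(grid$species1 - grid$species2, times = as.vector(d))
t.test(diffs)  # paired t-test, written as a one-sample test on the differences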
