Solved – Residuals colors of the mosaic plot are all dark. Implies

chi-squared-testcontingency tablesdata visualizationinterpretationr

I am working on interpreting the nature of the association of a contingency table. I tried to provide a mosaic plot of the data that I have and it resulted to

enter image description here

Can anyone help me interpret this data set?

This is the data set

enter image description here

Best Answer

The chi-squared test for a two-way contingency table is a test of independence. There is a certain proportion of the total counts in the table in each cell. That proportion can be compared to the proportion that would be expected if the rows and columns were independent. For a given cell, the proportion expected under independence is the proportion of total counts in its row times the proportion of total counts in its column. If the cell proportions differ from the expected proportions by enough, it is no longer reasonable to believe that the data come a process in which the row variable is independent of the column variable and the test will be significant.

To help interpret your data, we can also examine the residuals of the table: that is, the scaled difference between the observed proportion and the expected proportion to see where in the table the differences lie. That is what the colors represent in the plot. Blues mean the observed proportion is higher than it 'ought' to be, and reds mean it is lower. The darker colors indicate the scaled difference is large, whereas lighter colors mark smaller differences.

Your data are far from independence with some cells having much larger counts and some much smaller counts than would be found if the provinces and crop types were unrelated.


Update:

myData = read.table(text="crop.type ag.n    ag.s    d.i s.n s.s
corn        15153   97864   63  1558    10768
irrg.palay  79300   145525  4698    48714   77720
palay       99786   292019  6507    63694   112493
rain.palay  20486   146494  1809    14980   34773
wht.corn        9983    38846   14  682 7562
yel.corn    5170    59018   50  876 3206", header=T)
rn = myData[,1]
cn = names(myData)[2:6]
myData = as.table(as.matrix(myData[,-1]))
rownames(myData) = rn
names(dimnames(myData)) = c("crop.type", "province")
myData = myData[c(1,5,6,3,2,4),]
myData
#             province
# crop.type     ag.n   ag.s  d.i   s.n    s.s
#   corn       15153  97864   63  1558  10768
#   wht.corn    9983  38846   14   682   7562
#   yel.corn    5170  59018   50   876   3206
#   palay      99786 292019 6507 63694 112493
#   irrg.palay 79300 145525 4698 48714  77720
#   rain.palay 20486 146494 1809 14980  34773
rbind(myData[1,], myData[2,]+myData[3,])
#       ag.n  ag.s d.i  s.n   s.s
# [1,] 15153 97864  63 1558 10768
# [2,] 15153 97864  64 1558 10768
rbind(myData[4,], myData[5,]+myData[6,])
#       ag.n   ag.s  d.i   s.n    s.s
# [1,] 99786 292019 6507 63694 112493
# [2,] 99786 292019 6507 63694 112493

Looking more closely at your data reveals several issues. First, you have both crop types (corn and palay--rice, I gather), and subtypes (e.g., white.corn and yellow.corn) in the same table. Those cannot be analyzed together. (You also seem to have a typo somewhere in the corn values in the Dinagat islands.)

So you need to decide what data you actually want to analyze. One possibility is to analyze the subtypes by region. The other possibility is to analyze the types, followed by separate analyses of each set of subtypes.

d1 = myData[c(2:3,5:6),]
d2 = myData[c(1,4),]
dc = myData[2:3,]
dp = myData[5:6,]

You have so much data that all of those analyses are wildly significant.

chisq.test(d1)
# X-squared = 79688, df = 12, p-value < 2.2e-16
chisq.test(d2)
# X-squared = 34481, df = 4, p-value < 2.2e-16
chisq.test(dc)
# X-squared = 6539.5, df = 4, p-value < 2.2e-16
chisq.test(dp)
# X-squared = 39611, df = 4, p-value < 2.2e-16

Having determined that the results are significant, there are various ways to try to better understand the pattern in your data. When you have >2 rows and columns (as you do with d1), one way to explore the contingency table is to run a correspondence analysis (A). Rows and columns are plotted with different symbols and colors. Rows (columns) that are closer together are more similar. In addition, the locations of the columns are plotted relative to each row such that the columns that make up the highest proportion of the row's total are closest (and vice versa).

You can also make a mosaic plot, but bear in mind that it plots proportions of one variable conditional on the other variable. That is, you will get different plots of rows on columns (C) than columns on rows (B). If you think of one variable as more of a response variable, you should make that the conditional one. I gather you think of crop.type as more of a response, but you plotted that as the variable conditioned on instead. (To get the plot switched around in R, you use mosaicplot(t(table)).)

library(ca)
windows()
  layout(matrix(c(1,1,2,3), nrow=2, ncol=2, byrow=T))
  plot(ca(d1), ylim=c(-.25,.25), main="A")
  mosaicplot(d1,    shade=TRUE, main="B")
  mosaicplot(t(d1), shade=TRUE, main="C")

enter image description here

These are all ways to explore the global pattern in your table. You ask which cell is contributing to the significant effect, but it is plain they all are. You can always examine the residuals directly.

chisq.test(d1)$residuals
#             province
# crop.type         ag.n       ag.s        d.i       s.n       s.s
#   wht.corn     6.28086   39.50987 -22.545986 -63.60499 -24.84981
#   yel.corn   -57.11307  107.44259 -23.351946 -68.83261 -80.46214
#   irrg.palay  86.21429 -118.48605  23.458883  85.24063  60.03836
#   rain.palay -81.30681   70.94950  -5.359382 -37.79327 -18.93443
chisq.test(d1)$residuals[which.max(abs(chisq.test(d1)$residuals))]
# [1] -118.486

On the other hand if you have a contingency table with just two rows, a natural way to plot / think about / explore the data is as a set of proportions.

props = function(tab){ tab[1,]/colSums(tab) }
my.plot = function(tab, ...){
  cs    = colSums(tab)
  props = props(tab)
  SE    = sqrt((props*(1-props))/cs)
  bp    = barplot(props, ylim=c(0,1), ...)
  box()
  arrows(x0=bp, y0=props-SE, y1=props+SE, code=3, angle=90, length=.1)
}
windows()
  layout(matrix(1:3, nrow=3))
  my.plot(d2, ylab="% corn",            main="Corn vs. Palay")
  my.plot(dc, ylab="% white corn",      main="White vs Yellow Corn")
  my.plot(dp, ylab="% irrigated palay", main="Irrigated vs Rain-fed Palay",xlab="Region")

enter image description here

Again, you have so much data that most of the error bars aren't even visible. Without bothering to actually test anything, it is reasonable to imagine that all bars are significantly different from each other.

In this latter set of analyses, you don't just have one contingency table but three. You may want to know if the profile is similar across the regions. But this is harder to see across a set of bar graphs. A solution is to use a dotplot with all three sets of proportions represented together.

ord = order(d2[1,]/colSums(d2))
windows()
  dotchart(props(d2)[ord], pch="c", labels=colnames(d2)[ord], xlim=c(0,1))
  points(props(d2)[ord], 1:5, pch="c", col="#1b9e77")
  points(props(dc)[ord], 1:5, pch="w", col="#d95f02")
  points(props(dp)[ord], 1:5, pch="i", col="#7570b3")
  legend("top", legend=c("% corn","% white corn","% irrigated palay"), horiz=TRUE, 
         pch=c("c","w","i"), col=c("#1b9e77","#d95f02","#7570b3"))

enter image description here

A different solution would be to make a scatterplot matrix of the proportions.

dframe = data.frame(corn.prct=props(d2), white.prct=props(dc), irrigated.prct=props(dp))
windows()
  pairs(dframe, pch=c("A","g","d","S","u"))
  par(xpd=TRUE)
  legend(.225, .338, legend=colnames(d2), pch=c("A","g","d","S","u"))

enter image description here