Solved – How to determine significant associations in a mosaic plot

contingency tables, data visualization, interpretation, r

I have a general question about how to interpret significant associations between categorical variables in a mosaic plot.

For example, in this plot, based on the Pearson residuals, can we say that residual values in $[2.0, 5.1]$ and $[-3.0, -2.0]$ indicate a statistically significant association, for instance in the $40+$ age group with memory and a moderate attitude? And which Pearson residual ranges should we consider: $[2.0, 5.1]$, only $[4.0, 5.1]$, or $[-3.0, -2.0]$ as well?

[Image: mosaic plot from the question, shaded by Pearson residuals]

Best Answer

The formula for the Pearson residuals (which some sources, including the R documentation quoted below, call standardized residuals) is:

$$\begin{align}\text{Pearson's residuals}\,&=\,\frac{\text{Observed - Expected}}{ \sqrt{\text{Expected}}}\\ d_{ij}&=\frac{n_{ij}-m_{ij}}{\sqrt{m_{ij}}} \end{align}$$

where $n_{ij}$ is the observed count and $m_{ij} = E(n_{ij})$ is the expected count, under the fitted model, of the cell in the $i$-th row and the $j$-th column.

The sum of the squared Pearson residuals over all cells is the Pearson chi-square statistic.
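As a quick check of this identity, here is a minimal sketch on a small two-way table. The counts in `m` are made up purely for illustration and are not taken from the question; `chisq.test` returns Pearson residuals in its `residuals` component.

```r
# Hypothetical 2 x 3 table of counts (illustrative values, not from the question)
m <- matrix(c(12, 5, 7, 9, 4, 13), nrow = 2)

test <- chisq.test(m)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(m), colSums(m)) / sum(m)

# Pearson residuals computed by hand: (observed - expected) / sqrt(expected)
d <- (m - expected) / sqrt(expected)

# They should match the residuals reported by chisq.test ...
agree_with_r <- isTRUE(all.equal(d, test$residuals, check.attributes = FALSE))

# ... and their squared sum should equal the chi-square statistic
chisq_by_hand <- sum(d^2)
print(agree_with_r)
print(c(by_hand = chisq_by_hand, chisq.test = unname(test$statistic)))
```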

From "Extending Mosaic Displays: Marginal, Partial, and Conditional Views of Categorical Data" by Michael Friendly:

Under the assumption of independence, these values roughly correspond to two-tailed probabilities $p < .05$ and $p < .0001$ that a given value of $| d_{ij} |$ exceeds $2$ or $4$.

Notice the following footnote:

For exploratory purposes, we do not usually make adjustments (e.g., Bonferroni) for multiple tests because the goal is to display the pattern of residuals in the table as a whole. However, the number and values of these cutoffs can be easily set by the user.
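The probabilities attached to the cutoffs of 2 and 4 can be reproduced from the standard normal distribution. The sketch below also shows what a Bonferroni-adjusted cutoff might look like; the choice of 12 cells (a 3 x 2 x 2 table) and an overall 5% level are assumptions made here for illustration, not something the paper prescribes.

```r
# Two-tailed standard normal probabilities for the cutoffs |d| > 2 and |d| > 4
cutoffs <- c(2, 4)
p_two_tailed <- 2 * pnorm(-cutoffs)
print(p_two_tailed)  # about 0.0455 and 6.3e-05

# Assumption for illustration: Bonferroni adjustment over the 12 cells of a
# 3 x 2 x 2 table at an overall two-tailed 5% level
n_cells <- 12
bonferroni_cutoff <- qnorm(1 - 0.05 / (2 * n_cells))
print(bonferroni_cutoff)
```

With such an adjustment, a residual would need to exceed the adjusted cutoff in absolute value, rather than 2, before being flagged at the 5% level.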

We are dealing with a multi-way table, in reference to which the R documentation for the mosaicplot function (graphics package) states:

Extended mosaic displays show the standardized residuals of a loglinear model of the counts from by the color and outline of the mosaic's tiles. (Standardized residuals are often referred to a standard normal distribution.) Negative residuals are drawn in shaded of red and with broken outlines; positive ones are drawn in blue with solid outlines.


The fact that this is a three-way contingency table complicates the interpretation, which is very nicely explained in @rolando2's answer.

Here is an example with a made-up table, resembling the one in the question, to clarify the calculations:

# Made-up counts for a 3 x 2 x 2 table (age x attitude x memory)
tab_df = data.frame(expand.grid(
  age = c("15-24", "25-39", ">40"),
  attitude = c("no", "moderate"),
  memory = c("yes", "no")),
  count = c(1, 4, 3, 1, 8, 39, 32, 36, 25, 35, 32, 38))
(tab = xtabs(count ~ ., data = tab_df))

, , memory = yes
       attitude
age     no moderate
  15-24  1        1
  25-39  4        8
  >40    3       39
, , memory = no
       attitude
age     no moderate
  15-24 32       35
  25-39 36       32
  >40   25       38

summary(tab)
Call: xtabs(formula = count ~ ., data = tab_df)
Number of cases in table: 254 
Number of factors: 3 
Test for independence of all factors:
    Chisq = 78.33, df = 7, p-value = 3.011e-14

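library(vcd)
# Mosaic plot shaded by the Pearson residuals of the independence model
mosaic(~ memory + age + attitude, data = tab, shade = TRUE)
# Same layout, but with tile areas given by the expected counts
expected = mosaic(~ memory + age + attitude, data = tab, type = "expected")
expected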

# As an example, find the expected count for age >40, memory = yes, attitude = moderate.
# Under mutual independence: (age total * memory total * attitude total) / N^2

over_forty = sum(3,39,25,38)
mem_yes = sum(1,4,3,1,8,39)
att_mod = sum(1,8,39,35,32,38)
exp_older_mem_mod = over_forty * mem_yes * att_mod / sum(tab)^2

# Corresponding Pearson residual: (observed - expected) / sqrt(expected)

(39 - exp_older_mem_mod) / sqrt(exp_older_mem_mod) # [1] 6.709703
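
The single-cell calculation can be extended to every cell at once. The following is a minimal sketch using base R's `loglin` to fit the mutual-independence model; the names `indep`, `pearson_res`, `res_cell`, and `chisq_total` are introduced here for illustration, and the table is rebuilt so the block runs on its own.

```r
# Rebuild the made-up table so this block is self-contained
tab_df <- data.frame(expand.grid(
  age = c("15-24", "25-39", ">40"),
  attitude = c("no", "moderate"),
  memory = c("yes", "no")),
  count = c(1, 4, 3, 1, 8, 39, 32, 36, 25, 35, 32, 38))
tab <- xtabs(count ~ ., data = tab_df)

# Fit the mutual-independence model [age][attitude][memory]
indep <- loglin(tab, margin = list(1, 2, 3), fit = TRUE, print = FALSE)

# Pearson residuals for every cell: (observed - expected) / sqrt(expected)
pearson_res <- (tab - indep$fit) / sqrt(indep$fit)
print(round(pearson_res, 2))

# The cell worked out by hand: age >40, attitude moderate, memory yes
res_cell <- pearson_res[">40", "moderate", "yes"]
print(res_cell)

# The squared residuals add up to the Pearson chi-square reported by summary(tab)
chisq_total <- sum(pearson_res^2)
print(c(sum_sq = chisq_total, loglin = indep$pearson, df = indep$df))
```

Cells whose residual exceeds 2 in absolute value are the ones that would be shaded under the default cutoffs.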

[Image: shaded mosaic plot of the made-up table]

It is interesting to compare the graphical representation with the results of a Poisson regression containing only main effects (that is, the mutual-independence log-linear model), which perfectly illustrates the plain-English interpretation in @rolando2's answer:

fit <- glm(count ~ age + attitude + memory, data=tab_df, family=poisson())
summary(fit)

Call:
glm(formula = count ~ age + attitude + memory, family = poisson(), 
    data = tab_df)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)        1.7999     0.1854   9.708  < 2e-16 ***
age25-39           0.1479     0.1643   0.900  0.36794    
age>40             0.4199     0.1550   2.709  0.00674 ** 
attitudemoderate   0.4153     0.1282   3.239  0.00120 ** 
memoryno           1.2629     0.1514   8.344  < 2e-16 ***
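
To tie the regression back to the mosaic plot, the sketch below extracts the Pearson residuals of the same main-effects Poisson model. It assumes the made-up `tab_df` from above (rebuilt here so the block runs on its own); `glm_res`, `cell`, and `comparison` are names introduced only for this illustration.

```r
# Rebuild the made-up data so this block is self-contained
tab_df <- data.frame(expand.grid(
  age = c("15-24", "25-39", ">40"),
  attitude = c("no", "moderate"),
  memory = c("yes", "no")),
  count = c(1, 4, 3, 1, 8, 39, 32, 36, 25, 35, 32, 38))

# Main-effects Poisson model, i.e. mutual independence of the three factors
fit <- glm(count ~ age + attitude + memory, data = tab_df, family = poisson())

# Pearson residuals of the fit: (count - fitted) / sqrt(fitted)
glm_res <- residuals(fit, type = "pearson")

# Observed counts next to fitted (expected) counts and residuals
comparison <- cbind(tab_df,
                    fitted = round(fitted(fit), 2),
                    pearson = round(glm_res, 2))
print(comparison)

# The cell computed by hand earlier: age >40, attitude moderate, memory yes
cell <- with(tab_df, age == ">40" & attitude == "moderate" & memory == "yes")
print(unname(glm_res[cell]))

# Sum of squared residuals, to compare with the chi-square from summary(tab)
print(sum(glm_res^2))
```

The fitted values are the expected counts under independence, so these residuals should agree with the ones used to shade the mosaic tiles.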