Solved – Collinearity between categorical variables

anova, categorical-data, multicollinearity, r, sums-of-squares

There's a lot written about collinearity among continuous predictors, but not much that I can find on collinearity between categorical predictors. I have data of the type illustrated below.

The first factor is a genetic variable (allele count), the second is a disease category. Clearly the genes precede the disease and are a factor in producing the symptoms that lead to a diagnosis. However, a regular analysis using type II or III sums of squares, as would commonly be done in psychology with SPSS, misses the effect. A type I sums-of-squares analysis picks it up when the factors are entered in the appropriate order, because type I is order dependent. Further, there are likely to be extra components of the disease process that are unrelated to the gene, and these are not well identified by type II or III either; compare anova(lm1) below with lm2 or with Anova().

Example data:

set.seed(69)
iv1 <- sample(c(0, 1, 2), 150, replace=TRUE)  # allele count (0, 1, or 2)
iv2 <- round(iv1 + rnorm(150, 0, 1), 0)       # disease category depends on iv1 plus noise
iv2 <- ifelse(iv2 < 0, 0, iv2)                # clamp to the range 0..2
iv2 <- ifelse(iv2 > 2, 2, iv2)
dv  <- iv2 + rnorm(150, 0, 2)                 # outcome driven by disease, not directly by gene
iv2 <- factor(iv2, labels=c("a", "b", "c"))
df1 <- data.frame(dv, iv1, iv2)

library(car)
chisq.test(table(iv1, iv2))      # association between gene and disease category
lm1 <- lm(dv ~ iv1*iv2, df1)     # gene entered first
lm2 <- lm(dv ~ iv2*iv1, df1)     # disease entered first
anova(lm1); anova(lm2)           # type I (sequential) SS: order matters
Anova(lm1, type="II"); Anova(lm2, type="II")  # type II SS: order-invariant
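
To see the order dependence directly, the iv1 row can be pulled out of each sequential ANOVA table (this snippet is my own addition for illustration):

a1 <- anova(lm1)   # iv1 first: credited with the variance shared with iv2
a2 <- anova(lm2)   # iv1 last: only variance beyond iv2, much like type II
a1["iv1", c("Sum Sq", "Pr(>F)")]
a2["iv1", c("Sum Sq", "Pr(>F)")]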
  1. Given the background theory, lm1 with type I SS seems to me the appropriate way to analyse these data. Is my assumption correct?
  2. I'm used to explicitly manipulated orthogonal designs, where these problems don't usually pop up. Will it be difficult to convince reviewers that this is the best approach (assuming point 1 is correct) in an SPSS-centric field?
  3. What should I report in the stats section? Are there extra analyses or comments that should go in?

Best Answer

Collinearity between factors is quite complicated. The classical example arises when you group and dummy-encode the three continuous variables 'age', 'period' and 'year'. The coefficients you get, after removing four (not three) reference levels, are only identified up to an unknown linear trend. This case can be analysed because the collinearity arises from a known linear dependence among the source variables (age + year = period).
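
A minimal sketch of that aliasing in R (my own illustration, assuming the age + year = period dependence described above): lm() flags the extra unidentified coefficient as NA, and alias() displays the dependency.

# Sketch (not from the original answer): the age/period/year aliasing.
# Because period = age + year exactly, the dummy-encoded design matrix
# is rank deficient by one beyond the usual reference-level removals.
set.seed(1)
age    <- sample(1:5, 200, replace = TRUE)
year   <- sample(1:5, 200, replace = TRUE)
period <- age + year                       # exact linear dependence
y      <- rnorm(200)
fit <- lm(y ~ factor(age) + factor(year) + factor(period))
coef(fit)                                  # exactly one coefficient is NA
alias(fit)                                 # shows the aliased combination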

Some work has also been done on spurious collinearity between two factors. The upshot is that collinearity between two categorical variables means the dataset splits into disconnected components, each with its own reference level, and estimated coefficients from different components cannot be compared directly.
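
A small sketch of what such a disconnected design looks like (my own construction, not taken from the literature mentioned above):

# Sketch: two factors whose observed level combinations fall into two
# disconnected blocks: {a,b} x {x} and {c,d} x {y} never mix.
f1 <- factor(rep(c("a", "b", "c", "d"), each = 4))
f2 <- factor(rep(c("x", "y"), each = 8))
table(f1, f2)                  # block-diagonal: the design is disconnected
y  <- rnorm(16)
coef(lm(y ~ f1 + f2))          # one coefficient is NA: the f2 effect is
                               # confounded with the split into components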

For more complicated collinearities between three or more factors, the situation is less tractable. There do exist procedures for finding estimable functions, i.e. linear combinations of the coefficients that are interpretable (a basic rank-based check is sketched below), e.g. in:

  • Godolphin and Godolphin, "On the connectivity of row-column designs", Utilitas Mathematica 60, pp. 51–65

But to my knowledge no general silver bullet exists for handling such collinearities in an intuitive way.
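
As a small illustration of the estimability idea (my own sketch, not a method from the cited paper): a linear combination c'beta is estimable exactly when c lies in the row space of the design matrix, which can be verified with a rank comparison.

# Sketch: c'beta is estimable iff appending c as an extra row to the
# design matrix X does not increase its rank.
is_estimable <- function(X, cvec, tol = 1e-8) {
  qr(X, tol = tol)$rank == qr(rbind(X, cvec), tol = tol)$rank
}

# Using the disconnected two-factor design from the sketch above:
f1 <- factor(rep(c("a", "b", "c", "d"), each = 4))
f2 <- factor(rep(c("x", "y"), each = 8))
X  <- model.matrix(~ f1 + f2)          # columns: (Intercept), f1b, f1c, f1d, f2y
is_estimable(X, c(0, 0, 0, 0, 1))      # f2y alone: FALSE (not estimable)
is_estimable(X, c(0, 0, 1, 0, 1))      # f1c + f2y jointly: TRUE (estimable)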
