Solved – Collinearity between categorical variables

anova, categorical-data, multicollinearity, r, sums-of-squares

There's a lot written about collinearity among continuous predictors, but not much that I can find on collinearity between categorical predictors. I have data of the type illustrated below.

The first factor is a genetic variable (allele count), the second is a disease category. Clearly the genes precede the disease and are a factor in producing the symptoms that lead to a diagnosis. However, a regular analysis using type II or III sums of squares, as would commonly be done in psychology with SPSS, misses the effect. A type I sums-of-squares analysis picks it up when the factors are entered in the appropriate order, because type I is order dependent. Further, there are likely to be extra components of the disease process that are unrelated to the gene, and these are not well identified by type II or III either; compare anova(lm1) below with lm2 or with Anova().

Example data:

set.seed(69)
iv1 <- sample(c(0, 1, 2), 150, replace=TRUE)  # allele count (0, 1, or 2)
iv2 <- round(iv1 + rnorm(150, 0, 1), 0)       # disease category depends on iv1 plus noise
iv2 <- ifelse(iv2 < 0, 0, iv2)                # clamp to the range 0..2
iv2 <- ifelse(iv2 > 2, 2, iv2)
dv  <- iv2 + rnorm(150, 0, 2)                 # outcome driven by disease, not directly by gene
iv2 <- factor(iv2, labels=c("a", "b", "c"))
df1 <- data.frame(dv, iv1, iv2)

library(car)
chisq.test(table(iv1, iv2))      # association between gene and disease category
lm1 <- lm(dv ~ iv1*iv2, df1)     # gene entered first
lm2 <- lm(dv ~ iv2*iv1, df1)     # disease entered first
anova(lm1); anova(lm2)           # type I (sequential) SS: order matters
Anova(lm1, type="II"); Anova(lm2, type="II")  # type II SS: order-invariant
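
To see the order dependence directly, the iv1 row can be pulled out of each sequential ANOVA table (this snippet is my own addition for illustration):

a1 <- anova(lm1)   # iv1 first: credited with the variance shared with iv2
a2 <- anova(lm2)   # iv1 last: only variance beyond iv2, much like type II
a1["iv1", c("Sum Sq", "Pr(>F)")]
a2["iv1", c("Sum Sq", "Pr(>F)")]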
  1. Given the background theory, lm1 with type I SS seems to me the appropriate way to analyse these data. Is my assumption correct?
  2. I'm used to explicitly manipulated orthogonal designs, where these problems don't usually pop up. Will it be difficult to convince reviewers that this is the best approach (assuming point 1 is correct) in an SPSS-centric field?
  3. What should I report in the stats section? Are there extra analyses or comments that should go in?

Best Answer

Collinearity between factors is quite complicated. The classical example arises when you group and dummy-encode the three continuous variables 'age', 'period' and 'year'. The coefficients you get, after removing four (not three) reference levels, are only identified up to an unknown linear trend. This case can be analysed because the collinearity arises from a known linear dependence among the source variables (age + year = period).
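
A minimal sketch of that aliasing in R (my own illustration, assuming the age + year = period dependence described above): lm() flags the extra unidentified coefficient as NA, and alias() displays the dependency.

# Sketch (not from the original answer): the age/period/year aliasing.
# Because period = age + year exactly, the dummy-encoded design matrix
# is rank deficient by one beyond the usual reference-level removals.
set.seed(1)
age    <- sample(1:5, 200, replace = TRUE)
year   <- sample(1:5, 200, replace = TRUE)
period <- age + year                       # exact linear dependence
y      <- rnorm(200)
fit <- lm(y ~ factor(age) + factor(year) + factor(period))
coef(fit)                                  # exactly one coefficient is NA
alias(fit)                                 # shows the aliased combination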

Some work has also been done on spurious collinearity between two factors. The upshot is that collinearity between two categorical variables means the dataset splits into disconnected components, each with its own reference level, and estimated coefficients from different components cannot be compared directly.
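
A small sketch of what such a disconnected design looks like (my own construction, not taken from the literature mentioned above):

# Sketch: two factors whose observed level combinations fall into two
# disconnected blocks: {a,b} x {x} and {c,d} x {y} never mix.
f1 <- factor(rep(c("a", "b", "c", "d"), each = 4))
f2 <- factor(rep(c("x", "y"), each = 8))
table(f1, f2)                  # block-diagonal: the design is disconnected
y  <- rnorm(16)
coef(lm(y ~ f1 + f2))          # one coefficient is NA: the f2 effect is
                               # confounded with the split into components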

For more complicated collinearities between three or more factors, the situation is less tractable. There do exist procedures for finding estimable functions, i.e. linear combinations of the coefficients that are interpretable (a basic rank-based check is sketched below), e.g. in:

  • Godolphin and Godolphin, "On the connectivity of row-column designs", Utilitas Mathematica 60, pp. 51–65

But to my knowledge no general silver bullet exists for handling such collinearities in an intuitive way.
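
As a small illustration of the estimability idea (my own sketch, not a method from the cited paper): a linear combination c'beta is estimable exactly when c lies in the row space of the design matrix, which can be verified with a rank comparison.

# Sketch: c'beta is estimable iff appending c as an extra row to the
# design matrix X does not increase its rank.
is_estimable <- function(X, cvec, tol = 1e-8) {
  qr(X, tol = tol)$rank == qr(rbind(X, cvec), tol = tol)$rank
}

# Using the disconnected two-factor design from the sketch above:
f1 <- factor(rep(c("a", "b", "c", "d"), each = 4))
f2 <- factor(rep(c("x", "y"), each = 8))
X  <- model.matrix(~ f1 + f2)          # columns: (Intercept), f1b, f1c, f1d, f2y
is_estimable(X, c(0, 0, 0, 0, 1))      # f2y alone: FALSE (not estimable)
is_estimable(X, c(0, 0, 1, 0, 1))      # f1c + f2y jointly: TRUE (estimable)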
