Solved – ANOVA multicollinearity adjustment

anova, covariance, least squares, multicollinearity

I am using statsmodels' ols (formula API) to compute an omnibus ANOVA F-test for a within-subjects design with three factors (3 × 2 × 2 levels). The condition number reported for the omnibus model (Cond. No. = 26.2) suggests multicollinearity. My understanding is that this means the model parameters are correlated; is that a correct statement? As a follow-up, what is the appropriate remedy in this case, e.g. a non-parametric alternative or some sort of adjustment?

Edit: The question concerns electrophysiological time-series data with repeated measurements. I have three independent categorical variables (x1, x2, x3) with 3, 2, and 2 levels, respectively. Briefly, I have specific hypotheses about whether auditory stimulation (3 distinct types) and socioeconomic status (low vs. high) modulate neural activity recorded at the scalp from sensor arrays over the two hemispheres (left vs. right).
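Roughly, the model was fit along the following lines (a simplified sketch with synthetic data; the actual column names and preprocessing differ):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Synthetic stand-in for the real data: one row per observation, with the
# scalp measure in `y` and the three categorical factors as columns.
rng = np.random.default_rng(0)
n = 240
df = pd.DataFrame({
    'stimulus':   rng.choice(['type1', 'type2', 'type3'], n),
    'ses':        rng.choice(['low', 'high'], n),
    'hemisphere': rng.choice(['left', 'right'], n),
})
df['y'] = rng.normal(size=n)

# Full-factorial OLS fit; summary() reports the Cond. No. referred to above.
model = ols('y ~ C(stimulus) * C(ses) * C(hemisphere)', data=df).fit()
print(model.summary())

# Omnibus F-tests for main effects and interactions (type II sums of squares).
print(sm.stats.anova_lm(model, typ=2))
```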

The initial omnibus ANOVA indicated a significant interaction between the factors (stimulus × SES × hemisphere), and also multicollinearity on the basis of the Cond. No. metric reported by statsmodels. Prior to this question I was unaware of VIF, so this question overlaps with Multicollinearity when individual regressions are significant, but VIFs are low, though I am unsure what the VIFs are in this case.
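For what it is worth, VIFs can be computed directly from the fitted model's design matrix with statsmodels (continuing the sketch above; with dummy-coded factors and their interactions, some inflation is expected by construction):

```python
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Design matrix of the fitted model (intercept, dummy and interaction columns).
X = model.model.exog
names = model.model.exog_names

# VIF for every column except the intercept.
for i, name in enumerate(names):
    if name == 'Intercept':
        continue
    print(f'{name}: {variance_inflation_factor(X, i):.2f}')
```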

Best Answer

First, multicollinearity indicates that there is a linear relationship among your independent variables. Pairwise correlation is neither a necessary nor a sufficient condition for collinearity (although, with only 3 IVs, it is very hard to have one without the other; with more IVs, it is entirely possible).

Second, if you are deciding between ridge and the lasso, I would go with ridge regression here. See this thread for some notes on ridge regression with categorical variables. Ridge regression deliberately biases the parameter estimates in order to reduce their variance, and it won't (usually) remove variables entirely. The lasso drops some variables from the equation altogether, which is probably not what you want here, especially if the interaction is important.
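As a rough sketch of what that could look like, reusing the data frame df and formula from the question's sketch (the penalty alpha below is purely illustrative and would need to be tuned, e.g. by cross-validation):

```python
from statsmodels.formula.api import ols

# Elastic-net fit with L1_wt=0 is a pure ridge (L2) penalty in statsmodels.
# alpha is an illustrative value, not a recommendation.
ridge_fit = ols('y ~ C(stimulus) * C(ses) * C(hemisphere)', data=df).fit_regularized(
    alpha=1.0, L1_wt=0.0)

# Shrunken (biased, lower-variance) coefficient estimates; nothing is dropped,
# unlike the lasso (L1_wt=1.0), which can zero out coefficients entirely.
print(ridge_fit.params)
```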

Third, I think partial least squares (PLS) is a better solution to collinearity than principal components regression, because PLS also takes the relationship with the dependent variable into account. However, with only three independent variables you are likely to end up with a single component, and I think it is unlikely that this will give you a useful result. Also, see this thread for some notes on PLS with categorical variables.
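Again only as an illustration under the same assumptions (dummy-coding the factors by hand, since scikit-learn's PLSRegression expects a numeric matrix):

```python
import pandas as pd
from sklearn.cross_decomposition import PLSRegression

# Dummy-code the three factors; with so few (and such coarse) predictors, only
# one or two components are realistic, which is the limitation noted above.
X = pd.get_dummies(df[['stimulus', 'ses', 'hemisphere']], drop_first=True).astype(float)

pls = PLSRegression(n_components=2)
pls.fit(X, df['y'])
print(pls.score(X, df['y']))   # R^2 of the PLS fit on the training data
```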

Finally, have you considered regression trees and their offshoots such as random forests?
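If you want to try that, a minimal sketch with scikit-learn, again reusing the dummy-coded factors X from above (hyperparameters are defaults and purely illustrative):

```python
from sklearn.ensemble import RandomForestRegressor

# Trees split on the dummy columns directly and can pick up interactions
# without a linear design matrix, so collinearity among the dummies matters less.
forest = RandomForestRegressor(n_estimators=500, random_state=0)
forest.fit(X, df['y'])

# Crude gauge of which factors the forest relies on.
print(dict(zip(X.columns, forest.feature_importances_)))
```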