If you are using R, SPSS or Stata, you can look at the perturb package. It diagnoses collinearity by adding random noise to continuous variables; for categorical variables, some observations are randomly reassigned to different categories. The documentation for perturb in R notes that the model need not be an lm fit, implying that any model (including ones built with optimal scaling or ordinal logistic regression) could be used.
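As a rough illustration of how a call might look, here is a minimal sketch modeled on the package's documentation example (the Duncan data come from the car package; check ?perturb for the full set of arguments, including pfac for reclassifying categorical variables):

library(car)      # supplies the Duncan example data (and vif)
library(perturb)

# The documented example attaches the data before fitting
attach(Duncan)
m <- lm(prestige ~ income + education)

# Refit the model many times, each time adding random noise of magnitude 1
# to income and education; unstable coefficients across refits signal collinearity
p <- perturb(m, pvars = c("income", "education"), prange = c(1, 1))
summary(p)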
I cannot reproduce exactly this phenomenon, but I can demonstrate that VIF does not necessarily increase as the number of categories increases.
The intuition is simple: categorical variables can be made orthogonal by suitable experimental designs. Therefore, there should in general be no relationship between numbers of categories and multicollinearity.
Here is an R function to create categorical datasets with specifiable numbers of categories (for two independent variables) and a specifiable amount of replication for each category. It represents a balanced study in which every combination of categories is observed an equal number of times, $n$:
library(car)  # vif() lives in the car package

trial <- function(n, k1=2, k2=2) {
  # All k1*k2 combinations of the two categorical variables (Var1, Var2)
  df <- expand.grid(1:k1, 1:k2)
  # Replicate every combination n times: a balanced design
  df <- do.call(rbind, lapply(1:n, function(i) df))
  # Response generated independently of the predictors
  df$y <- rnorm(k1*k2*n)
  fit <- lm(y ~ Var1 + Var2, data=df)
  vif(fit)
}
Applying it, I find the VIFs are always at their lowest possible values, $1$, reflecting the balancing (which translates to orthogonal columns in the design matrix). Some examples:
sapply(1:5, trial) # Two binary categories, 1-5 replicates per combination
sapply(1:5, function(i) trial(i, 10, 3)) # 10 x 3 categories (30 combinations), 1-5 replicates
This suggests the multicollinearity may be growing due to a growing imbalance in the design. To test this, insert the line
df <- subset(df, subset=(y < 0))
before the fit line in trial. This removes roughly half the data at random. Re-running
sapply(1:5, function(i) trial(i, 10, 3))
shows that the VIFs are no longer exactly equal to $1$ (but they remain close to it, varying randomly). They still do not increase with more categories: sapply(1:5, function(i) trial(i, 10, 10)) produces comparable values.
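For reference, the modified function looks like this (identical to the version above except for the added subset line):

trial <- function(n, k1=2, k2=2) {
  df <- expand.grid(1:k1, 1:k2)
  df <- do.call(rbind, lapply(1:n, function(i) df))
  df$y <- rnorm(k1*k2*n)
  # Discard roughly half the rows at random, destroying the balance
  df <- subset(df, subset=(y < 0))
  fit <- lm(y ~ Var1 + Var2, data=df)
  vif(fit)
}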
Best Answer
The most important assumptions to check are those for any multiple regression, as explained for example in Faraway's "Practical Regression and Anova using R," Chapter 7: tests for outliers and influential observations, a plot of residuals versus fitted values (an extremely useful scatter plot that incorporates both the categorical and the continuous predictor), tests of non-linearity and distributions of residuals, and so forth.
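As a rough sketch of those checks in R (the data frame dat and the names y, x, and g are hypothetical stand-ins for the actual response, continuous predictor, and 3-level categorical predictor):

# Hypothetical data: x is continuous, g is a 3-level factor
set.seed(1)
dat <- data.frame(x = rnorm(90), g = factor(rep(c("a", "b", "c"), each = 30)))
dat$y <- 1 + 0.5 * dat$x + as.numeric(dat$g) + rnorm(90)

fit <- lm(y ~ x + g, data = dat)

plot(fitted(fit), resid(fit))  # residuals versus fitted values
plot(fit, which = 2)           # normal Q-Q plot of the residuals
plot(fit, which = 5)           # residuals vs leverage, flags influential observations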
"Multicollinearity" would seem to be a bit of an overstatement with only 2 predictor variables. If you are concerned about collinearity, you could for example see how the values of the continuous predictor are distributed among the 3 levels of the categorical predictor. The Faraway reference noted above discusses collinearity in Chapter 9. As the answer from @jur notes, its practical importance depends on the intended use of the model.