I cannot reproduce this phenomenon exactly, but I can demonstrate that the VIF does not necessarily increase as the number of categories increases.
The intuition is simple: categorical variables can be made orthogonal by suitable experimental designs. Therefore, there should in general be no relationship between the number of categories and multicollinearity.
Here is an R function to create categorical datasets with specifiable numbers of categories (for the two independent variables) and a specifiable amount of replication for each category. It represents a balanced study in which every combination of categories is observed an equal number of times, $n$:
library(car)   # provides vif()

trial <- function(n, k1=2, k2=2) {
  # All combinations of the two variables, with k1 and k2 categories (columns Var1, Var2)
  df <- expand.grid(1:k1, 1:k2)
  # Replicate every combination n times: a balanced design
  df <- do.call(rbind, lapply(1:n, function(i) df))
  df$y <- rnorm(k1*k2*n)            # response unrelated to the design
  fit <- lm(y ~ Var1 + Var2, data=df)
  vif(fit)
}
Applying it, I find the VIFs are always at their lowest possible value, $1$, reflecting the balance (which translates into orthogonal columns in the design matrix). Some examples:
sapply(1:5, trial) # Two binary categories, 1-5 replicates per combination
sapply(1:5, function(i) trial(i, 10, 3)) # 10 x 3 categories (30 combinations), 1-5 replicates
This suggests the multicollinearity may be growing due to a growing imbalance in the design. To test this, insert the line
df <- subset(df, subset=(y < 0))
before the fit line in trial. This removes roughly half the data, effectively at random. Re-running
sapply(1:5, function(i) trial(i, 10, 3))
shows that the VIFs are no longer equal to $1$ (although they remain close to it, varying randomly). They still do not increase with more categories:
sapply(1:5, function(i) trial(i, 10, 10))
produces comparable values.
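For concreteness, here is a sketch of the modified function; the name trial.unbalanced is only for illustration, and its body is just trial from above with the subsetting line inserted before the fit:

trial.unbalanced <- function(n, k1=2, k2=2) {
  df <- expand.grid(1:k1, 1:k2)
  df <- do.call(rbind, lapply(1:n, function(i) df))
  df$y <- rnorm(k1*k2*n)
  df <- subset(df, subset=(y < 0))   # discard about half the rows, unbalancing the design
  fit <- lm(y ~ Var1 + Var2, data=df)
  vif(fit)
}
sapply(1:5, function(i) trial.unbalanced(i, 10, 3))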
There are two points here:
The passage recommends transforming IVs to linearity only when there is evidence of nonlinearity. Nonlinear relationships among IVs can also cause collinearity (a brief illustration follows these two points) and, more centrally, may complicate other relationships. I am not sure I agree with the book's advice, but it is not silly.
Certainly very strong linear relationships can be causes of collinearity, but high correlations are neither necessary nor sufficient to cause problematic collinearity. A good method of diagnosing collinearity is the condition index.
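To illustrate the first point, here is a small made-up example (the variable names and values are purely hypothetical): one IV is a noisy quadratic function of another, yet over the observed range the pair still shows strong linear correlation and inflated VIFs.

set.seed(17)
x <- runif(100, 1, 10)
z <- x^2 + rnorm(100, sd=5)   # z is a nonlinear (quadratic) function of x
w <- rnorm(100)               # response unrelated to the IVs
cor(x, z)                     # high linear correlation despite the curvature
vif(lm(w ~ x + z))            # correspondingly inflated VIFs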
EDIT in response to comment
Condition indexes are described briefly here as the "square root of the maximum eigenvalue divided by the minimum eigenvalue", that is, the square root of the ratio of the largest to the smallest eigenvalue of $X'X$, where the columns of the design matrix $X$ have first been scaled to unit length. There are quite a few posts here on CV that discuss them and their merits. The seminal texts on them are two books by David Belsley: Conditioning Diagnostics and Regression Diagnostics (which has a new edition, 2005, as well).
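As a rough illustration of that definition (a sketch only, not Belsley's full suite of diagnostics; the helper name condition.index is just for exposition), the condition index of a fitted model's design matrix can be computed along these lines:

condition.index <- function(fit) {
  X <- model.matrix(fit)
  X <- scale(X, center=FALSE, scale=sqrt(colSums(X^2)))   # columns scaled to unit length, not centered
  ev <- eigen(crossprod(X), symmetric=TRUE, only.values=TRUE)$values
  sqrt(max(ev) / min(ev))   # square root of the largest-to-smallest eigenvalue ratio
}

Following Belsley, the columns are scaled but not centered, so the intercept column participates in the diagnostic.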
Consider the simplest case where $Y$ is regressed against $X$ and $Z$ and where $X$ and $Z$ are highly positively correlated. Then the effect of $X$ on $Y$ is hard to distinguish from the effect of $Z$ on $Y$ because any increase in $X$ tends to be associated with an increase in $Z$.
Another way to look at this is to consider the equation. If we write $Y = b_0 + b_1X + b_2Z + e$, then the coefficient $b_1$ is the increase in $Y$ for every unit increase in $X$ while holding $Z$ constant. But in practice it is often impossible to hold $Z$ constant, and the positive correlation between $X$ and $Z$ means that a unit increase in $X$ is usually accompanied by some increase in $Z$ at the same time.
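A quick simulation of this two-variable case (purely illustrative, with made-up coefficients and sample size) shows the main symptom: the standard errors of the estimates balloon when $X$ and $Z$ are strongly positively correlated.

library(car)   # for vif()
set.seed(42)
n <- 200
x  <- rnorm(n)
z1 <- rnorm(n)               # essentially uncorrelated with x
z2 <- x + rnorm(n, sd=0.1)   # highly positively correlated with x
y1 <- 1 + 2*x + 3*z1 + rnorm(n)
y2 <- 1 + 2*x + 3*z2 + rnorm(n)
summary(lm(y1 ~ x + z1))$coefficients   # modest standard errors
summary(lm(y2 ~ x + z2))$coefficients   # much larger standard errors for x and z2
vif(lm(y2 ~ x + z2))                    # large VIFs reflect the collinearity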
A similar but more complicated explanation holds for other forms of multicollinearity.