Solved – reason to prefer a specific measure of multicollinearity


When working with many input variables, we are often concerned about multicollinearity. Several measures are used to detect, reason about, and communicate it. Some common recommendations are:

The multiple $R^2_j$ for a particular variable
The tolerance, $1-R^2_j$, for a particular variable
The variance inflation factor, $\text{VIF}=\frac{1}{\text{tolerance}}$, for a particular variable
The condition number of the design matrix as a whole:

$$\sqrt{\frac{\lambda_{max}(X'X)}{\lambda_{min}(X'X)}}$$

(There are some other options discussed in the Wikipedia article, and here on SO in the context of R.)
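As a rough illustration of the four measures listed above, here is a minimal base-R sketch. The data are simulated and the variable names (`x1`, `x2`, `x3`) are hypothetical; the condition number is computed on the raw predictor columns without an intercept or rescaling, which is only one of several conventions.

```r
# Simulated predictors (hypothetical data): x3 is nearly x1 + x2
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.1)
X  <- cbind(x1 = x1, x2 = x2, x3 = x3)

# R^2_j: regress each column on all the others (with an intercept)
r2 <- sapply(seq_len(ncol(X)), function(j) {
  summary(lm(X[, j] ~ X[, -j]))$r.squared
})
names(r2) <- colnames(X)

tolerance <- 1 - r2        # tolerance for each variable
vif       <- 1 / tolerance # variance inflation factor for each variable

# Condition number from the eigenvalues of X'X
ev       <- eigen(crossprod(X), symmetric = TRUE, only.values = TRUE)$values
cond_num <- sqrt(max(ev) / min(ev))

print(rbind(r2 = r2, tolerance = tolerance, vif = vif))
print(cond_num)
```

The first three rows of the printed table carry the same information per variable, while the condition number is a single figure for the whole matrix.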

The first three are deterministic functions of one another, which suggests that any advantage of one over the others is purely psychological. On the other hand, the first three let you examine variables individually, which might be an advantage, but I have heard that the condition number method is considered best.

Is this true? Best for what?
Is the condition number a perfect function of the $R^2_j$'s? (I would think it would be.)
Do people find that one of them is easiest to explain? (I've never tried to explain these numbers outside of class, I just give a loose, qualitative description of multicollinearity.)
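One small check that bears on the second question: with the formula as written on the raw $X'X$, the condition number changes when a column is rescaled, whereas the $R^2_j$ (and hence the VIFs) do not. The sketch below uses simulated data, and the helpers `vif_of` and `cond_of` are hypothetical names introduced only for this illustration. It says nothing about what happens once columns are put on a common scale.

```r
set.seed(2)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.5 * x1 + rnorm(n)

X          <- cbind(x1 = x1, x2 = x2)
X_rescaled <- cbind(x1 = x1, x2 = 1000 * x2)  # same variable, different units

# VIFs via the inverse correlation matrix (equivalent to 1 / (1 - R^2_j))
vif_of <- function(X) diag(solve(cor(X)))

# Condition number of X'X as in the question's formula
cond_of <- function(X) {
  ev <- eigen(crossprod(X), symmetric = TRUE, only.values = TRUE)$values
  sqrt(max(ev) / min(ev))
}

print(rbind(original = vif_of(X), rescaled = vif_of(X_rescaled)))
print(c(original = cond_of(X), rescaled = cond_of(X_rescaled)))
```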

Best Answer

Back in the late 1990s, I did my dissertation on collinearity.

My conclusion was that condition indexes were best.

The main reason was that, rather than looking at individual variables, they let you look at sets of variables. Since collinearity is a property of sets of variables, this is a good thing.
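To make the "sets of variables" point concrete, here is a minimal sketch of condition indexes with variance-decomposition proportions, in the style associated with Belsley's books. It is an illustration on simulated data, not the procedure from the dissertation; the column scaling (unit length, uncentered, intercept included) is an assumption of this sketch.

```r
set.seed(3)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- x1 + x2 + rnorm(n, sd = 0.05)  # near-dependency among x1, x2, x3
x4 <- rnorm(n)                       # unrelated predictor
X  <- cbind(Intercept = 1, x1, x2, x3, x4)

# Scale each column to unit length (not centered), then take the SVD
Xs <- sweep(X, 2, sqrt(colSums(X^2)), "/")
sv <- svd(Xs)

# Condition indexes: largest singular value divided by each singular value
cond_index <- max(sv$d) / sv$d

# Variance-decomposition proportions
phi <- sweep(sv$v^2, 2, sv$d^2, "/")  # phi[j, k] = v_jk^2 / d_k^2
vdp <- t(phi / rowSums(phi))          # rows = condition indexes, columns = coefficients
dimnames(vdp) <- list(as.character(round(cond_index, 1)), colnames(X))

print(round(vdp, 3))
```

In the printed table, a row with a large condition index and large proportions for several coefficients at once points to the set of variables involved in that near-dependency.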

Also, the results of my Monte Carlo study showed better sensitivity to problematic collinearity, but I have long since forgotten the details.

On the other hand, it is probably the hardest to explain. Lots of people know what $R^2$ is. Only a small subset of those people have heard of eigenvalues. However, when I have used condition indexes as a diagnostic tool, I have never been asked for an explanation.

For much more on this, check out books by David Belsley. Or, if you really want to, you can get my dissertation, "Multicollinearity diagnostics for multiple regression: A Monte Carlo study".

Related Solutions

Solved – Understanding condition index used for finding multicollinearity

Your thinking is basically correct.

Let $Z$ be an $M \times N$ matrix, i.e., $N$ observations of $M$ random variables (or features).

A condition number that "equals infinity" implies that, in every one of the $N$ observations, at least one of the $M$ variables can be written as a weighted sum of the other $(M-1)$ variables. That defines exact multicollinearity.

Appendix: $\lambda_{min} = 0$ implies that there exists a unit-norm eigenvector $q$ (so $q^Tq = 1$) such that

$$ZZ^Tq = \lambda_{min}q = 0$$

$$\Rightarrow q^TZZ^Tq = \lambda_{min}q^Tq = \lambda_{min} \cdot 1 = 0.$$

Since $q^T ZZ^T q = \|Z^T q\|^2$ and this quantity equals $0$,

$$Z^T q = 0$$

which implies that the nullspace of $Z^T$ is non-trivial.

I realize this doesn't address the case when $0 < \lambda_{min} \ll \lambda_{max}$, i.e. approximate multicollinearity.
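For the approximate case, the same identity $q^T ZZ^T q = \lambda_{min}$ (for unit-norm $q$) still says something useful: a small $\lambda_{min}$ means $Z^T q$ is close to zero. The sketch below checks this numerically on simulated data; the dimensions and the near-dependency built into row 4 are assumptions chosen for illustration.

```r
set.seed(4)
M <- 4   # variables (rows of Z)
N <- 50  # observations (columns of Z)
Z <- matrix(rnorm(M * N), nrow = M)
Z[4, ] <- Z[1, ] + Z[2, ] + rnorm(N, sd = 0.01)  # row 4 is almost rows 1 + 2

e          <- eigen(Z %*% t(Z), symmetric = TRUE)
lambda_min <- min(e$values)
lambda_max <- max(e$values)
q          <- e$vectors[, which.min(e$values)]   # unit-norm eigenvector for lambda_min

residual <- drop(crossprod(Z, q))                # Z^T q, one entry per observation

print(c(lambda_min = lambda_min, sum_sq_residual = sum(residual^2)))
print(sqrt(lambda_max / lambda_min))             # condition number
print(round(q, 3))                               # weights of the near-dependency
```

The entries of `q` give the weights of the almost-exact linear relation among the variables.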

Solved – Is multicollinearity implicit in categorical variables

I cannot reproduce this phenomenon exactly, but I can demonstrate that VIF does not necessarily increase as the number of categories increases.

The intuition is simple: categorical variables can be made orthogonal by suitable experimental designs. Therefore, there should in general be no relationship between numbers of categories and multicollinearity.

Here is an R function to create categorical datasets with a specifiable number of categories (for two independent variables) and a specifiable amount of replication for each category. It represents a balanced study in which every combination of categories is observed an equal number of times, $n$:

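library(car)  # vif() is assumed to come from the car package

# Fit a balanced two-way layout with n replicates per combination; return the VIFs.
# Note: expand.grid() yields integer columns Var1 and Var2, which lm() treats as
# numeric unless they are wrapped in factor().
trial <- function(n, k1=2, k2=2) {
  df <- expand.grid(1:k1, 1:k2)                      # all k1*k2 combinations
  df <- do.call(rbind, lapply(1:n, function(i) df))  # replicate each n times
  df$y <- rnorm(k1*k2*n)                             # pure-noise response
  fit <- lm(y ~ Var1+Var2, data=df)
  vif(fit)
}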

Applying it, I find the VIFs are always at their lowest possible values, $1$, reflecting the balancing (which translates to orthogonal columns in the design matrix). Some examples:

sapply(1:5, trial) # Two binary categories, 1-5 replicates per combination
sapply(1:5, function(i) trial(i, 10, 3)) # 30 categories, 1-5 replicates
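The link between balance and orthogonality can also be checked directly in base R, without any VIF function. The sketch below is an illustration under its own assumptions: it codes `Var1` and `Var2` explicitly as factors, picks arbitrary sizes (`k1 = 3`, `k2 = 4`, `n = 2`), and verifies that the centered dummy columns of one factor are orthogonal to those of the other.

```r
# Balanced two-factor layout: every combination appears n times
k1 <- 3; k2 <- 4; n <- 2
df <- expand.grid(Var1 = factor(1:k1), Var2 = factor(1:k2))
df <- df[rep(seq_len(nrow(df)), times = n), ]

# Dummy-coded design matrix without the intercept, then centered
X  <- model.matrix(~ Var1 + Var2, data = df)[, -1]
Xc <- scale(X, center = TRUE, scale = FALSE)

# Cross-products between the Var1 dummies and the Var2 dummies
is_var1 <- grepl("^Var1", colnames(X))
cross   <- crossprod(Xc[, is_var1], Xc[, !is_var1])

print(round(cross, 12))
```

Dummies belonging to the same factor are still correlated with each other, so the orthogonality here is between the two factors, not between every pair of columns.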

This suggests the multicollinearity may be growing due to a growing imbalance in the design. To test this, insert the line

  df <- subset(df, subset=(y < 0))

before the fit <- lm(...) line in trial. This removes about half the data at random. Re-running

sapply(1:5, function(i) trial(i, 10, 3))

shows that the VIFs are no longer equal to $1$ (but they remain close to it, randomly). They still do not increase with more categories: sapply(1:5, function(i) trial(i, 10, 10)) produces comparable values.
