Reason to prefer a specific measure of multicollinearity

When working with many input variables, we are often concerned about multicollinearity. Several measures are used to detect it, reason about it, and communicate it. Some common recommendations are:

  1. The multiple $R^2_j$ for a particular variable (from regressing variable $j$ on all the other predictors)

  2. The tolerance, $1-R^2_j$, for a particular variable

  3. The variance inflation factor, $\text{VIF}=\frac{1}{\text{tolerance}}$, for a particular variable

  4. The condition number of the design matrix as a whole:

    $$\sqrt{\frac{\lambda_{\max}(X'X)}{\lambda_{\min}(X'X)}}$$

    where $\lambda_{\max}(X'X)$ and $\lambda_{\min}(X'X)$ are the largest and smallest eigenvalues of $X'X$.

(There are some other options discussed in the Wikipedia article, and here on SO in the context of R.)
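As a rough illustration, the four measures listed above can be sketched in NumPy as follows. The function name `collinearity_measures` and the simulated data are hypothetical. Scaling each column to unit length before taking the eigenvalues is an assumption made here so that the condition number does not depend on the units of the variables; the formula above does not specify any scaling.

```python
import numpy as np

def collinearity_measures(X):
    """Return (R^2_j, tolerance, VIF, condition number) for predictors X.

    X is an (n, p) array of predictors without an intercept column.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    r2 = np.empty(p)
    for j in range(p):
        y = X[:, j]
        # Regress variable j on an intercept plus all the other predictors.
        A = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ beta
        r2[j] = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    tolerance = 1.0 - r2
    vif = 1.0 / tolerance
    # Assumption: scale columns to unit length before forming X'X.
    Xs = X / np.linalg.norm(X, axis=0)
    eig = np.linalg.eigvalsh(Xs.T @ Xs)
    cond = np.sqrt(eig.max() / eig.min())
    return r2, tolerance, vif, cond

# Simulated example: x3 is nearly the sum of x1 and x2.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2 + 0.1 * rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

r2, tolerance, vif, cond = collinearity_measures(X)
print("R^2_j:    ", r2)
print("tolerance:", tolerance)
print("VIF:      ", vif)
print("condition number:", cond)
```

The first three outputs are one number per variable, while the condition number is a single value for the whole matrix.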

Since the first three are deterministic functions of one another, the only possible advantage of one over another would be psychological. On the other hand, all three let you examine variables individually, which might be an advantage over the condition number. Still, I have heard that the condition number method is considered best.

  • Is this true? Best for what?
  • Is the condition number a perfect function of the $R^2_j$'s? (I would think it would be.)
  • Do people find that one of them is easiest to explain? (I've never tried to explain these numbers outside of class; I just give a loose, qualitative description of multicollinearity.)

Best Answer

Back in the late 1990s, I did my dissertation on collinearity.

My conclusion was that condition indexes were best.

The main reason was that, rather than looking at individual variables, they let you look at sets of variables. Since collinearity is a property of sets of variables, this is a good thing.
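To make the "sets of variables" point concrete, here is a minimal sketch of condition indexes together with variance-decomposition proportions, in the spirit of the diagnostics described in Belsley's books. It is not the code from the dissertation. The function name `condition_indexes`, the simulated data, and the choice to scale columns to unit length without centering are all assumptions for illustration.

```python
import numpy as np

def condition_indexes(X):
    """Condition indexes and variance-decomposition proportions.

    X is an (n, p) design matrix (here including the intercept column).
    Returns (indexes, props), where props[k, j] is the share of the
    variance of coefficient j associated with singular value k.
    """
    X = np.asarray(X, dtype=float)
    # Assumption: scale each column to unit length, without centering.
    Xs = X / np.linalg.norm(X, axis=0)
    _, s, vt = np.linalg.svd(Xs, full_matrices=False)
    # One condition index per singular value; the smallest index is 1.
    indexes = s.max() / s
    # var(b_j) is proportional to sum_k V[j, k]^2 / s_k^2, and vt[k, j] = V[j, k].
    phi = vt ** 2 / s[:, None] ** 2
    props = phi / phi.sum(axis=0)  # each column sums to 1
    return indexes, props

# Simulated example: x3 is nearly x1 + x2; x4 is unrelated.
rng = np.random.default_rng(1)
n = 200
x1, x2, x4 = rng.normal(size=(3, n))
x3 = x1 + x2 + 0.05 * rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2, x3, x4])

indexes, props = condition_indexes(X)
np.set_printoptions(precision=3, suppress=True)
print("condition indexes:", indexes)
print("variance proportions (rows: indexes, columns: const, x1, x2, x3, x4):")
print(props)
```

In the printed table, a row with a large condition index in which several variables have high proportions points to the set of variables involved in a near-dependency, which per-variable measures such as the VIF do not identify directly.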

Also, the results of my Monte Carlo study showed that condition indexes had better sensitivity to problematic collinearity, but I have long since forgotten the details.

On the other hand, it is probably the hardest to explain. Lots of people know what $R^2$ is. Only a small subset of those people have heard of eigenvalues. However, when I have used condition indexes as a diagnostic tool, I have never been asked for an explanation.

For much more on this, check out the books by David Belsley. Or, if you really want to, you can get my dissertation, "Multicollinearity diagnostics for multiple regression: A Monte Carlo study".