Solved – What are the merits of different approaches to detecting collinearity

least-squares, multicollinearity, multiple-regression, references, variance-inflation-factor

I want to detect whether collinearity is a problem in my OLS regression. I understand that variance inflation factors and the condition index are two commonly used measures, but am finding it difficult to find anything definite on the merits of each approach, or what the scores should be.

A prominent source indicating which approach to use, and/or what threshold scores are appropriate, would be very useful.

A similar question was asked at "Is there a reason to prefer a specific measure of multicollinearity?" but I'm ideally after a reference that I can cite.

Best Answer

Belsley, Kuh, and Welsch's *Regression Diagnostics: Identifying Influential Data and Sources of Collinearity* (Wiley, 1980) is the text to go to for this kind of question. They include extensive discussion of older diagnostics in a section entitled "Historical Perspective". Concerning VIF they write:

... If we assume the $X$ data have been centered and scaled to have unit length, the correlation matrix $R$ is simply $X^\prime X$. ...

We are considering $R^{-1} = (X^\prime X)^{-1}$. The diagonal elements of $R^{-1}$, the $r^{ii}$, are often called the variance inflation factors, $\text{VIF}_i$, and their diagnostic value follows from the relation $$\text{VIF}_i = \frac{1}{1 - R_i^2}$$ where $R_i^2$ is the multiple correlation coefficient of $X_i$ regressed on the remaining explanatory variables. Clearly a high VIF indicates an $R_i^2$ near unity, and hence points to collinearity. This measure is therefore of some use as an overall indication of collinearity. Its weaknesses, like those of $R$, lie in its inability to distinguish among several coexisting near dependencies and in the lack of a meaningful boundary to distinguish between values of VIF that can be considered high and those that can be considered low.

In place of analyzing $R$ (or $R^{-1}$), BKW propose careful, controlled examination of the Singular Value Decomposition of $X$. They motivate it by demonstrating that the ratio of the largest to the smallest singular values is the condition number of $X$ and show how the condition number provides (at times tight) bounds on the propagation of computing errors in the calculation of the regression estimates. They go on to attempt an approximate decomposition of the variances of the parameter estimates $\hat\beta_i$ into components associated with the singular values. The power of this decomposition lies in its ability (in many cases) to reveal the nature of the collinearity, rather than just indicating its presence.
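The condition number itself is one line of numpy once the columns are scaled to unit length (BKW work with column-equilibrated data; the simulated variables here are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.05 * rng.normal(size=n)  # near dependency
X = np.column_stack([x1, x2, x3])

# Scale each column to unit length before computing singular values,
# so the condition number is not an artifact of the variables' units.
Xs = X / np.linalg.norm(X, axis=0)

# Condition number = ratio of largest to smallest singular value.
s = np.linalg.svd(Xs, compute_uv=False)
condition_number = s.max() / s.min()
print(condition_number)
```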

Anyone who has built regression models with hundreds of variables will appreciate this feature! It is one thing for the software to say "your data are collinear, I cannot proceed" or even to say "your data are collinear, I'm throwing out the following variables." It is altogether a much more useful thing for it to be able to say "the group of variables $X_{i_1}, \ldots, X_{i_k}$ is causing instabilities in the calculations: see which of those variables you can do without or consider performing a principal components analysis to reduce their number."

Ultimately, BKW recommend diagnosing collinearity by means of

... the following double condition:

  1. A singular value judged to have a high condition index, and which is associated with
  2. High variance-decomposition proportions for two or more estimated regression coefficient variances.

The number of condition indexes deemed large (say, greater than $30$) in (1) identifies the number of near dependencies among the columns of the data matrix $X$, and the magnitudes of these high condition indexes provide a measure of their relative "tightness." Furthermore, the determination in (2) of large variance-decomposition proportions (say, greater than $0.5$) associated with each high condition index identifies those variates that are involved in the corresponding near dependency, and the magnitude of these proportions in conjunction with the high condition index provides a measure of the degree to which the corresponding regression estimate has been degraded by the presence of collinearity.
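BKW's double condition can be sketched in a few lines of linear algebra. With the SVD $X = UDV^\prime$, the variance of $\hat\beta_k$ is proportional to $\sum_j v_{kj}^2/d_j^2$, and the variance-decomposition proportion for coefficient $k$ and singular value $j$ is that term's share of the sum. A minimal illustration with simulated data (the thresholds 30 and 0.5 are the rules of thumb quoted above):

```python
import numpy as np

rng = np.random.default_rng(2)

n = 200
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
x3 = x1 + x2 + 0.05 * rng.normal(size=n)  # one near dependency
X = np.column_stack([x1, x2, x3])
Xs = X / np.linalg.norm(X, axis=0)  # column-equilibrated

U, d, Vt = np.linalg.svd(Xs, full_matrices=False)
cond_idx = d.max() / d  # condition index of each singular value

# phi[k, j] = v_kj^2 / d_j^2; the variance-decomposition proportion
# pi[j, k] normalizes phi over j for each coefficient k.
phi = (Vt.T ** 2) / d ** 2
pi = (phi / phi.sum(axis=1, keepdims=True)).T

# BKW's double condition: a high condition index (> 30) together
# with high proportions (> 0.5) for two or more coefficients.
for j in np.argsort(cond_idx):
    print(f"index {cond_idx[j]:8.1f}  proportions {np.round(pi[j], 3)}")
```

In this example the row with the largest condition index should carry high proportions for all three coefficients, correctly identifying the group of variables involved in the near dependency.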