I cannot reproduce exactly this phenomenon, but I can demonstrate that VIF does not necessarily increase as the number of categories increases.
The intuition is simple: categorical variables can be made orthogonal by suitable experimental designs. Therefore, there should in general be no relationship between numbers of categories and multicollinearity.
Here is an R function to create categorical datasets with specifiable numbers of categories (for two independent variables) and a specifiable amount of replication for each combination. It represents a balanced study in which every combination of categories is observed an equal number of times, $n$:
library(car)  # provides vif()

trial <- function(n, k1=2, k2=2) {
  # One row per combination of the k1 and k2 category codes
  df <- expand.grid(1:k1, 1:k2)
  # Replicate the full design n times: a balanced study
  df <- do.call(rbind, lapply(1:n, function(i) df))
  # Response drawn independently of the design
  df$y <- rnorm(k1*k2*n)
  fit <- lm(y ~ Var1 + Var2, data=df)
  vif(fit)
}
Applying it, I find the VIFs are always at their lowest possible value, $1$, reflecting the balance (which translates to orthogonal columns in the design matrix). Some examples:
sapply(1:5, trial) # Two binary categories, 1-5 replicates per combination
sapply(1:5, function(i) trial(i, 10, 3)) # 10 x 3 categories (30 combinations), 1-5 replicates
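The parenthetical claim about orthogonality can be checked directly. The following sketch is my own illustration (not part of the function above): it builds one replicate of a balanced design, codes the variables as factors, and shows that the centered dummy columns of one factor are orthogonal to those of the other, which is exactly why the VIFs equal $1$.

# Sketch: in a balanced design, the centered indicator columns for Var1
# are orthogonal to those for Var2.
df <- expand.grid(Var1=factor(1:3), Var2=factor(1:4))  # one replicate, balanced
X  <- model.matrix(~ Var1 + Var2, data=df)[, -1]       # drop the intercept
Xc <- scale(X, center=TRUE, scale=FALSE)               # center each column
round(crossprod(Xc[, 1:2], Xc[, 3:5]), 12)             # every entry is 0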
This suggests the multicollinearity may be growing due to a growing imbalance in the design. To test this, insert the line

df <- subset(df, subset=(y < 0))

before the fit line in trial. This removes roughly half the data at random. Re-running
sapply(1:5, function(i) trial(i, 10, 3))
shows that the VIFs are no longer equal to $1$ (but they remain close to it, varying randomly). They still do not increase with more categories:

sapply(1:5, function(i) trial(i, 10, 10))

produces comparable values.
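For reference, here is the modified function in full; the only change from trial is the inserted subset line, and the name trial2 is mine, to distinguish the two versions:

trial2 <- function(n, k1=2, k2=2) {
  df <- expand.grid(1:k1, 1:k2)
  df <- do.call(rbind, lapply(1:n, function(i) df))
  df$y <- rnorm(k1*k2*n)
  # Remove roughly half the rows at random, unbalancing the design
  df <- subset(df, subset=(y < 0))
  fit <- lm(y ~ Var1 + Var2, data=df)
  vif(fit)
}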
Best Answer
Belsley, Kuh, and Welsch (BKW) is the text to go to for this kind of question. They include extensive discussion of older diagnostics, among them the VIF, in a section entitled "Historical Perspective".
In place of analyzing $R$ (or $R^{-1}$), BKW propose careful, controlled examination of the Singular Value Decomposition of $X$. They motivate it by demonstrating that the ratio of the largest to the smallest singular values is the condition number of $X$ and show how the condition number provides (at times tight) bounds on the propagation of computing errors in the calculation of the regression estimates. They go on to attempt an approximate decomposition of the variances of the parameter estimates $\hat\beta_i$ into components associated with the singular values. The power of this decomposition lies in its ability (in many cases) to reveal the nature of the collinearity, rather than just indicating its presence.
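To make this concrete, here is a minimal sketch of the computation, my own illustration of the BKW procedure rather than code from the book, assuming a fitted lm object fit. It scales the design-matrix columns to unit length (as BKW prescribe), takes the SVD, and forms the condition indexes and the variance-decomposition proportions:

# Sketch of BKW-style collinearity diagnostics for a fitted model `fit`.
X <- model.matrix(fit)
X <- scale(X, center=FALSE, scale=sqrt(colSums(X^2)))  # unit-length columns
s <- svd(X)
condition.indexes <- max(s$d) / s$d   # the largest is the condition number of X
# Var(beta_j) is proportional to sum_k v[j,k]^2 / d[k]^2; the proportions
# below attribute each coefficient's variance to the singular values.
phi <- s$v^2 %*% diag(1 / s$d^2)
vdp <- t(phi / rowSums(phi))          # rows: singular values; columns: coefficients
round(cbind(condition.indexes, vdp), 3)

A large condition index whose row contains high proportions for two or more coefficients points to the specific group of variables involved in a near-dependency.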
Anyone who has built regression models with hundreds of variables will appreciate this feature! It is one thing for the software to say "your data are collinear, I cannot proceed" or even to say "your data are collinear, I'm throwing out the following variables." It is altogether a much more useful thing for it to be able to say "the group of variables $X_{i_1}, \ldots, X_{i_k}$ is causing instabilities in the calculations: see which of those variables you can do without or consider performing a principal components analysis to reduce their number."
Ultimately, BKW recommend diagnosing collinearity by means of these condition indexes together with the associated variance-decomposition proportions.