I cannot reproduce this phenomenon exactly, but I can demonstrate that VIF does not necessarily increase as the number of categories increases.
The intuition is simple: categorical variables can be made orthogonal by suitable experimental designs, so in general there need be no relationship between the number of categories and multicollinearity.
Here is an R function to create categorical datasets with a specifiable number of categories for each of two independent variables and a specifiable amount of replication for each category. It represents a balanced study in which every combination of categories is observed an equal number of times, $n$:
library(car)  # provides vif()

trial <- function(n, k1=2, k2=2) {
  df <- expand.grid(1:k1, 1:k2)                      # one row per category combination (Var1, Var2)
  df <- do.call(rbind, lapply(1:n, function(i) df))  # replicate the balanced design n times
  df$y <- rnorm(k1*k2*n)                             # purely random response
  fit <- lm(y ~ Var1+Var2, data=df)
  vif(fit)
}
Applying it, I find the VIFs are always at their lowest possible value, $1$, reflecting the balance (which translates into orthogonal columns in the design matrix). Some examples:
sapply(1:5, trial) # Two binary categories, 1-5 replicates per combination
sapply(1:5, function(i) trial(i, 10, 3)) # 10 x 3 = 30 category combinations, 1-5 replicates
This suggests the multicollinearity may be growing due to a growing imbalance in the design. To test this, insert the line
df <- subset(df, subset=(y < 0))
before the fit line in trial; the full modified function appears at the end of this answer. This removes roughly half the data at random. Re-running
sapply(1:5, function(i) trial(i, 10, 3))
shows that the VIFs are no longer equal to $1$ (though they remain close to it, varying randomly). They still do not increase with more categories:
sapply(1:5, function(i) trial(i, 10, 10))
produces comparable values.
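For completeness, here is the modified function (a sketch redefining trial; the subset line is the only change):
trial <- function(n, k1=2, k2=2) {
  df <- expand.grid(1:k1, 1:k2)
  df <- do.call(rbind, lapply(1:n, function(i) df))
  df$y <- rnorm(k1*k2*n)
  df <- subset(df, subset=(y < 0))   # discard about half the rows at random
  fit <- lm(y ~ Var1+Var2, data=df)
  vif(fit)
}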
No. In this particular case, with two independent variables, it is not possible. Consider the model
$Y = \beta_1 X_1 + \beta_2 X_2 + \epsilon$
The VIF is calculated in a three-step procedure:
- Run an OLS regression of $X_1$ on $X_2$:
$X_1 = c_0 + \alpha X_2 + \epsilon$
- Calculate the VIF:
$VIF_i = \frac{1}{1-R^2_i}$, where $R^2_i$ is the $R^2$ of the auxiliary regression for $X_i$.
- Assess the VIF. What counts as a large VIF is a matter of convention: some people say $>4$, some $>10$, some $>15$.
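To make the steps concrete, here is a minimal sketch on simulated data (the variable names, seed, and coefficients are arbitrary; vif comes from the car package):
library(car)                          # for vif()
set.seed(1)
x1 <- rnorm(100)
x2 <- 0.6*x1 + rnorm(100)             # two correlated predictors
y  <- x1 + x2 + rnorm(100)
r2 <- summary(lm(x1 ~ x2))$r.squared  # step 1: auxiliary regression, extract its R^2
1 / (1 - r2)                          # step 2: VIF computed by hand
vif(lm(y ~ x1 + x2))                  # the same values from car::vif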
The correlation, in contrast, is computed as
$\rho_{X,Y} = \mathrm{corr}(X,Y) = \frac{\mathrm{cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E[(X-\mu_X)(Y-\mu_Y)]}{\sigma_X \sigma_Y}$
You should not worry if the correlation is between $-0.5$ and $0.5$. Some people even say that a correlation between $-0.7$ (or $-0.8$) and $0.7$ (or $0.8$) is no major problem.
Note that both measures capture only the linear relationship between $X_1$ and $X_2$, so they cannot point in completely different directions. In fact, with only two independent variables the auxiliary $R^2_1$ equals $\rho_{X_1,X_2}^2$, so the VIF is a monotone function of the squared correlation.
If the correlation and the VIF nevertheless seem contradictory, I propose the following checks.
- What happens if you eliminate a variable? Do the following regressions yield results different from the full model? If so, there may be multicollinearity.
$Y = \beta_1 X_1 + \epsilon$
$Y = \beta_2 X_2 + \epsilon$
- Apply a ridge regression, which is more robust to multicollinearity than OLS. If the results differ, there may be multicollinearity (see the sketch after this list).
- Are the variables logically related? E.g., if the two variables are the weight and height of people, then you already know without running a regression that taller people are presumably heavier.
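A minimal sketch of the ridge comparison, assuming MASS::lm.ridge and an arbitrary penalty (the simulated data are illustrative only):
library(MASS)                             # for lm.ridge()
set.seed(2)
x1 <- rnorm(100)
x2 <- x1 + 0.1*rnorm(100)                 # nearly collinear predictors
y  <- x1 + x2 + rnorm(100)
coef(lm(y ~ x1 + x2))                     # OLS coefficients, unstable under collinearity
coef(lm.ridge(y ~ x1 + x2, lambda = 1))   # ridge coefficients; large shifts hint at multicollinearity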
Best Answer
I would like to suggest that you calculate the diagonal elements of the matrix $(R_{XX}+cI)^{-1}R_{XX}(R_{XX}+cI)^{-1}$ directly.
Assume the design matrix $X$ is centered and scaled, so that $X'X$ equals the correlation matrix $R_{XX}$, and adopt the eigenvalue decomposition $R_{XX}=X'X=T\Lambda T'$.
$\begin{align} (R_{XX}+cI)^{-1}R_{XX}(R_{XX}+cI)^{-1}&=(R_{XX}+cI)^{-1}(R_{XX}+cI)(R_{XX}+cI)^{-1}-c(R_{XX}+cI)^{-1}(R_{XX}+cI)^{-1}\\ &=(R_{XX}+cI)^{-1}-c(R_{XX}+cI)^{-1}(R_{XX}+cI)^{-1} \\ &=(T\Lambda T'+cTT')^{-1}-c(T\Lambda T'+cTT')^{-1}(T\Lambda T'+cTT')^{-1}\\ &=T\left( (\Lambda+cI)^{-1}-c (\Lambda+cI)^{-1} (\Lambda+cI)^{-1} \right)T' \end{align}$
The matrix $(\Lambda+cI)^{-1}$ is diagonal with $i$th element $\frac{1}{\lambda_i+c}$, so the matrix $(\Lambda+cI)^{-1}-c(\Lambda+cI)^{-1}(\Lambda+cI)^{-1}$ is also diagonal, with $i$th element $\frac{1}{\lambda_i+c}-\frac{c}{(\lambda_i+c)^2}=\frac{\lambda_i}{(\lambda_i+c)^2}$.
In OLS, it is known that the VIFs are the diagonal elements of the matrix $T\Lambda^{-1}T'=R_{XX}^{-1}$. Comparing $\Lambda^{-1}$ with its ridge counterpart $(\Lambda+cI)^{-1}-c(\Lambda+cI)^{-1}(\Lambda+cI)^{-1}$, every diagonal element in the ridge case is deflated by the factor $\frac{\lambda_i^2}{(\lambda_i+c)^2}$, the ratio of $\frac{\lambda_i}{(\lambda_i+c)^2}$ to $\frac{1}{\lambda_i}$.
We can therefore conclude that the larger the ridge constant $c$, the more the VIFs are deflated.
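A quick numerical check of this result (a sketch; the simulated data and the value of $c$ are arbitrary):
set.seed(3)
n <- 100
X <- matrix(rnorm(2*n), n, 2)
X[, 2] <- X[, 2] + 0.9*X[, 1]                  # induce correlation between the columns
X <- scale(X) / sqrt(n - 1)                    # center and scale so that X'X = R_XX
R <- crossprod(X)                              # the correlation matrix R_XX
e <- eigen(R)
Tm <- e$vectors; lam <- e$values               # T and Lambda from the text
c0 <- 0.1                                      # ridge constant c
diag(solve(R))                                 # OLS VIFs: diag(T Lambda^{-1} T')
diag(Tm %*% diag(lam/(lam + c0)^2) %*% t(Tm))  # ridge VIFs, deflated by lambda_i^2/(lambda_i+c)^2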