You seem to include the interaction term `ub:lb`, but not `ub` and `lb` themselves as separate predictors. This violates the so-called "principle of marginality", which states that a model containing a higher-order term should also contain the lower-order terms it is built from (see Wikipedia for a start). Effectively, you are now including a predictor that is just the element-wise product of `ub` and `lb`.
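To make the distinction concrete, here is a minimal sketch with made-up data (`ub` and `lb` are just the names from your question):

```r
set.seed(1)
n  <- 100
ub <- rnorm(n)
lb <- rnorm(n)
y  <- 1 + 0.5 * ub - 0.3 * lb + 0.2 * ub * lb + rnorm(n)

## violates marginality: only the element-wise product ub*lb enters
m_bad  <- lm(y ~ ub:lb)

## respects marginality: ub * lb expands to ub + lb + ub:lb
m_good <- lm(y ~ ub * lb)
```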
$VIF_{j}$ is just $\frac{1}{1-R_{j}^{2}}$, where $R_{j}^{2}$ is the $R^{2}$ value when you run a regression with your original predictor variable $j$ as criterion, predicted by all remaining predictors (it is also the $j$-th diagonal element of $R_{x}^{-1}$, the inverse of the correlation matrix of the predictors). A VIF value of 50 thus indicates that you get an $R^{2}$ of .98 when predicting `ub` with the other predictors, meaning that `ub` is almost completely redundant (same for `lb`, $R^{2}$ of .97).
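To see the equivalence directly, here is a hedged sketch with hypothetical data (only `car::vif()` is an actual library function; the variable names are made up):

```r
library(car)   # for vif()

set.seed(42)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.15)   # nearly redundant with x1
x3 <- rnorm(n)
y  <- x1 + x3 + rnorm(n)

fit <- lm(y ~ x1 + x2 + x3)

## VIF for x1 "by hand": regress x1 on the remaining predictors
r2_x1 <- summary(lm(x1 ~ x2 + x3))$r.squared
1 / (1 - r2_x1)
vif(fit)   # the x1 entry should match the value above
```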
I would start by computing all pairwise correlations between predictors, and by running the aforementioned regressions to see which variables predict `ub` and `lb`, to check whether the redundancy is easily explained. If so, you can remove the redundant predictors. You can also look into ridge regression (`lm.ridge()` from package `MASS` in R).
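A sketch of both diagnostics, continuing the hypothetical data from above (the penalty grid for the ridge fit is arbitrary):

```r
library(MASS)   # for lm.ridge()

## pairwise correlations between the predictors
cor(cbind(x1, x2, x3))

## the regression behind the VIF of x1
summary(lm(x1 ~ x2 + x3))$r.squared

## ridge regression over a grid of penalties, lambda chosen by GCV
ridge <- lm.ridge(y ~ x1 + x2 + x3, lambda = seq(0, 10, by = 0.1))
ridge$lambda[which.min(ridge$GCV)]
```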
More advanced multicollinearity diagnostics use the eigenvalue structure of $X^{t}X$, where $X$ is the design matrix of the regression (i.e., all predictors as column vectors). The condition number $\kappa$ is $\frac{\sqrt{\lambda_{max}}}{\sqrt{\lambda_{min}}}$, where $\lambda_{max}$ and $\lambda_{min}$ are the largest and smallest ($\neq 0$) eigenvalues of $X^{t}X$. In R, you can use `kappa(lm(<formula>))`, where the `lm()` model typically uses the standardized variables.
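Both routes to $\kappa$, again on the hypothetical data (note that `kappa()` on an `lm` object works on the full model matrix, intercept included, so the two values need not agree exactly):

```r
## condition number from the eigenvalues of X'X (standardized predictors)
X   <- scale(cbind(x1, x2, x3))
evs <- eigen(crossprod(X))$values
sqrt(max(evs) / min(evs))

## R's built-in helper; exact = TRUE avoids the cheaper default approximation
kappa(lm(scale(y) ~ X), exact = TRUE)
```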
Geometrically, $\kappa$ gives you an idea about the shape of the data cloud formed by the predictors. With 2 predictors, the scatterplot might look like an ellipse with 2 main axes. $\kappa$ then tells you how "flat" that ellipse is, i.e., it is a measure of the ratio of the length of the largest main axis to that of the smallest one. With 3 predictors, you might have a cigar shape, and 3 main axes. The "flatter" your data cloud is in some direction, the more redundant the variables are when taken together.
There are some rules of thumb for uncritical values of $\kappa$ (I have heard less than 20). But be advised that $\kappa$ is not invariant under data transformations that just change the unit of the variables, like standardizing. This is unlike VIF: `vif(lm(y ~ x1 + x2))` will give you the same result as `vif(lm(scale(y) ~ scale(x1) + scale(x2)))` (as long as there are no multiplicative terms in the model), but `kappa(lm(y ~ x1 + x2))` and `kappa(lm(scale(y) ~ scale(x1) + scale(x2)))` will almost surely differ.
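You can verify both claims directly on the hypothetical data:

```r
## VIF is invariant to linear rescaling of the variables ...
vif(lm(y ~ x1 + x2))
vif(lm(scale(y) ~ scale(x1) + scale(x2)))

## ... but the condition number is not
kappa(lm(y ~ x1 + x2))
kappa(lm(scale(y) ~ scale(x1) + scale(x2)))
```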
Best Answer
The `vif()` in package car computes a generalised VIF (GVIF), which aims to account for the fact that multiple columns in the model matrix and multiple coefficients may be associated with a single covariate in the model (think polynomial terms).
It produces gibberish, however, for models estimated via `mgcv::gam()`, as it fails to identify properly all the terms that include a given covariate. The `mgcv.helper::vif.gam()` function computes a classic VIF for the parametric terms in a model only. To illustrate, I modify an example from the mgcv.helper `README.md`. Now we can compute the VIFs using car and mgcv.helper:
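The example code did not survive here, so below is a minimal sketch in its spirit; the seed, coefficients, and the GitHub path for installing mgcv.helper are my own assumptions, not taken from the README:

```r
library(mgcv)
library(car)
## mgcv.helper is on GitHub, not CRAN, e.g.:
## remotes::install_github("samclifford/mgcv.helper")
library(mgcv.helper)

set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)   # correlated with x1
x3 <- rnorm(n)
y  <- x1 + x2 + x3 + rnorm(n)
df <- data.frame(y, x1, x2, x3)

m <- gam(y ~ x1 + x2 + s(x3), data = df)

car::vif(m)              # GVIF machinery, misreads the gam terms
mgcv.helper::vif.gam(m)  # classic VIF for the parametric terms only
```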
As you can see, `car::vif.default()` gets this all wrong, as it can't identify even the parametric terms in the model. `vif.gam()` computes the classical VIF for the parametric terms only.
However, for GAMs, checking for collinearity is not sufficient. As we are now fitting smooth functions, we need to determine whether the smooth function of one variable can be produced using a combination of the smooths of the other terms in the model. I.e., we need to think about concurvity.
As to whether a VIF is too large or not, there are really only rules of thumb to follow. It is up to you, I believe, to decide if the degree of collinearity is too much. A useful way to think of the VIF is as follows: the square root of the VIF for the $j$-th covariate indicates by how much the confidence interval for $\hat{\beta}_j$ is expanded relative to uncorrelated data (if such data could exist). So, in the information returned by `vif.gam()`, the confidence interval for `x1` is inflated by about 1.4 times, and that for `x2` by 2.2 times.
But this of course does not consider any collinearity with `x3`, which is also estimated as a linear term (it uses 1 degree of freedom). If we refit with a linear term for `x3` and recompute the VIFs, the values for the other two variables of course do not change, but the refit does highlight the limitations of only considering the strictly parametric terms.
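A sketch of that refit, continuing the hypothetical example from above:

```r
## refit with x3 as a strictly parametric (linear) term
m2 <- gam(y ~ x1 + x2 + x3, data = df)
mgcv.helper::vif.gam(m2)
```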
If we use `concurvity()` from the mgcv package, we note the strong collinearity for `x3`, as well as that of the parametric terms, but we are not able to separate the collinearity in the parametric terms into individual estimates for `x1` and `x2`.
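The call itself (output omitted); `full = FALSE` gives the pairwise, term-by-term breakdown:

```r
## overall concurvity of each term with all remaining terms
concurvity(m, full = TRUE)

## pairwise, term-by-term breakdown
concurvity(m, full = FALSE)
```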