Solved – checking multicollinearity with generalized additive model in R

generalized-additive-model, multicollinearity, r, variance-inflation-factor

I am trying to check for multicollinearity in a GAM using the VIF in R.
Should I use vif() from the package car? Or is the right way to check the VIF to use vif.gam() from the package mgcv.helper?
They give quite different results, and the vif.gam() output does not include VIFs for the smoothed variables.
How should I tell from the VIF value whether there is a multicollinearity problem or not?
Also, I wonder how concurvity and multicollinearity differ.

Best Answer

The vif() in package car computes a generalised VIF (GVIF), which aims to account for the fact that multiple columns in the model matrix, and hence multiple coefficients, may be associated with a single covariate in the model (think polynomial terms).
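For a standard lm() this works as intended. Here is a quick, self-contained illustration with made-up data (the object names are just for illustration), purely to show the GVIF output when one term spans several coefficients:

library(car)

set.seed(1)
d <- data.frame(x1 = runif(50), x2 = runif(50))
d$y <- rnorm(n = 50, mean = d$x1 + d$x2)

## poly(x2, 2) contributes two model-matrix columns, so Df = 2 for that term
vif(lm(y ~ x1 + poly(x2, 2), data = d))

The result has columns GVIF, Df, and GVIF^(1/(2*Df)), the last being the GVIF adjusted for the number of coefficients associated with each term.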

It produces gibberish, however, for models estimated via mgcv::gam(), as it fails to properly identify all the terms that involve a given covariate.

The mgcv.helper::vif.gam() function computes a classic VIF for the parametric terms in a model only.

To illustrate, I'll modify an example from the mgcv.helper README.md:

library(mgcv)
library(dplyr)
library(mgcv.helper)

set.seed(101)

## construct three covariates; x3 is deliberately collinear with x1 and x2
N <- 100
x1 <- runif(n = N)
x2 <- runif(n = N)
x3 <- runif(n = N) + 0.9 * x1 - 1.75 * x2

df <- data.frame(x1 = x1,
                 x2 = x2,
                 x3 = x3) %>%
  mutate(y = rnorm(n = N,
                   mean = 1 - 2 * x1 + 3 * x2 - 0.5 * x3,
                   sd = 0.5))

## as per README.md except use a smooth of x3
fit1 <- gam(data=df, y ~ x1 + x2 + s(x3))

Now we can compute the VIFs using car and mgcv.helper:

> car:::vif.default(fit1)
       GVIF Df GVIF^(1/(2*Df))
x1 56758.74  0             Inf
x2 56758.74  0             Inf
x3 56758.74  0             Inf
> mgcv.helper::vif.gam(fit1)
# A tibble: 2 x 2
  variable   vif
  <chr>    <dbl>
1 x1        2.05
2 x2        4.90

As you can see, car:::vif.default() gets this all wrong, as it can't correctly identify even the parametric terms in the model. vif.gam() computes the classical VIF for the parametric terms only.
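For reference, the classical VIF for the $j$th covariate is $\mathrm{VIF}_j = 1 / (1 - R_j^2)$, where $R_j^2$ is the coefficient of determination from regressing $x_j$ on the remaining covariates.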

However, for GAMs, checking for collinearity is not sufficient. As we are now fitting smooth functions, we need to determine whether the smooth function of one variable can be produced by a combination of the smooths of the other terms in the model. That is, we need to think about concurvity.

As to whether a VIF is too large or not, there are really only rules of thumb to follow; it is up to you, I believe, to decide whether the degree of collinearity is too much. A useful way to think about the VIF is as follows: the square root of the VIF for the $j$th covariate indicates by how much the confidence interval for $\hat{\beta}_j$ is expanded relative to what it would be with uncorrelated data (if such data could exist). So, in the output returned by vif.gam(), the confidence interval for x1 is inflated by about 1.4 times, and that for x2 by about 2.2 times.
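As a quick check of those inflation factors:

> sqrt(c(x1 = 2.05, x2 = 4.90))
      x1       x2 
1.431782 2.213594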

But this is of course not considering any collinearity involving x3, which is itself effectively estimated as a linear term (the smooth s(x3) uses just 1 degree of freedom). So let's refit with x3 as a strictly linear, parametric term.
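The refit isn't shown in the README; assuming the same df as above, it would be simply:

## x3 now enters as a plain linear term rather than a smooth
fit2 <- gam(data = df, y ~ x1 + x2 + x3)

Recomputing the VIFs, we get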

> mgcv.helper::vif.gam(fit2)
# A tibble: 3 x 2
  variable   vif
  <chr>    <dbl>
1 x1        2.05
2 x2        4.90
3 x3        5.39

which doesn't change the VIFs for the other two variables of course, but does highlight the limitations of only considering the strictly parametric terms.

If we use concurvity() from the mgcv package, we see the strong concurvity for the smooth of x3:

> concurvity(fit1)
              para     s(x3)
worst    0.8941263 0.8286386
observed 0.8941263 0.8145965
estimate 0.8941263 0.6325963

as well as that of the parametric terms; however, we are not able to separate the collinearity among the parametric terms into individual estimates for x1 and x2.
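If you want term-by-term detail, concurvity() also has a pairwise mode:

## pairwise concurvity between each pair of model terms
## (returns worst / observed / estimate matrices)
concurvity(fit1, full = FALSE)

Note, though, that the strictly parametric terms are still pooled into a single para block, so even this won't separate x1 from x2 individually.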