Solved – VIF for generalized linear model

generalized-linear-model, multicollinearity, r, variance-inflation-factor

Is the variance inflation factor useful for GLMs? In the example below, OLS shows VIFs above 5 while the GLM shows lower values, yet the GLM coefficients are unstable between the training and test sets.

> library(MASS)
> set.seed(1)
> mu <- rep(0,4)
> Sigma <- matrix(.9, nrow=4, ncol=4) 
> diag(Sigma) <- 1
> rawvars <- mvrnorm(n=1000, mu=mu, Sigma=Sigma)
> d  <- as.ordered( as.numeric(rawvars[,1]>0.5) )
> d[1:200]  <- 1
> df  <- data.frame(rawvars, d)
> 
> ind  <- sample(1:nrow(df), 500)
> train  <- df[ind,]
> test  <- df[-ind,]
> 
> fit.lm <- lm(X4 ~ X1 + X2 + X3, data = train)
> 
> fit.lm.test <- lm(X4 ~ X1 + X2 + X3, data = test)
> summary(fit.lm)

Call:
lm(formula = X4 ~ X1 + X2 + X3, data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.16108 -0.24883  0.01356  0.25803  1.43455 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.02638    0.01698   1.554    0.121    
X1           0.30456    0.04243   7.178  2.6e-12 ***
X2           0.33006    0.04285   7.703  7.3e-14 ***
X3           0.33138    0.04477   7.402  5.8e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3788 on 496 degrees of freedom
Multiple R-squared:  0.8553,    Adjusted R-squared:  0.8544 
F-statistic: 977.1 on 3 and 496 DF,  p-value: < 2.2e-16

> summary(fit.lm.test)

Call:
lm(formula = X4 ~ X1 + X2 + X3, data = test)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.06717 -0.25019  0.00594  0.26794  1.14282 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.02660    0.01669  -1.594    0.112    
X1           0.33038    0.04216   7.836 2.85e-14 ***
X2           0.35636    0.04044   8.811  < 2e-16 ***
X3           0.27625    0.04125   6.697 5.78e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3727 on 496 degrees of freedom
Multiple R-squared:  0.8806,    Adjusted R-squared:  0.8799 
F-statistic:  1219 on 3 and 496 DF,  p-value: < 2.2e-16

> 
> car::vif(fit.lm)
      X1       X2       X3 
6.260418 6.329868 6.480232 
> car::vif(fit.lm.test)
      X1       X2       X3 
7.252877 7.097506 7.138707 
> 
> fit.glm <- glm(d ~ X1 + X2 + X3, data = train, family = binomial(link = "probit"))
> fit.glm.test <- glm(d ~ X1 + X2 + X3, data = test, family = binomial(link = "probit"))
> summary(fit.glm)

Call:
glm(formula = d ~ X1 + X2 + X3, family = binomial(link = "probit"), 
    data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7122  -0.8662  -0.2447   0.7068   3.7270  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.08018    0.06551  -1.224   0.2210    
X1           0.78783    0.17028   4.627 3.71e-06 ***
X2           0.31422    0.16544   1.899   0.0575 .  
X3          -0.03526    0.17270  -0.204   0.8382    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 689.62  on 499  degrees of freedom
Residual deviance: 475.49  on 496  degrees of freedom
AIC: 483.49

Number of Fisher Scoring iterations: 5

> summary(fit.glm.test)

Call:
glm(formula = d ~ X1 + X2 + X3, family = binomial(link = "probit"), 
    data = test)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4552  -0.9145  -0.3013   0.7889   2.8701  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.10849    0.06398  -1.696   0.0899 .  
X1           0.77867    0.16649   4.677 2.91e-06 ***
X2           0.01052    0.15355   0.068   0.9454    
X3           0.09631    0.15826   0.609   0.5428    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 690.83  on 499  degrees of freedom
Residual deviance: 505.10  on 496  degrees of freedom
AIC: 513.1

Number of Fisher Scoring iterations: 5

> 
> car::vif(fit.glm)
      X1       X2       X3 
3.826678 3.734214 4.022859 
> car::vif(fit.glm.test)
      X1       X2       X3 
4.633032 4.617873 4.717568 

Best Answer

If we look at the function

library(car)
getS3method("vif", "default")
#R function (mod, ...) 
#R {
#R     v <- vcov(mod)
#R     assign <- attr(model.matrix(mod), "assign")
#R     [...]
#R     terms <- labels(terms(mod))
#R     n.terms <- length(terms)
#R     [...]
#R     R <- cov2cor(v)
#R     detR <- det(R)
#R     result <- matrix(0, n.terms, 3)
#R     rownames(result) <- terms
#R     colnames(result) <- c("GVIF", "Df", "GVIF^(1/(2*Df))")
#R     for (term in 1:n.terms) {
#R      subs <- which(assign == term)
#R      result[term, 1] <- det(as.matrix(R[subs, subs])) * det(as.matrix(R[-subs, 
#R          -subs]))/detR
#R      result[term, 2] <- length(subs)
#R     }
#R     if (all(result[, 2] == 1)) 
#R      result <- result[, 1]
#R     else result[, 3] <- result[, 1]^(1/(2 * result[, 2]))
#R     result
#R }

then we see that it calls vcov, which differs between a glm and an lm: in the GLM case the estimated covariance matrix of the coefficients depends on the outcome, not just on the design. That is why you get different results. All of the above is consistent with the 1992 article

Fox, J., & Monette, G. (1992). Generalized collinearity diagnostics. Journal of the American Statistical Association, 87(417), 178-183.

in the linear model case. See in particular their Equation (10) and the line

#R      result[term, 1] <- det(as.matrix(R[subs, subs])) * det(as.matrix(R[-subs, 
#R          -subs]))/detR
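As a sanity check, that formula can be applied by hand to the probit fit from the question. This is a minimal sketch; it assumes the elided [...] lines do nothing more than drop the intercept row/column from vcov (which is what the full vif.default does for a model with an intercept), and it skips the Df bookkeeping because every term here has a single coefficient.

v <- vcov(fit.glm)[-1, -1]   # drop the intercept row/column, as vif.default does
R <- cov2cor(v)              # correlation matrix of the coefficient estimates
detR <- det(R)
sapply(seq_len(ncol(R)), function(i)
  det(as.matrix(R[i, i])) * det(as.matrix(R[-i, -i])) / detR)
# roughly 3.83 3.73 4.02 -- the values car::vif(fit.glm) returned above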

As to the question

Is the variance inflation factor useful for GLMs?

My guess is that the results in the 1992 article may still hold asymptotically, but some pen-and-paper work is likely needed to justify this claim, and I may be wrong.
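One way to see the outcome-dependence concretely (a minimal sketch, using the training data from the question): the classical VIF of a predictor, 1/(1 - R^2) from regressing it on the remaining predictors, never touches the response, so it gives the same number regardless of whether an lm or a glm is fitted afterwards.

# design-only VIF for X1: regress X1 on the other predictors and invert 1 - R^2;
# the response (d or X4) never enters this computation
1 / (1 - summary(lm(X1 ~ X2 + X3, data = train))$r.squared)
# about 6.26, matching car::vif(fit.lm)["X1"] above

For the linear model, cov2cor(vcov(fit.lm)) reduces to exactly this because the scalar sigma^2 cancels in the correlation; for the GLM the IRLS working weights inside vcov do not cancel, which is why car::vif(fit.glm) moves with the outcome.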