Solved – VIF for generalized linear model

generalized-linear-model, multicollinearity, r, variance-inflation-factor

Is the variance inflation factor useful for GLMs? In the example below, OLS shows VIFs above 5 while the GLM shows lower values, yet the GLM coefficients are unstable between the training and test sets.

> library(MASS)
> set.seed(1)
> mu <- rep(0,4)
> Sigma <- matrix(.9, nrow=4, ncol=4) 
> diag(Sigma) <- 1
> rawvars <- mvrnorm(n=1000, mu=mu, Sigma=Sigma)
> d  <- as.ordered( as.numeric(rawvars[,1]>0.5) )
> d[1:200]  <- 1
> df  <- data.frame(rawvars, d)
> 
> ind  <- sample(1:nrow(df), 500)
> train  <- df[ind,]
> test  <- df[-ind,]
> 
> fit.lm <- lm(X4 ~ X1 + X2 + X3, data = train)
> 
> fit.lm.test <- lm(X4 ~ X1 + X2 + X3, data = test)
> summary(fit.lm)

Call:
lm(formula = X4 ~ X1 + X2 + X3, data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.16108 -0.24883  0.01356  0.25803  1.43455 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.02638    0.01698   1.554    0.121    
X1           0.30456    0.04243   7.178  2.6e-12 ***
X2           0.33006    0.04285   7.703  7.3e-14 ***
X3           0.33138    0.04477   7.402  5.8e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3788 on 496 degrees of freedom
Multiple R-squared:  0.8553,    Adjusted R-squared:  0.8544 
F-statistic: 977.1 on 3 and 496 DF,  p-value: < 2.2e-16

> summary(fit.lm.test)

Call:
lm(formula = X4 ~ X1 + X2 + X3, data = test)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.06717 -0.25019  0.00594  0.26794  1.14282 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.02660    0.01669  -1.594    0.112    
X1           0.33038    0.04216   7.836 2.85e-14 ***
X2           0.35636    0.04044   8.811  < 2e-16 ***
X3           0.27625    0.04125   6.697 5.78e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.3727 on 496 degrees of freedom
Multiple R-squared:  0.8806,    Adjusted R-squared:  0.8799 
F-statistic:  1219 on 3 and 496 DF,  p-value: < 2.2e-16

> 
> car::vif(fit.lm)
      X1       X2       X3 
6.260418 6.329868 6.480232 
> car::vif(fit.lm.test)
      X1       X2       X3 
7.252877 7.097506 7.138707 
> 
> fit.glm <- glm(d ~ X1 + X2 + X3, data = train, family = binomial(link = "probit"))
> fit.glm.test <- glm(d ~ X1 + X2 + X3, data = test, family = binomial(link = "probit"))
> summary(fit.glm)

Call:
glm(formula = d ~ X1 + X2 + X3, family = binomial(link = "probit"), 
    data = train)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.7122  -0.8662  -0.2447   0.7068   3.7270  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.08018    0.06551  -1.224   0.2210    
X1           0.78783    0.17028   4.627 3.71e-06 ***
X2           0.31422    0.16544   1.899   0.0575 .  
X3          -0.03526    0.17270  -0.204   0.8382    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 689.62  on 499  degrees of freedom
Residual deviance: 475.49  on 496  degrees of freedom
AIC: 483.49

Number of Fisher Scoring iterations: 5

> summary(fit.glm.test)

Call:
glm(formula = d ~ X1 + X2 + X3, family = binomial(link = "probit"), 
    data = test)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4552  -0.9145  -0.3013   0.7889   2.8701  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -0.10849    0.06398  -1.696   0.0899 .  
X1           0.77867    0.16649   4.677 2.91e-06 ***
X2           0.01052    0.15355   0.068   0.9454    
X3           0.09631    0.15826   0.609   0.5428    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 690.83  on 499  degrees of freedom
Residual deviance: 505.10  on 496  degrees of freedom
AIC: 513.1

Number of Fisher Scoring iterations: 5

> 
> car::vif(fit.glm)
      X1       X2       X3 
3.826678 3.734214 4.022859 
> car::vif(fit.glm.test)
      X1       X2       X3 
4.633032 4.617873 4.717568 

Best Answer

If we look at the function

library(car)
getS3method("vif", "default")
#R function (mod, ...) 
#R {
#R     v <- vcov(mod)
#R     assign <- attr(model.matrix(mod), "assign")
#R     [...]
#R     terms <- labels(terms(mod))
#R     n.terms <- length(terms)
#R     [...]
#R     R <- cov2cor(v)
#R     detR <- det(R)
#R     result <- matrix(0, n.terms, 3)
#R     rownames(result) <- terms
#R     colnames(result) <- c("GVIF", "Df", "GVIF^(1/(2*Df))")
#R     for (term in 1:n.terms) {
#R      subs <- which(assign == term)
#R      result[term, 1] <- det(as.matrix(R[subs, subs])) * det(as.matrix(R[-subs, 
#R          -subs]))/detR
#R      result[term, 2] <- length(subs)
#R     }
#R     if (all(result[, 2] == 1)) 
#R      result <- result[, 1]
#R     else result[, 3] <- result[, 1]^(1/(2 * result[, 2]))
#R     result
#R }

then we see that it calls vcov, which differs between a glm and an lm: in the GLM case the estimated covariance matrix of the coefficients depends on the outcome, not just on the design. That is why you get different results. All of the above is consistent with the 1992 article

Fox, J., & Monette, G. (1992). Generalized collinearity diagnostics. Journal of the American Statistical Association, 87(417), 178-183.

in the linear model case. See in particular their Equation (10) and the line

#R      result[term, 1] <- det(as.matrix(R[subs, subs])) * det(as.matrix(R[-subs, 
#R          -subs]))/detR
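As a sanity check, that formula can be applied by hand to the probit fit from the question. This is a minimal sketch; it assumes the elided [...] lines do nothing more than drop the intercept row/column from vcov (which is what the full vif.default does for a model with an intercept), and it skips the Df bookkeeping because every term here has a single coefficient.

v <- vcov(fit.glm)[-1, -1]   # drop the intercept row/column, as vif.default does
R <- cov2cor(v)              # correlation matrix of the coefficient estimates
detR <- det(R)
sapply(seq_len(ncol(R)), function(i)
  det(as.matrix(R[i, i])) * det(as.matrix(R[-i, -i])) / detR)
# roughly 3.83 3.73 4.02 -- the values car::vif(fit.glm) returned above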

As to the question

Is the variance inflation factor useful for GLMs?

My guess is that the results in the 1992 article may still hold asymptotically, but some pen-and-paper work is likely needed to justify this claim, and I may be wrong.
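One way to see the outcome-dependence concretely (a minimal sketch, using the training data from the question): the classical VIF of a predictor, 1/(1 - R^2) from regressing it on the remaining predictors, never touches the response, so it gives the same number regardless of whether an lm or a glm is fitted afterwards.

# design-only VIF for X1: regress X1 on the other predictors and invert 1 - R^2;
# the response (d or X4) never enters this computation
1 / (1 - summary(lm(X1 ~ X2 + X3, data = train))$r.squared)
# about 6.26, matching car::vif(fit.lm)["X1"] above

For the linear model, cov2cor(vcov(fit.lm)) reduces to exactly this because the scalar sigma^2 cancels in the correlation; for the GLM the IRLS working weights inside vcov do not cancel, which is why car::vif(fit.glm) moves with the outcome.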