Regression – Calculating Variance Inflation Factor for Logistic Regression Using Statsmodels in Python

logistic, python, regression, statsmodels, variance-inflation-factor

I am building a logistic regression model with Statsmodels while following the book "Discovering Statistics Using R" by Andy Field, Jeremy Miles, and Zoë Field. While working through an example, I went on to calculate the VIF to check for multicollinearity between the variables in the logistic regression model, using the following code:

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant



pen_df = pd.read_csv('penalty.csv')
pen_df.drop(['Unnamed: 4'], inplace=True, axis=1)
pen_df['Scoredx'] = pen_df['Scored'].replace({'Scored': 1, 'Missed': 0})
pen_df = add_constant(pen_df)



p02 = sm.Logit(pen_df['Scoredx'], pen_df[['const', 'PSWQ', 'Previous', 'Anxious']]).fit()


copy_df = pen_df.copy()
copy_df.drop(['Scored','Scoredx'], inplace=True, axis=1)
vif = pd.Series([variance_inflation_factor(copy_df.values, i)
                 for i in range(1, copy_df.shape[1])],
                index=copy_df.columns[1:])

print(vif)

This gave the output shown in the attached screenshot.

However, the output in the book differs, as shown in the second screenshot.

Upon going through the answer by Alexander in this post and this documentation, I have come to understand that VIF in statsmodels uses OLS, which may explain the discrepancy in my answer. I want to know how to calculate the VIF in this case (a logit model) using statsmodels, or Python more generally, so as to match the answer given in the book.

I have attached the data file in case it is useful for reproducibility.
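For reference, what `variance_inflation_factor` computes on a design matrix can be reproduced by hand: it is the diagonal of the inverse correlation matrix of the predictors, an OLS-based quantity that never looks at the fitted logit model. A minimal sketch on synthetic data (not the penalty data from the question):

```python
import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.5, size=200)   # deliberately correlated with x1
x3 = rng.normal(size=200)
X = add_constant(np.column_stack([x1, x2, x3]))

# statsmodels' design-matrix VIF (skip the constant column)
vifs = [variance_inflation_factor(X, i) for i in range(1, X.shape[1])]

# The same numbers from the inverse of the predictor correlation matrix
corr = np.corrcoef(X[:, 1:], rowvar=False)
vifs_manual = np.diag(np.linalg.inv(corr))

print(np.allclose(vifs, vifs_manual))
```

This equivalence is why the design-matrix VIF is identical whether the response is fitted by OLS or by a logit model.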

Best Answer

Reverse engineering what the R package car does for the vif of a GLM: the computation is based on the covariance matrix of the parameter estimates. car also uses the generalized VIF (GVIF), which is defined for model terms rather than for single columns of the design matrix. In this example every term is a single column, so the distinction makes no difference.
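For terms spanning several columns (e.g. a multi-level factor), car reports the generalized VIF of Fox & Monette (1992): GVIF = det(R11) · det(R22) / det(R), where R is the correlation matrix of the coefficients, R11 its block for the term's columns, and R22 the block for the remaining columns. A sketch of that formula (the correlation matrix below is made up for illustration):

```python
import numpy as np

def gvif(corr, term_idx):
    """Generalized VIF (Fox & Monette 1992) for the columns in term_idx."""
    all_idx = np.arange(corr.shape[0])
    rest = np.setdiff1d(all_idx, term_idx)
    r11 = corr[np.ix_(term_idx, term_idx)]
    r22 = corr[np.ix_(rest, rest)]
    return np.linalg.det(r11) * np.linalg.det(r22) / np.linalg.det(corr)

corr = np.array([[1.0, 0.8, 0.1],
                 [0.8, 1.0, 0.2],
                 [0.1, 0.2, 1.0]])

# For a single-column term the GVIF reduces to the ordinary VIF,
# i.e. the corresponding diagonal element of the inverse correlation matrix.
print(gvif(corr, np.array([0])))
print(np.diag(np.linalg.inv(corr))[0])
```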

Corresponding Python code for the column-wise vif, based on the model estimated with statsmodels, is:

import numpy as np

# correlation matrix of the parameter estimates
cov = p02.cov_params()
bse = np.asarray(p02.bse)
corr = np.asarray(cov) / bse / bse[:, None]

# drop the constant, invert, and reorder to match the book's output
np.diag(np.linalg.inv(corr[1:, 1:]))[[1, 0, 2]]

array([35.22707635,  1.08976625, 35.58192988])

statsmodels currently only has a vif function based on the original design matrix.

(I have not yet seen a reference for GLM vif that provides background for this computation using covariance of parameter estimates.)