Solved – importance of each predictor in logistic regression

chi-squared-test, logistic, r

I recently read a paper that fitted a logistic regression and used a table like this to summarise the model:

data.frame(predictors = c("drat", "mpg"), "chi squared statistic" = c("x", "x"), "p-value" = c("x", "x"))

  predictors chi.squared.statistic p.value
1       drat                     x       x
2        mpg                     x       x

The data.frame presented the chi-squared statistic and p-value (the x's) for each predictor in the logistic regression model, making it possible to see at a glance which predictors were most important.

I have made this logistic regression model:

mtcars_log_reg <- glm(vs ~ drat + mpg, data = mtcars, family = "binomial")

How can I fill in the x's for the above table using R code?

Best Answer

Doing the chi-square test in R

Well, it’s very easy to do:

> drop1(mtcars_log_reg, test = "LRT") # or test = "Chisq"

Single term deletions

Model:
vs ~ drat + mpg
       Df Deviance    AIC     LRT  Pr(>Chi)    
<none>      25.351 31.351                      
drat    1   25.533 29.533  0.1825 0.6692659    
mpg     1   37.159 41.159 11.8079 0.0005898 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
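
To fill in the x's from the question, you can pull those columns straight out of the drop1() result; a minimal sketch (the column names are just chosen to match the question's table):

# Extract the LRT statistics and p-values for each predictor;
# the first row ("<none>") is the full model, so we drop it
lrt <- drop1(mtcars_log_reg, test = "LRT")
data.frame(predictors            = rownames(lrt)[-1],
           chi.squared.statistic = lrt$LRT[-1],
           p.value               = lrt$`Pr(>Chi)`[-1])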

But it doesn’t answer the question of which predictors are the ‘most important’ (which is in general a very difficult question to answer, and it may not even be a meaningful question).

Note that usually the results are very similar to the Wald tests you get with summary():

> summary(mtcars_log_reg)

Call:
glm(formula = vs ~ drat + mpg, family = "binomial", data = mtcars)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.1297  -0.5415  -0.1981   0.6142   1.8299  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)  
(Intercept)  -7.7555     4.0002  -1.939   0.0525 .
drat         -0.5619     1.3327  -0.422   0.6733  
mpg           0.4770     0.2005   2.380   0.0173 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)
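
If you want the Wald tests on the same chi-square scale, each Wald chi-square statistic is simply the squared z value; a quick sketch:

# The Wald chi-square for each coefficient is its squared z value
wald <- coef(summary(mtcars_log_reg))[-1, ]   # drop the intercept row
wald[, "z value"]^2                           # Wald chi-square statistics
wald[, "Pr(>|z|)"]                            # Wald p-values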

What are we really doing here?

To keep the answer more statistical and less about R, let me explain what we are doing here. For each variable var, we are comparing a logistic model containing all predictors with the corresponding model containing all predictors except var. The two models are nested, so we can use a likelihood ratio test (LRT).

Basically, twice the difference in log likelihood between the two models will have an approximate chi-square distribution with degrees of freedom equal to the difference in degrees of freedom for the two models.
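
In symbols, writing $\ell$ for a model's maximised log-likelihood:

$$\mathrm{LRT} = 2\,(\ell_{\text{full}} - \ell_{\text{reduced}}) \;\overset{\text{approx.}}{\sim}\; \chi^2_{\Delta\text{df}}$$

For the variable mpg, we have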

# Fit the full and simplified models
> mod.full = glm(vs ~ drat + mpg, data = mtcars, family = binomial)
> mod.simple = glm(vs ~ drat, data = mtcars, family = binomial)

# Extract the log likelihood values for each model
> ( ll.full = logLik(mod.full) )
'log Lik.' -12.67544 (df=3)
> ( ll.simple = logLik(mod.simple) )
'log Lik.' -18.5794 (df=2)

# Calculate the likelihood ratio test statistic
> ( lrt.stat = c(2*(ll.full - ll.simple)) )
[1] 11.80793

# Calculate the P-value
> pchisq(lrt.stat, df = 3-2, lower.tail = FALSE)
[1] 0.0005897904

which is the same test statistic and P-value as above (with a few extra digits shown).
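
The same comparison can also be done in a single call with anova() on the two nested models, which reproduces the statistic and P-value directly:

# Equivalent one-call version of the manual calculation above
> anova(mod.simple, mod.full, test = "LRT")
Analysis of Deviance Table

Model 1: vs ~ drat
Model 2: vs ~ drat + mpg
  Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
1        30     37.159                          
2        29     25.351  1   11.808 0.0005898 ***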

Note that when a predictor is continuous, the difference in degrees of freedom is always 1, but when it is a factor (categorical variable), the difference corresponds to the number of extra dummy variables (parameter estimates in the R output) used in the full model.
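
For instance, a hypothetical illustration: gear has three levels in mtcars, so it enters the model as two dummy variables, and its LRT gets 2 degrees of freedom:

# factor(gear) has 3 levels, so dropping it removes 2 coefficients
# and its row in the drop1() table shows Df = 2
mod.fac <- glm(vs ~ mpg + factor(gear), data = mtcars, family = binomial)
drop1(mod.fac, test = "LRT")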
