Logistic Regression – Understanding the Relative Importance of Predictors in Logistic Regression

Tags: importance, logistic, predictor, r, regression

I would like to calculate an estimate (even a very rough one, if that is the best I can get) of the relative importance of predictors in a logistic regression, something that would let me tell a layperson, not proficient in statistics (just like me), for example: these are the predictors x1, x2, x3, x4; they are all statistically significant, but as you can see, x2 is more important than x1, x3 and x4, because its value of "whatever the right measure is" is higher than that of the other predictors.

I read Relative importance of predictors in the final model, importance of each predictor in logistic regression, and How to quantify the Relative Variable Importance in Logistic Regression in terms of p?, but I did not find what I am looking for.
I use R and I know a package for this: caret (https://github.com/topepo/caret/) and its function varImp, but I cannot understand how the absolute value of the t-statistic is used, so I cannot interpret the values obtained through this formula.
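For concreteness, here is a minimal call of the kind I mean (the model is purely hypothetical, on the built-in mtcars data):

```r
library(caret)   # provides varImp()

# Hypothetical model: transmission type from weight and horsepower
fit <- glm(am ~ wt + hp, family = binomial, data = mtcars)

# For a glm, varImp() reports the absolute value of each coefficient's
# test statistic (the "t-statistic" the docs refer to; a z value here)
varImp(fit)
```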

  • The main question is: how can I tell, in practice, that one predictor of a logistic regression is more important than another?

  • Secondarily: how can I tell how much more important one predictor in that logistic regression is than another? Can you explain to me, a statistician wannabe, how the absolute value of the t-statistic could be helpful?

Best Answer

I assume all predictors have been standardized (i.e., centered and scaled by the sample standard deviations).

Let $\mathbf{x}$ be the vector of predictors and $y$ the response, conditionally Bernoulli-distributed given $\mathbf{x}$. Under the logistic model, $\mu=\mathbb{E}[y\mid\mathbf{x}]=p(y=1\mid\mathbf{x})=\frac{1}{1+\exp{(-\beta_0-\boldsymbol{\beta}^T\cdot \mathbf{x})}}$, and clearly

$$\frac{\partial \mu}{\partial x_i}=\beta_i \frac{\exp{(-\beta_0-\boldsymbol{\beta}^T\cdot \mathbf{x})}}{(1+\exp{(-\beta_0-\boldsymbol{\beta}^T\cdot \mathbf{x})})^2}$$

measures the effect of $x_i$ on $\mu$. This effect is a function of $\mathbf{x}$. However, the relative importance of two predictors is

$$\frac{\frac{\partial \mu}{\partial x_i}}{\frac{\partial \mu}{\partial x_j}}=\frac{\beta_i}{\beta_j}$$

which is independent of $\mathbf{x}$. Thus, provided we have standardized all predictors, we can read the estimated model coefficients as indicators of the relative importance of the predictors, as far as their effect on the output is concerned.
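As a quick numerical sanity check of this algebra (with made-up coefficients, nothing estimated), the ratio of partial effects of two predictors on $\mu$ does not depend on where we evaluate it:

```r
# Made-up logistic model
beta0 <- -0.5
beta  <- c(x1 = 0.8, x2 = -1.6)
mu    <- function(x) 1 / (1 + exp(-(beta0 + sum(beta * x))))

# Central finite difference of mu in coordinate i
dmu <- function(x, i, h = 1e-6) {
  e <- replace(numeric(length(x)), i, h)
  (mu(x + e) - mu(x - e)) / (2 * h)
}

x <- c(0.3, -1.2)
dmu(x, 1) / dmu(x, 2)   # ~ beta["x1"] / beta["x2"] = -0.5, for any x
```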

As an example application, I will adapt the case in Section 4.3.4 of An Introduction to Statistical Learning, by James, Witten, Hastie & Tibshirani. Suppose you have a data set Default of default rates for credit card owners, with predictors student (categorical), income and credit card balance (continuous). Standardize the predictors and fit a logistic regression model. Now you can use the relative magnitudes of the $\hat{\beta}_j$ to decide which predictor has a larger effect on the probability of default. This helps the credit card company decide to whom it should offer credit, which categories are riskier, which customer segment to target with an ad campaign, and so on.
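A minimal sketch of that workflow, assuming the Default data shipped with the ISLR package (variables default, student, balance, income):

```r
library(ISLR)   # provides the Default data used in ISL

d <- Default
# Standardize the continuous predictors; student is a factor, so its
# coefficient stays on the usual dummy-variable scale
d$balance <- scale(d$balance)
d$income  <- scale(d$income)

fit <- glm(default ~ student + balance + income,
           family = binomial, data = d)

# Larger |beta_j| means a larger effect of a one-SD change in x_j
# on the (logit of the) probability of default
sort(abs(coef(fit)[-1]), decreasing = TRUE)
```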

Finally, this paper lists six different definitions of relative predictor importance for logistic regression.

The first one is very similar to the one I showed, the only difference being that, instead of standardizing the predictors beforehand, they standardize the $\hat{\beta}_j$ after estimation by multiplying them by the ratio $\frac{s_j}{s_y}$, where $s_y$ is the response sample standard deviation and $s_j$ is the sample standard deviation of predictor $x_j$. It's not exactly the same as my suggestion, because the estimators for the logistic regression coefficients are nonlinear functions of the data, but the idea is similar.
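A sketch of that post-hoc standardization on the same hypothetical Default example, coding the binary variables as 0/1 to compute their sample SDs (my assumption; the paper may handle the categorical predictor differently):

```r
library(ISLR)

# Fit on the raw (unstandardized) scale
fit_raw <- glm(default ~ student + balance + income,
               family = binomial, data = Default)

s_y <- sd(Default$default == "Yes")          # SD of the 0/1 response
s_j <- c(studentYes = sd(Default$student == "Yes"),
         balance    = sd(Default$balance),
         income     = sd(Default$income))

beta_std <- coef(fit_raw)[-1] * s_j / s_y    # beta_j * s_j / s_y
sort(abs(beta_std), decreasing = TRUE)
```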

The second one (using the $p$-values from the Wald $\chi^2$ test) is flawed, as explained by @MatthewDrury in the comments to the OP, and shouldn't be used.

The third one (logistic pseudo partial correlation) can be a good choice, as long as, instead of the Wald $\chi^2$ statistic, the numerator of the pseudo partial correlation uses the ratio of the likelihood of the model with just predictor $x_i$ to that of the full model. I cannot comment on the other approaches, since I don't know enough about them.
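A sketch of computing that numerator on the same hypothetical Default example, shown on the log scale (the log of the ratio of the likelihood of each single-predictor model to that of the full model):

```r
library(ISLR)

# Full model on the raw scale
fit_full <- glm(default ~ student + balance + income,
                family = binomial, data = Default)
ll_full  <- as.numeric(logLik(fit_full))

# Log of L(x_i-only model) / L(full model), per predictor
sapply(c("student", "balance", "income"), function(v) {
  fit_v <- glm(reformulate(v, response = "default"),
               family = binomial, data = Default)
  as.numeric(logLik(fit_v)) - ll_full
})
```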