Logistic Regression – Understanding the Relative Importance of Predictors in Logistic Regression

Tags: importance, logistic, predictor, r, regression

I would like to calculate an estimate (even a very rough one, if that is the best I can get) of the relative importance of predictors in a logistic regression, something that would let me tell a layperson, not proficient in statistics (just like me), for example: these are the predictors x1, x2, x3, x4; they are all statistically significant, but as you can see, x2 is more important than x1, x3 and x4, because its value of "whatever the right measure is" is higher than that of the other predictors.

I read Relative importance of predictors in the final model, importance of each predictor in logistic regression, and How to quantify the Relative Variable Importance in Logistic Regression in terms of p?, but I did not find what I am looking for.
I use R and I know a package for this: caret (https://github.com/topepo/caret/) and its function varImp, but I cannot understand how the absolute value of the t-statistic is used, so I cannot interpret the values obtained through this formula.
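For concreteness, here is a minimal call of the kind I mean (the model is purely hypothetical, on the built-in mtcars data):

```r
library(caret)   # provides varImp()

# Hypothetical model: transmission type from weight and horsepower
fit <- glm(am ~ wt + hp, family = binomial, data = mtcars)

# For a glm, varImp() reports the absolute value of each coefficient's
# test statistic (the "t-statistic" the docs refer to; a z value here)
varImp(fit)
```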

  • The main question is: how can I tell, in practice, that one predictor of a logistic regression is more important than another?

  • Secondarily: how can I tell how much more important one predictor in that logistic regression is than another? Can you explain to me, a statistician wannabe, how the absolute value of the t-statistic could be helpful?

Best Answer

I assume all predictors have been standardized (i.e., centered and scaled by the sample standard deviations).

Let $\mathbf{x}$ be the vector of predictors and $y$ the response, conditionally Bernoulli-distributed given $\mathbf{x}$. Under the logistic model, $\mu=\mathbb{E}[y\mid\mathbf{x}]=p(y=1\mid\mathbf{x})=\frac{1}{1+\exp{(-\beta_0-\boldsymbol{\beta}^T\cdot \mathbf{x})}}$, and clearly

$$\frac{\partial \mu}{\partial x_i}=\beta_i \frac{\exp{(-\beta_0-\boldsymbol{\beta}^T\cdot \mathbf{x})}}{(1+\exp{(-\beta_0-\boldsymbol{\beta}^T\cdot \mathbf{x})})^2}$$

measures the effect of $x_i$ on $\mu$. This effect is a function of $\mathbf{x}$. However, the relative importance of two predictors is

$$\frac{\frac{\partial \mu}{\partial x_i}}{\frac{\partial \mu}{\partial x_j}}=\frac{\beta_i}{\beta_j}$$

which is independent of $\mathbf{x}$. Thus, provided we have standardized all predictors, we can read the estimated model coefficients as indicators of the relative importance of the predictors, as far as their effect on the output is concerned.
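As a quick numerical sanity check of this algebra (with made-up coefficients, nothing estimated), the ratio of partial effects of two predictors on $\mu$ does not depend on where we evaluate it:

```r
# Made-up logistic model
beta0 <- -0.5
beta  <- c(x1 = 0.8, x2 = -1.6)
mu    <- function(x) 1 / (1 + exp(-(beta0 + sum(beta * x))))

# Central finite difference of mu in coordinate i
dmu <- function(x, i, h = 1e-6) {
  e <- replace(numeric(length(x)), i, h)
  (mu(x + e) - mu(x - e)) / (2 * h)
}

x <- c(0.3, -1.2)
dmu(x, 1) / dmu(x, 2)   # ~ beta["x1"] / beta["x2"] = -0.5, for any x
```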

As an example application, I will adapt the case in Section 4.3.4 of An Introduction to Statistical Learning, by James, Witten, Hastie & Tibshirani. Suppose you have a data set Default of default rates for credit card owners, with predictors student (categorical), income and credit card balance (continuous). Standardize the predictors and fit a logistic regression model. Now you can use the relative magnitudes of the $\hat{\beta}_j$ to decide which predictor has a larger effect on the probability of default. This helps the credit card company decide to whom it should offer credit, which categories are riskier, which customer segment to target with an ad campaign, and so on.
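A minimal sketch of that workflow, assuming the Default data shipped with the ISLR package (variables default, student, balance, income):

```r
library(ISLR)   # provides the Default data used in ISL

d <- Default
# Standardize the continuous predictors; student is a factor, so its
# coefficient stays on the usual dummy-variable scale
d$balance <- scale(d$balance)
d$income  <- scale(d$income)

fit <- glm(default ~ student + balance + income,
           family = binomial, data = d)

# Larger |beta_j| means a larger effect of a one-SD change in x_j
# on the (logit of the) probability of default
sort(abs(coef(fit)[-1]), decreasing = TRUE)
```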

Finally, this paper lists six different definitions of relative predictor importance for logistic regression.

The first one is very similar to the one I showed, the only difference being that, instead of standardizing the predictors beforehand, they standardize the $\hat{\beta}_j$ after estimation by multiplying them by the ratio $\frac{s_j}{s_y}$, where $s_y$ is the response sample standard deviation and $s_j$ is the sample standard deviation of predictor $x_j$. It's not exactly the same as my suggestion, because the estimators for the logistic regression coefficients are nonlinear functions of the data, but the idea is similar.
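A sketch of that post-hoc standardization on the same hypothetical Default example, coding the binary variables as 0/1 to compute their sample SDs (my assumption; the paper may handle the categorical predictor differently):

```r
library(ISLR)

# Fit on the raw (unstandardized) scale
fit_raw <- glm(default ~ student + balance + income,
               family = binomial, data = Default)

s_y <- sd(Default$default == "Yes")          # SD of the 0/1 response
s_j <- c(studentYes = sd(Default$student == "Yes"),
         balance    = sd(Default$balance),
         income     = sd(Default$income))

beta_std <- coef(fit_raw)[-1] * s_j / s_y    # beta_j * s_j / s_y
sort(abs(beta_std), decreasing = TRUE)
```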

The second one (using the $p$-values from the Wald $\chi^2$ test) is flawed, as explained by @MatthewDrury in the comments to the OP, and shouldn't be used.

The third one (logistic pseudo partial correlation) can be a good choice, as long as, instead of the Wald $\chi^2$ statistic, the numerator of the pseudo partial correlation uses the ratio of the likelihood of the model with just predictor $x_i$ to that of the full model. I cannot comment on the other approaches, since I don't know enough about them.
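A sketch of computing that numerator on the same hypothetical Default example, shown on the log scale (the log of the ratio of the likelihood of each single-predictor model to that of the full model):

```r
library(ISLR)

# Full model on the raw scale
fit_full <- glm(default ~ student + balance + income,
                family = binomial, data = Default)
ll_full  <- as.numeric(logLik(fit_full))

# Log of L(x_i-only model) / L(full model), per predictor
sapply(c("student", "balance", "income"), function(v) {
  fit_v <- glm(reformulate(v, response = "default"),
               family = binomial, data = Default)
  as.numeric(logLik(fit_v)) - ll_full
})
```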