Solved – Understanding which features were most important for logistic regression

feature-selection, importance, logistic, machine-learning

I've built a logistic regression classifier that is very accurate on my data. Now I want to understand better why it is working so well. Specifically, I'd like to rank which features are making the biggest contribution (which features are most important) and, ideally, quantify how much each feature is contributing to the accuracy of the overall model (or something in this vein). How do I do this?

My first thought was to rank features by their coefficients, but I suspect this can't be right. If I have two features that are equally useful, but the spread of the first is ten times that of the second, then I'd expect the first to receive a much smaller coefficient than the second. Is there a more reasonable way to evaluate feature importance?
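To illustrate the scaling effect I mean, here is a small Python simulation (the data and coefficients are made up purely for illustration):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 10_000
z1 = rng.normal(size=n)  # latent signal 1
z2 = rng.normal(size=n)  # latent signal 2, equally predictive
y = rng.binomial(1, 1 / (1 + np.exp(-(z1 + z2))))

# Feature 1 carries the same signal as z1 but with ten times the spread.
X = sm.add_constant(np.column_stack([10 * z1, z2]))

fit = sm.Logit(y, X).fit(disp=0)
print(fit.params)  # first coefficient is about one tenth of the second's
```

Both features carry identical information, yet the fitted coefficient on the first is roughly a tenth of the second's, which is why ranking by raw coefficients seems wrong to me.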

Note that I'm not trying to understand how much a small change in the feature affects the probability of the outcome. Rather, I'm trying to understand how valuable each feature is, in terms of making the classifier accurate. Also, my goal is not so much to perform feature selection or construct a model with fewer features, but to try to provide some "explainability" for the learned model, so the classifier isn't just an opaque black-box.

Best Answer

The first thing to note is that logistic regression is not a classifier; the fact that $Y$ is binary has nothing to do with using this maximum likelihood method to actually classify observations. Once you get past that, concentrate on the gold-standard information measure that is a by-product of maximum likelihood: the likelihood ratio $\chi^2$ statistic. You can produce a chart showing the partial contribution of each predictor in terms of its partial $\chi^2$ statistic; these statistics have maximum information/power. You can then use the bootstrap to show how hard it is to pick "winners" and "losers" by getting confidence intervals on the ranks of the predictive information each predictor provides once the other predictors are accounted for. An example is in Section 5.4 of my course notes - click on Handouts.
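As a minimal sketch of that idea in Python with statsmodels (the answer's own charts are not produced this way; the data, column names, and bootstrap settings below are invented for illustration), each predictor's partial likelihood ratio $\chi^2$ is twice the drop in log-likelihood from removing that predictor from the full model:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Toy data: three predictors with different true effects (purely illustrative).
n = 1000
X = pd.DataFrame(rng.normal(size=(n, 3)), columns=["x1", "x2", "x3"])
eta = 1.0 * X["x1"] + 0.5 * X["x2"] + 0.0 * X["x3"]
y = rng.binomial(1, 1 / (1 + np.exp(-eta)))

def loglik(Xd, yd, cols):
    """Log-likelihood of a logistic model on the given predictor columns."""
    return sm.Logit(yd, sm.add_constant(Xd[cols])).fit(disp=0).llf

def partial_chi2s(Xd, yd):
    """Partial LR chi-square per predictor: 2 * (LL_full - LL_without_it)."""
    full = loglik(Xd, yd, list(Xd.columns))
    return np.array([
        2 * (full - loglik(Xd, yd, [c for c in Xd.columns if c != col]))
        for col in Xd.columns
    ])

print(dict(zip(X.columns, partial_chi2s(X, y).round(1))))

# Bootstrap confidence intervals on the *ranks* of those statistics, showing
# how hard it is to declare winners and losers among the predictors.
ranks = []
for _ in range(200):
    idx = rng.integers(0, n, n)
    stats_b = partial_chi2s(X.iloc[idx].reset_index(drop=True), y[idx])
    ranks.append(stats_b.argsort()[::-1].argsort() + 1)  # rank 1 = largest
ranks = np.array(ranks)
for j, col in enumerate(X.columns):
    lo, hi = np.percentile(ranks[:, j], [2.5, 97.5])
    print(f"{col}: rank 95% bootstrap interval [{lo:.0f}, {hi:.0f}]")
```

Wide rank intervals are the point: even when the partial $\chi^2$ values look well separated on the original sample, resampling often reorders them.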

If you have highly correlated features, you can do a "chunk test" to combine their influence. A chart that does this is given in Figure 15.11, where size represents the combined contribution of 4 separate predictors.
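Continuing the sketch above (reusing `X`, `y`, and `loglik`), a chunk test is just a likelihood ratio test that drops the whole group of predictors at once; the grouping of `x2` and `x3` here is hypothetical:

```python
# Chunk test: drop a group of (correlated) predictors together and compare
# log-likelihoods; d.f. equals the number of coefficients dropped.
chunk = ["x2", "x3"]  # hypothetical grouping of related predictors
full = loglik(X, y, list(X.columns))
reduced = loglik(X, y, [c for c in X.columns if c not in chunk])
chunk_chi2 = 2 * (full - reduced)
print(f"chunk {chunk}: LR chi-square = {chunk_chi2:.1f} on {len(chunk)} d.f.")
```

The combined statistic credits the group as a whole, so correlated predictors don't dilute each other's apparent importance the way they do in per-predictor tests.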
