Solved – Relative importance of categorical variable in logistic regression


I would like to rank variables of a logistic regression model on the basis of their predictive importance.

The model has both categorical and continuous variables.

For this purpose, is it okay to assign values such as 1, 2, 3, 4, ... to the categories of a categorical variable and treat it as a continuous variable? I would then standardize it along with the other continuous variables and fit the logistic regression on the standardized inputs to get standardized estimates.

If the purpose is to find relative importance of variables of an already built model, is this approach alright?

Best Answer

While you can mess around with pseudo-R2s, I have never found them to be very informative or useful in a logit model. You also run into other problems when you compare coefficients across logit models with different sets of covariates, because logit coefficients are only identified up to a scale factor (I don't have an immediate reference, but if you Google or search Cross Validated for "logit scaling factor" you should get an idea).

Here are a couple of alternative approaches:

  • Estimate average marginal effects. You can use standardized continuous variables, but comparing continuous to categorical variables is inherently difficult because they are not on comparable scales. The only way I would feel comfortable treating a categorical variable as continuous is if you test the proportional odds assumption (that is, that the log-odds change by the same amount from 1 to 2 as from 2 to 3, etc.) and it holds. Even then, you may gloss over information when you make it continuous. I would run the model both ways, look at average marginal effects and predicted probabilities with the variable set at the same levels in the categorical and continuous versions, and see how the results differ. Also plot the predicted probabilities to see how they change across the levels of each variable.

If you are unable to convince yourself and others that your categorical variables can be treated as continuous, then you have a harder task. You could estimate predicted probabilities at quantiles (for example, deciles) of the continuous variables, and compare them to the predicted probabilities at each level of the categorical variable.
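To make the "run the model both ways" comparison concrete, here is a minimal sketch in Python using only NumPy. Everything in it is an assumption for illustration: the data are simulated, the names (`fit_logit`, `X_cat`, `X_num`, `ame_z`, `ame_g`) are hypothetical, and the logit is fitted with a small hand-written Newton-Raphson loop just to keep the example self-contained. In practice you would use your own fitted model and whatever marginal-effects routine your software provides.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logit(X, y, n_iter=50, tol=1e-10):
    """Maximum-likelihood logit fit by Newton-Raphson (X already has an intercept column)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        grad = X.T @ (y - p)                          # score vector
        hess = X.T @ (X * (p * (1.0 - p))[:, None])   # information matrix
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Simulated data, purely hypothetical: a continuous x and a 3-level factor g
# whose true effect is deliberately NOT linear in the level number.
rng = np.random.default_rng(0)
n = 5000
x = rng.normal(size=n)
g = rng.integers(0, 3, size=n)                        # levels 0, 1, 2
eta = -0.5 + 0.8 * x + np.array([0.0, 1.5, 0.5])[g]
y = (rng.random(n) < sigmoid(eta)).astype(float)
z = (x - x.mean()) / x.std()                          # standardized x

def X_cat(level):
    """Dummy coding (level 0 is the reference); level is an array or one fixed level."""
    lv = np.broadcast_to(level, z.shape)
    return np.column_stack([np.ones(n), z, lv == 1, lv == 2]).astype(float)

def X_num(level):
    """Same model, but with the level number entered as a single numeric column."""
    lv = np.broadcast_to(level, z.shape)
    return np.column_stack([np.ones(n), z, lv]).astype(float)

b_cat = fit_logit(X_cat(g), y)
b_num = fit_logit(X_num(g), y)

# AME of standardized x: sample average of dP/dz = p * (1 - p) * beta_z.
p_hat = sigmoid(X_cat(g) @ b_cat)
ame_z = float(np.mean(p_hat * (1.0 - p_hat)) * b_cat[1])

# Average predicted probability with everyone set to each level, under both codings.
probs_cat = [float(np.mean(sigmoid(X_cat(k) @ b_cat))) for k in (0, 1, 2)]
probs_num = [float(np.mean(sigmoid(X_num(k) @ b_num))) for k in (0, 1, 2)]

# AME of each level: discrete change in probability relative to the reference level.
ame_g = {k: probs_cat[k] - probs_cat[0] for k in (1, 2)}

print("AME per SD of x:", round(ame_z, 3))
print("AME of g=1 and g=2 vs g=0:", {k: round(v, 3) for k, v in ame_g.items()})
print("P(y=1) by level, categorical coding:", np.round(probs_cat, 3))
print("P(y=1) by level, numeric coding:    ", np.round(probs_num, 3))
```

The thing to look at is the last two lines of output. The numeric coding can only produce predicted probabilities that move in one direction across levels, so if the dummy-coded model shows a different pattern, the 1, 2, 3 coding is hiding information. Note also that the AME for x is per standard deviation while the AMEs for g are per level switch, which is exactly the scale mismatch that makes a single ranking hard.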

  • Look at the predictive ability of your model. Look at metrics such as sensitivity, specificity, and area under the ROC curve, and see how they change when individual variables are included or left out.
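One way to read the predictive-ability suggestion is a drop-one-variable comparison: refit the model without each variable and see how much the held-out AUC falls. The sketch below is just that reading, not a standard recipe from any particular package. The data are simulated, the helper names (`fit_logit`, `auc`, `holdout_auc`, `groups`) are hypothetical, and the fit is a hand-written Newton-Raphson loop so the example needs only NumPy. The dummies of the categorical variable are dropped together, so it is scored as a single variable with no need to recode it as 1, 2, 3.

```python
import numpy as np

def fit_logit(X, y, n_iter=50, tol=1e-10):
    """Maximum-likelihood logit fit by Newton-Raphson (X already has an intercept column)."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        grad = X.T @ (y - p)
        hess = X.T @ (X * (p * (1.0 - p))[:, None])
        step = np.linalg.solve(hess, grad)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def auc(y, score):
    """AUC via the rank-sum identity, with average ranks so tied scores count as 1/2."""
    # Rounding guards against float noise splitting scores that should be tied.
    _, inv, counts = np.unique(np.round(score, 10), return_inverse=True, return_counts=True)
    avg_rank = np.cumsum(counts) - (counts - 1) / 2.0
    ranks = avg_rank[inv]
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    return float((ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2.0) / (n_pos * n_neg))

# Simulated data, purely hypothetical: continuous x and a 3-level factor g.
rng = np.random.default_rng(1)
n = 10000
x = rng.normal(size=n)
g = rng.integers(0, 3, size=n)
eta = -0.5 + 0.8 * x + np.array([0.0, 1.5, 0.5])[g]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-eta))).astype(float)

# Columns: 0 = intercept, 1 = x, 2 and 3 = dummies for g (level 0 is the reference).
X = np.column_stack([np.ones(n), x, g == 1, g == 2]).astype(float)
train = np.arange(n) < n // 2
test = ~train

def holdout_auc(cols):
    """Fit on the training half using the given columns, score the test half."""
    beta = fit_logit(X[train][:, cols], y[train])
    # The linear predictor ranks cases the same way the probability does.
    return auc(y[test], X[test][:, cols] @ beta)

all_cols = [0, 1, 2, 3]
groups = {"x": [1], "g": [2, 3]}   # each variable maps to the columns it owns

auc_full = holdout_auc(all_cols)
auc_drop = {
    name: auc_full - holdout_auc([c for c in all_cols if c not in cols])
    for name, cols in groups.items()
}

print("Held-out AUC, full model:", round(auc_full, 3))
print("Loss in AUC when each variable is dropped:",
      {k: round(v, 3) for k, v in auc_drop.items()})
```

A larger drop means the model leans on that variable more for discrimination. This only ranks variables by their contribution to AUC; it says nothing about calibration, and correlated predictors can mask each other, so treat it as one view among several.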

In the end, because there isn't a direct way to do this in a logit model, I would approach this in more than one way and see whether the methods all point to the same ranking. If they do, you're golden. If not, the story is more nuanced and will take more thinking.