Solved – Relative importance of categorical variable in logistic regression

I would like to rank variables of a logistic regression model on the basis of their predictive importance.

The model has both categorical and continuous variables.

For this purpose, is it okay to assign values such as 1, 2, 3, 4, ... to the categories of a categorical variable, treat it as a continuous variable, standardize it along with the other continuous variables, and then get standardized estimates from a logistic regression fitted on the standardized variables?

If the purpose is to find relative importance of variables of an already built model, is this approach alright?

Best Answer

While you can mess around with pseudo-R² measures, I have never found them to be very informative or useful in a logit model. You also run into other problems when you compare coefficients across logit models with different covariates, because logit coefficients are only identified up to a scale factor (I don't have an immediate reference, but if you Google or search CV for "logit scaling factor" you should get an idea).

Here are a couple of alternative approaches:

Estimate average marginal effects. You can use standardized continuous variables, but comparing continuous to categorical variables is inherently difficult because they are not on comparable scales. The only way I would feel comfortable treating a categorical variable as continuous is if you test that its effect is linear in the log odds (that an increase from 1 to 2 is comparable to one from 2 to 3, etc.). Even then, you may gloss over information when you make it continuous. I would run the model both ways, look at average marginal effects and predicted probabilities with the categorical variable set at the same levels in the categorical and the continuous case, and see how the results differ. Also plot the predicted probabilities to see how they change across different levels of the variables.

If you are unable to convince yourself and others that your categorical variables can be treated as continuous, then you have a harder task. You could estimate predicted probabilities at selected quantiles (for example the deciles) of the continuous variables, and compare them to the predicted probabilities at each level of the categorical variable.
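
To make the average-marginal-effects idea concrete, here is a minimal Python sketch on simulated data. Everything in it is an assumption for illustration: the data, the "true" coefficients, and the helper names `design`, `fit_logit` and `avg_pred` are made up, and the logit is fitted with a hand-rolled Newton-Raphson loop rather than a statistics library. The continuous variable gets the average derivative of the predicted probability; the dummy-coded categorical variable gets the average change in predicted probability relative to its reference level.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=n)              # continuous predictor, roughly standardized
g = rng.integers(0, 3, size=n)      # categorical predictor with levels 0, 1, 2

def design(x, g):
    """Intercept, x, and dummies for levels 1 and 2 (level 0 is the reference)."""
    return np.column_stack([np.ones(len(x)), x, g == 1, g == 2]).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Made-up coefficients used only to simulate an outcome.
true_beta = np.array([-0.5, 1.0, 0.8, 1.6])
y = (rng.random(n) < sigmoid(design(x, g) @ true_beta)).astype(float)

def fit_logit(X, y, n_iter=25):
    """Maximum-likelihood logit fit by Newton-Raphson."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        w = p * (1 - p)
        beta += np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y - p))
    return beta

beta = fit_logit(design(x, g), y)
p_hat = sigmoid(design(x, g) @ beta)

# Continuous variable: average of dP/dx = beta_x * p * (1 - p).
ame_x = np.mean(beta[1] * p_hat * (1 - p_hat))

def avg_pred(level):
    """Mean predicted probability if every observation had g == level."""
    return sigmoid(design(x, np.full(n, level)) @ beta).mean()

# Categorical variable: average change in probability versus the reference level.
ame_g = {k: avg_pred(k) - avg_pred(0) for k in (1, 2)}

print(f"AME of x (per unit, about 1 SD): {ame_x:.3f}")
for k, v in ame_g.items():
    print(f"AME of g={k} vs g=0: {v:.3f}")
```

Both kinds of effect end up on the probability scale, which is what makes them easier to put side by side than raw coefficients, though a one-SD change and a change of category are still different kinds of "unit".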

Look at the predictive ability of your model. Look into the various metrics of specificity, sensitivity, area under the ROC curve, etc., and see how your prediction changes based on the different variables.
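
One way to read "see how your prediction changes based on the different variables" is to refit the model with each variable left out and look at the drop in AUC. The sketch below does that on simulated data; the data, effect sizes and helper names (`fitted_probs`, `auc`) are assumptions for illustration, and the AUC is computed in-sample for brevity, where a held-out set would be the safer choice.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x1 = rng.normal(size=n)             # continuous, strong effect (made up)
x2 = rng.normal(size=n)             # continuous, weak effect (made up)
g = rng.integers(0, 3, size=n)      # categorical with levels 0, 1, 2

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

true_logit = -0.5 + 1.5 * x1 + 0.3 * x2 + 0.7 * (g == 1) + 1.4 * (g == 2)
y = (rng.random(n) < sigmoid(true_logit)).astype(float)

# One block of columns per variable, so the dummies of a categorical
# variable are always dropped together.
blocks = {
    "x1": x1[:, None],
    "x2": x2[:, None],
    "g": np.column_stack([g == 1, g == 2]).astype(float),
}

def fitted_probs(names):
    """Fit a logit on the named variables (Newton-Raphson); return fitted probabilities."""
    X = np.column_stack([np.ones(n)] + [blocks[k] for k in names])
    beta = np.zeros(X.shape[1])
    for _ in range(25):
        p = sigmoid(X @ beta)
        w = p * (1 - p)
        beta += np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y - p))
    return sigmoid(X @ beta)

def auc(score, y):
    """P(score of a positive > score of a negative), counting ties as one half."""
    diff = score[y == 1][:, None] - score[y == 0][None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

full_auc = auc(fitted_probs(list(blocks)), y)
auc_drop = {
    k: full_auc - auc(fitted_probs([m for m in blocks if m != k]), y)
    for k in blocks
}

print(f"AUC of full model: {full_auc:.3f}")
for k, d in sorted(auc_drop.items(), key=lambda kv: -kv[1]):
    print(f"AUC lost without {k}: {d:.3f}")
```

Because each variable is dropped as a whole block, a categorical variable with several dummies is scored on the same footing as a single continuous column.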

In the end, because there isn't a direct way to do this in a logit model, I would approach this in more than one way and see whether the methods triangulate. If they do, you're golden. If not, the story is more nuanced and will take more thinking.

Related Solutions

GLMNET – Determining Variable Importance in Logistic Regression

As far as I know, glmnet does not calculate the standard errors of regression coefficients (since it fits model parameters using cyclic coordinate descent). So, if you need standardized regression coefficients, you will need to use some other method (e.g. glm).

Having said that, if the explanatory variables are standardized before the fit and glmnet is called with "standardize=FALSE", then the coefficients of the less important variables will be smaller in magnitude than those of the more important ones, so you could rank them just by their magnitude. This becomes even more pronounced with a non-trivial amount of shrinkage (i.e. non-zero lambda).
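
The "standardize first, then rank by magnitude" idea can be sketched in Python as below. This is not glmnet: it swaps in a plain ridge (L2) penalty fitted by Newton-Raphson, and the penalty `lam` is a hypothetical parameter that is not on glmnet's lambda scale. The predictors "age", "rate" and "noise" and their effects are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1500
names = ["age", "rate", "noise"]        # hypothetical predictors
raw = np.column_stack([
    rng.normal(50, 10, n),              # SD 10
    rng.normal(0, 0.1, n),              # SD 0.1
    rng.normal(0, 1, n),                # unrelated to the outcome
])
# Made-up truth: per-SD effects are 1.0 (age), 0.5 (rate), 0 (noise).
true_logit = -5.0 + 0.1 * raw[:, 0] + 5.0 * raw[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-true_logit))).astype(float)

# Standardize each column to mean 0 and SD 1 before fitting.
sd = raw.std(axis=0)
Z = (raw - raw.mean(axis=0)) / sd

def fit_ridge_logit(Z, y, lam=1.0, n_iter=50):
    """Ridge-penalized logit by Newton-Raphson; the intercept is not penalized."""
    X = np.column_stack([np.ones(len(Z)), Z])
    pen = lam * np.eye(X.shape[1])
    pen[0, 0] = 0.0
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-X @ beta))
        w = p * (1 - p)
        grad = X.T @ (y - p) - pen @ beta
        hess = X.T @ (X * w[:, None]) + pen
        beta += np.linalg.solve(hess, grad)
    return beta

beta_std = fit_ridge_logit(Z, y)[1:]    # per-SD coefficients, intercept dropped
beta_raw = beta_std / sd                # the same fit expressed in original units
ranking = [names[i] for i in np.argsort(-np.abs(beta_std))]

for name, bs, br in zip(names, beta_std, beta_raw):
    print(f"{name:>5}: per-SD {bs:+.3f}   per-raw-unit {br:+.3f}")
print("ranking by |standardized coefficient|:", ranking)
```

The raw-unit column shows why the standardization step matters: "rate" has by far the largest raw coefficient only because it is measured on a small scale.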

Hope this helps.

Solved – How to quantify the Relative Variable Importance in Logistic Regression in terms of p

For linear models you can use the absolute value of the t-statistics for each model parameter.

Also, you can use something like a random forest and get a very nice list of feature importances.

If you are using R, check out (http://caret.r-forge.r-project.org/varimp.html); if you are using Python, check out (http://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html#example-ensemble-plot-forest-importances-py).

EDIT:

Since logit has no direct way to do this, you can use an ROC curve for each predictor.

For classification, ROC curve analysis is conducted on each predictor. For two class problems, a series of cutoffs is applied to the predictor data to predict the class. The sensitivity and specificity are computed for each cutoff and the ROC curve is computed. The trapezoidal rule is used to compute the area under the ROC curve. This area is used as the measure of variable importance.

An example of how this works in R is:

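library(caret)

# Toy data: binary outcome y and three binary predictors
mydata <- data.frame(y  = c(1,0,0,0,1,1),
                     x1 = c(1,1,0,1,0,0),
                     x2 = c(1,1,1,0,0,1),
                     x3 = c(1,0,1,1,0,0))

fit <- glm(y ~ x1 + x2 + x3, data = mydata, family = binomial())
summary(fit)

# For a glm object, varImp() reports the absolute value of each
# coefficient's test statistic, not the ROC-based measure
varImp(fit, scale = FALSE)

# Per-predictor ROC importance, independent of the fitted model
# (the outcome must be a factor for classification)
filterVarImp(x = mydata[, c("x1", "x2", "x3")], y = factor(mydata$y))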
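
For anyone who wants to see the per-predictor ROC calculation spelled out, here is a minimal Python sketch of the cutoff-and-trapezoid procedure described above, run on the same toy data. It is an illustration of the idea and not caret's implementation; the function name `roc_auc` is made up, and folding areas below 0.5 up to `1 - area` is my own choice so that a predictor that separates the classes in the opposite direction is not scored as useless.

```python
import numpy as np

def roc_auc(x, y):
    """Area under the ROC curve when predictor x itself is used as the score."""
    cutoffs = np.unique(x)[::-1]            # every observed value, high to low
    fpr, tpr = [0.0], [0.0]                 # the curve starts at (0, 0)
    for c in cutoffs:
        predicted_pos = x >= c
        tpr.append(predicted_pos[y == 1].mean())    # sensitivity
        fpr.append(predicted_pos[y == 0].mean())    # 1 - specificity
    fpr, tpr = np.array(fpr), np.array(tpr)
    # Trapezoidal rule over consecutive points of the curve.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))

y = np.array([1, 0, 0, 0, 1, 1])
predictors = {
    "x1": np.array([1, 1, 0, 1, 0, 0]),
    "x2": np.array([1, 1, 1, 0, 0, 1]),
    "x3": np.array([1, 0, 1, 1, 0, 0]),
}

importance = {}
for name, x in predictors.items():
    area = roc_auc(x, y)
    importance[name] = max(area, 1 - area)  # direction-free importance

for name, value in importance.items():
    print(f"{name}: {value:.3f}")
```

On this toy data x1 and x3 come out at about 0.667 and x2 at 0.5, i.e. x2 on its own does no better than chance.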
