After applying the weight of evidence (WoE) and information value (IV) screening to my 40-odd variables, I am left with 8 that are highly or moderately significant.
One of these independent variables is categorical with 60+ categories. It is highly predictive, so please suggest how I should use it in the model.
When I add this variable to the model, the residual deviance and AIC decrease, but the other predictors lose their predictive power. When I fit the model without it, the other predictors regain their significance.
What could be the reason? Is this variable collinear with some other predictor?
Please see the syntax (without that categorical variable):
m1.logit <- glm(survey ~ region + know + repS + und + case_status, family = binomial(logit), data = a1)
m1.logit
summary(m1.logit)
Call:
glm(formula = survey ~ region + know + repS + und + case_status,
family = binomial(logit), data = a1)
Deviance Residuals:
    Min      1Q  Median      3Q     Max
 -2.579   0.271   0.290   0.336   2.895
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 2553.5 on 2540 degrees of freedom
Residual deviance: 1287.7 on 2526 degrees of freedom
AIC: 1318
Number of Fisher Scoring iterations: 13
I also ran an ANOVA to analyze the table of deviance:
anova(m1.logit, test = "Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: survey
Terms added sequentially (first to last)
              Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL                           2540       2554
region         5       13      2535       2540    0.022 *
know           1      507      2534       2033  < 2e-16 ***
repS           1      715      2533       1319  < 2e-16 ***
und            1        3      2532       1316    0.109
case_status    6       28      2526       1288 0.000078 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
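To check whether the high-cardinality variable is simply duplicating information already carried by another predictor, one option is a nested-model likelihood-ratio test plus a chi-squared test of association. This is a sketch, not code from the question: the name `big_cat` is a hypothetical stand-in for the 60+-category predictor.

```r
# Sketch: fit the model with and without the 60+-category variable.
# 'big_cat' is a hypothetical name for that predictor.
m1 <- glm(survey ~ region + know + repS + und + case_status,
          family = binomial(logit), data = a1)
m2 <- glm(survey ~ region + know + repS + und + case_status + big_cat,
          family = binomial(logit), data = a1)

# Likelihood-ratio test: does big_cat add significant fit on top
# of the other predictors?
anova(m1, m2, test = "Chisq")

# Rough collinearity check for two categorical predictors: a highly
# significant chi-squared test suggests they carry overlapping information.
chisq.test(table(a1$big_cat, a1$case_status))
```

If the other coefficients shrink toward zero and lose significance as soon as `big_cat` enters, while the association test is strongly significant, that is consistent with `big_cat` absorbing the same signal.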
Please suggest how to deal with this predictor variable with 60+ categories.
Best Answer
You can bin the categories by event rate, reducing them to, say, 5 to 10 bins to make the variable more stable. This requires a bivariate analysis of each category against the target class, as well as checking the proportion of the population falling in each category. If some of your predictors become non-significant after binning, you can remove them. You may need several iterations to arrive at the final model, depending on your objective.
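The event-rate binning described above can be sketched in R as follows. This is an illustrative sketch, not the answerer's exact code: `big_cat` is a hypothetical name for the 60+-category variable, and the choice of 5 equal-width bins is arbitrary.

```r
# Sketch: collapse 60+ categories into ~5 bins by event rate.
# Event rate = mean of the binary target within each category.
rate <- tapply(a1$survey, a1$big_cat, mean)

# Also inspect the share of the population in each category;
# very small categories give unstable event-rate estimates.
prop <- table(a1$big_cat) / nrow(a1)

# Cut the per-category event rates into 5 bins, then map each
# original category to its bin label.
bins <- cut(rate, breaks = 5, labels = paste0("bin", 1:5))
a1$big_cat_binned <- bins[match(a1$big_cat, names(rate))]

# Refit with the binned (5-level) variable in place of the raw one.
m3 <- glm(survey ~ region + know + repS + und + case_status + big_cat_binned,
          family = binomial(logit), data = a1)
```

In practice you would merge sparse categories into neighbouring bins by hand and validate that the binned variable keeps most of its information value before finalizing the model.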