Solved – Logistic Regression in R with multi level categorical variable

logistic, r, regression, statistical-significance

After screening with weight of evidence (WoE) and information value (IV), I am left with 8 of the 40-odd variables, all of which are highly or moderately significant.
One of the independent variables is categorical and has 60+ categories. It is a highly predictive variable, so please suggest how I should
use it in the model.
When I add this variable to the model, my null deviance and AIC decrease, and the other predictors lose their predictive power.
In another model without this variable, my null deviance and AIC improve.
What could be the reason? Is this variable collinear with some other predictor?

Please see the syntax (without that categorical variable):

m1.logit <- glm(survey ~ region + know + repS + und + case_status, family = binomial(logit), data = a1)
m1.logit  
summary(m1.logit)

Call:  
glm(formula = survey ~ region + know + repS + und + case_status, 
    family = binomial(logit), data = a1)  

Deviance Residuals:
    Min      1Q  Median      3Q     Max
 -2.579   0.271   0.290   0.336   2.895

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)
    Null deviance: 2553.5  on 2540  degrees of freedom
Residual deviance: 1287.7  on 2526  degrees of freedom
AIC: 1318    
Number of Fisher Scoring iterations: 13

I also ran an ANOVA test to examine the analysis of deviance table:

anova(m1.logit, test="Chisq")   
Analysis of Deviance Table  

Model: binomial, link: logit  
Response: survey  

Terms added sequentially (first to last)  

             Df Deviance Resid. Df Resid. Dev             Pr(>Chi)     
 NULL                         2540       2554                           
 region       5       13      2535       2540                0.022 *    
 know         1      507      2534       2033 < 0.0000000000000002 ***  
 repS         1      715      2533       1319 < 0.0000000000000002 ***  
 und          1        3      2532       1316                0.109        
 case_status  6       28      2526       1288             0.000078 ***    

 Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1  

Please suggest how to deal with this predictor variable with 50+ categories.

Best Answer

You can group the categories into a small number of bins based on event rate, reducing them to, say, 5-10 bins, which makes the estimates more stable. This requires a bivariate analysis of the variable against the target class, plus a look at the share of the population in each category. If some of your predictors become non-significant after binning, you can remove them. You may need several iterations to arrive at a final model, depending on your objective.
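
A minimal sketch of binning by event rate in R. It runs on simulated data because the original data set `a1` is not available, so the column name `big_cat`, the 60 made-up levels, and the choice of 5 quantile-based bins are all assumptions for illustration. It is one possible way to do the grouping, not the only one.

```r
# Simulated stand-in for the real data: every name and value below is hypothetical.
set.seed(42)
n <- 3000
level_names <- sprintf("cat%02d", 1:60)
true_rate <- runif(60, 0.2, 0.9)                 # made-up per-category event probability
names(true_rate) <- level_names
big_cat <- sample(level_names, n, replace = TRUE)
d <- data.frame(survey = rbinom(n, 1, true_rate[big_cat]),
                big_cat = big_cat,
                stringsAsFactors = FALSE)

# Bivariate analysis: observed event rate and population share per category.
rate  <- vapply(split(d$survey, d$big_cat), mean, numeric(1))
share <- vapply(split(d$survey, d$big_cat), length, integer(1)) / nrow(d)
head(data.frame(rate = rate, share = share)[order(rate), ])

# Group categories into bins using quantiles of the observed event rate.
n_bins <- 5                                      # assumed; the answer suggests 5-10
breaks <- unique(quantile(rate, probs = seq(0, 1, length.out = n_bins + 1)))
bin_of_level <- cut(rate, breaks = breaks, include.lowest = TRUE,
                    labels = paste0("bin", seq_len(length(breaks) - 1)))

# Map each row's category to its bin.
d$big_cat_bin <- bin_of_level[match(d$big_cat, names(rate))]

# Fit with the binned variable and, for comparison, with all 60 raw levels.
m_binned <- glm(survey ~ big_cat_bin, family = binomial(link = "logit"), data = d)
m_full   <- glm(survey ~ big_cat,     family = binomial(link = "logit"), data = d)
AIC(m_binned, m_full)
```

Because the bins are derived from the outcome itself, it is safer to define them on a training sample and apply the same mapping to held-out data. Categories with a very small population share may deserve their own "other" bin, since their observed event rates are noisy.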
