Solved – Regression using dummy variables

categorical data, regression

I am working on a credit scoring model, and we decided to use dummy variables in the regression. We create the dummy variables as follows.

For each numeric predictor:

  1. We create 10 equal-size bins by default, examine the weight of evidence (WOE) for each bin, and merge adjacent bins if their WOEs are similar (meaning their risk signals are similar).

  2. Then we create dummy variables from this numeric variable according to the final bins. Say we end up with 3 bins for this variable: we create 3 dummy variables representing the bins, then pick one of them as the reference and leave it out to avoid perfect multicollinearity.

  3. Then we run logistic regression.
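As a rough illustration of steps 1 and 2, here is a minimal sketch in plain Python. Everything in it is an assumption made for the example, not our production code: the function names, the made-up counts, the 5 starting bins (instead of 10, to keep it short), the tolerance of 0.3, the greedy merge rule, and the WOE sign convention ln(share of goods / share of bads), which some teams flip.

```python
import math

def woe_per_bin(goods, bads):
    """WOE of each bin, here defined as ln(share of goods / share of bads)."""
    total_good, total_bad = sum(goods), sum(bads)
    return [math.log((g / total_good) / (b / total_bad))
            for g, b in zip(goods, bads)]

def merge_similar_bins(goods, bads, tol):
    """Greedily merge the closest pair of adjacent bins until every
    adjacent WOE gap is at least `tol` (hypothetical merge rule)."""
    goods, bads = list(goods), list(bads)
    while len(goods) > 1:
        woe = woe_per_bin(goods, bads)
        gaps = [abs(woe[i + 1] - woe[i]) for i in range(len(woe) - 1)]
        i = min(range(len(gaps)), key=gaps.__getitem__)
        if gaps[i] >= tol:
            break
        goods[i:i + 2] = [goods[i] + goods[i + 1]]
        bads[i:i + 2] = [bads[i] + bads[i + 1]]
    return goods, bads

def dummy_row(bin_index, n_bins, reference=0):
    """0/1 indicators for every bin except the reference bin."""
    return [1 if bin_index == k else 0 for k in range(n_bins) if k != reference]

# Made-up good/bad counts per initial bin (no bin may have a zero count).
goods = [80, 82, 90, 95, 96]
bads = [20, 18, 10, 5, 4]

merged_goods, merged_bads = merge_similar_bins(goods, bads, tol=0.3)
print(merged_goods, merged_bads)            # counts in the final bins
print(dummy_row(2, len(merged_goods)))      # an observation in the last bin
print(dummy_row(0, len(merged_goods)))      # an observation in the reference bin
```

An observation in the reference bin gets all zeros, which is why only (number of bins - 1) dummies enter the logistic regression in step 3.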

The above is a high-level description of how we do the modelling. My question is not specific to this procedure; it is a more general question about regression with dummy variables. When we have many variables (~20) in the regression, the number of dummy variables is even larger, and the regression output often reports some dummy variables as insignificant. How do you treat insignificant coefficient estimates? What if a variable is significant at the variable level but only some of its dummy variables are significant?

Thank you.

Best Answer

I am not a big fan of converting a continuous variable into multiple dummy variables, although I gather the binning procedure is considered standard practice in scorecard development.

Regarding dummy variable insignificance: when you add dummy variables to a regression, the omitted group acts as the reference group, and each dummy coefficient compares its group with that reference. When a variable has a nonlinear relationship (e.g. quadratic) with the log odds, some dummy variables may come out insignificant simply because the effect of their group is close to that of the reference group. My suggestion is to look at the pattern of log odds across the bins before merging. Depending on the pattern, you can either make fewer final bins or change the reference group. I know this is a bit abstract, but I cannot be more specific without knowing the case.
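To make the reference-group point concrete, here is a small sketch with made-up counts for three bins whose log odds follow a hump-shaped (roughly quadratic) pattern. It relies on one simplifying assumption: the model contains only the dummies of this single variable, in which case each dummy coefficient equals the difference in log odds between its bin and the reference bin, with the usual Wald standard error of a log odds ratio. With other predictors in the model the numbers will differ, so treat this as an illustration rather than a substitute for fitting the model. The function name and data are hypothetical.

```python
import math

def contrasts_vs_reference(goods, bads, reference):
    """Dummy coefficient and Wald z for each non-reference bin in a
    logistic regression of 'good' on the dummies of one binned variable."""
    log_odds = [math.log(g / b) for g, b in zip(goods, bads)]
    result = {}
    for k in range(len(goods)):
        if k == reference:
            continue
        coef = log_odds[k] - log_odds[reference]
        # Standard error of a log odds ratio from the 2x2 table of counts.
        se = math.sqrt(1 / goods[k] + 1 / bads[k]
                       + 1 / goods[reference] + 1 / bads[reference])
        result[k] = (coef, coef / se)
    return result

# Made-up counts: the middle bin is the safest, the outer bins look alike.
goods = [800, 900, 810]
bads = [200, 100, 190]

ref_first = contrasts_vs_reference(goods, bads, reference=0)
ref_middle = contrasts_vs_reference(goods, bads, reference=1)

for name, res in [("reference = bin 0", ref_first), ("reference = bin 1", ref_middle)]:
    for k, (coef, z) in res.items():
        print(f"{name}: bin {k} coef={coef:.3f} z={z:.2f}")
```

With bin 0 as the reference, the dummy for bin 2 is insignificant because the two bins carry almost the same risk; with the middle bin as the reference, both remaining dummies are clearly significant. The fitted probabilities are the same either way, only the contrasts being tested change.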

You could also drop the insignificant dummy variable. Doing so effectively merges the group associated with the dropped dummy into the reference group. This may not be appropriate if merging the reference group with the (insignificant) dummy group doesn't make business sense.
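Alongside the business-sense check, one possible statistical check (my addition, not part of the procedure described in the question) is a likelihood-ratio test comparing the fit before and after the merge. The sketch below again assumes a model containing only this one binned variable, so the fitted probability in each bin is just its observed good rate; the counts and function names are made up, and 3.841 is the 5% critical value of a chi-square distribution with 1 degree of freedom.

```python
import math

CHI2_CRIT_DF1 = 3.841  # 5% critical value, chi-square with 1 degree of freedom

def log_likelihood(goods, bads):
    """Maximised binomial log-likelihood when each bin gets its own rate."""
    ll = 0.0
    for g, b in zip(goods, bads):
        n = g + b
        ll += g * math.log(g / n) + b * math.log(b / n)
    return ll

def lr_stat_for_merge(goods, bads, i, j):
    """Likelihood-ratio statistic for pooling bins i and j into one group
    (equivalent to dropping the dummy of j when i is the reference)."""
    keep = [k for k in range(len(goods)) if k not in (i, j)]
    merged_goods = [goods[k] for k in keep] + [goods[i] + goods[j]]
    merged_bads = [bads[k] for k in keep] + [bads[i] + bads[j]]
    return 2 * (log_likelihood(goods, bads)
                - log_likelihood(merged_goods, merged_bads))

# Made-up counts: bins 0 and 2 have similar risk, bin 1 is clearly safer.
goods = [800, 900, 810]
bads = [200, 100, 190]

lr_similar = lr_stat_for_merge(goods, bads, 0, 2)
lr_different = lr_stat_for_merge(goods, bads, 0, 1)
print(f"merge bins 0 and 2: LR={lr_similar:.3f}, loses fit: {lr_similar > CHI2_CRIT_DF1}")
print(f"merge bins 0 and 1: LR={lr_different:.3f}, loses fit: {lr_different > CHI2_CRIT_DF1}")
```

A small statistic suggests the merge costs little in fit, but it says nothing about whether the combined group is meaningful for the business, so the two checks complement each other.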