To get started, let's look at an example of what your regression output might look like.
| Pred | Estimate | StdErr | t     | p       | sig |
|------|----------|--------|-------|---------|-----|
| A1   | 1.0      | 0.2    | 5.00  | 0.0005  | *   |
| A2   | -1.9     | 2.0    | -0.95 | 0.1850  |     |
| A3   | 4        | 0.1    | 40.0  | <0.0001 | *   |
| d1   | -2       | 1.1    | -1.81 | 0.0539  |     |
| d2   | 0.5      | 0.1    | 5.00  | 0.0005  | *   |
Of special interest to you is the sig column, which has an * if and only if the p-value for the corresponding variable is statistically significant given all the other variables in the model.
> When I estimate the model with all the variables included, some of the independent variables are not significant, but when I add just one of the dummy variables, all of the independent variables are significant.
Think of each variable as carrying some information about the response Y, and of a variable as significant if it carries "enough" of that information, in some sense. You can read an * in the table as meaning that if we dropped that one variable and kept the others, we would lose a significant amount of information. The lack of an * then means that we could drop that variable and, as long as we kept the rest, we wouldn't lose too much information.
Now let's say you dropped d1 because it wasn't significant, and your table now looks like this:
| Pred | Estimate | StdErr | t    | p       | sig |
|------|----------|--------|------|---------|-----|
| A1   | 1.1      | 0.2    | 5.50 | 0.0003  | *   |
| A2   | -4.1     | 1.2    | -3.42 | 0.0045 | *   |
| A3   | 4.2      | 0.1    | 42.0 | <0.0001 | *   |
| d2   | 0.4      | 0.1    | 4.00 | 0.0020  | *   |
Let's pretend A2 is weight and d1 is sex. It might be that weight and sex carry much of the same information about Y, especially since they are correlated. So when both weight (A2) and sex (d1) were in the model, each was somewhat redundant given the other, and we could drop either one as long as we kept the rest. Once we've dropped sex, all the information that was shared between weight and sex is carried only by weight, and if we now drop weight, we will lose that information. Thus weight (A2) has become significant.
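Here is a minimal simulated sketch of this effect (not your data; the "weight"/"sex" names and all the numbers are invented for illustration). Two correlated predictors share information about Y, so the t statistic for weight is much smaller when sex is also in the model than when weight stands alone:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
sex = rng.integers(0, 2, n).astype(float)        # dummy variable
weight = 60 + 15 * sex + rng.normal(0, 3, n)     # correlated with sex
y = 2.0 + 0.05 * weight + 0.8 * sex + rng.normal(0, 1, n)

def ols_t(predictors, y):
    """OLS with intercept: return coefficients and their t statistics."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta, beta / se

# Full model: weight and sex compete for the shared information.
b_full, t_full = ols_t([weight, sex], y)
# Reduced model: weight alone now carries all of that information.
b_red, t_red = ols_t([weight], y)

print("t(weight), full model:   ", round(t_full[1], 2))
print("t(weight), reduced model:", round(t_red[1], 2))
```

Dropping sex makes weight's t statistic jump, because weight is now the only carrier of the information the two predictors shared.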
> And, when I estimate the model in the form of f(A1, A2, A3, d1), I get different coefficients for the independent variables in comparison with the ones for f(A1, A2, A3, d2).
Recall that the regression model looked like this:
$$
\hat{\bar{Y}}_i = \hat{\beta}_0 + \hat{\beta}_1 A_{1,i} + \hat{\beta}_2 A_{2,i} + \hat{\beta}_3 A_{3,i} + \hat{\beta}_4 d_{1,i} + \hat{\beta}_5 d_{2,i}.
$$
Now once we've dropped d1, it looks like this:
$$
\hat{\bar{Y}}_i = \hat{\beta}_0 + \hat{\beta}_1 A_{1,i} + \hat{\beta}_2 A_{2,i} + \hat{\beta}_3 A_{3,i} + \hat{\beta}_5 d_{2,i}.
$$
If we kept the estimates $\hat{\beta}_p$ the same, each $\hat{\bar{Y}}_i$ would now be decreased by $\hat{\beta}_4 d_{1,i}$, which is the difference between the right-hand sides of the two equations. That doesn't really make sense, though, so the coefficient estimates have to change when the model is refit. The rest of the changes in the table follow from those new estimates.
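A quick simulated sketch of this (again invented data; the names A1 and d1 just mirror the table). When the dropped variable d1 is correlated with A1, refitting without d1 shifts the estimate of A1's coefficient, because A1 absorbs part of d1's effect:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
d1 = rng.integers(0, 2, n).astype(float)
A1 = 1.5 * d1 + rng.normal(0, 1, n)        # A1 correlated with d1
y = 1.0 + 2.0 * A1 - 1.0 * d1 + rng.normal(0, 1, n)

def fit(predictors, y):
    """OLS coefficients with an intercept column prepended."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

b_full = fit([A1, d1], y)   # [b0, b_A1, b_d1]
b_red = fit([A1], y)        # [b0, b_A1] after dropping d1

# A1's estimate shifts once d1 is gone, since d1's (negative) effect
# leaks into A1 through their correlation.
print("b_A1, full model:   ", round(b_full[1], 3))
print("b_A1, reduced model:", round(b_red[1], 3))
```

The full-model estimate sits near the true value 2.0, while the reduced-model estimate is pulled downward by the omitted negative effect of d1.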
Best Answer
I am not a big fan of converting a continuous variable into multiple dummy variables, though I gather the binning procedure is considered standard practice in scorecard development.
Regarding dummy-variable insignificance: when you add dummy variables in a regression, the omitted group acts as the reference group, and each dummy's group is compared with that reference. When a variable has a nonlinear (e.g., quadratic) relationship with the log-odds, some dummy variables may come out insignificant, namely those for groups whose effect is close to the reference group's. My suggestion is to look at the pattern of log-odds in each bin before merging. Depending on the pattern, you can either make fewer final bins or change the reference group. I know this is a bit abstract, but I cannot be more specific without knowing the case.
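A small sketch of the suggested check, on simulated data (all names and numbers are illustrative, not from the question): compute the empirical log-odds of the outcome within each bin, so you can see whether the pattern is nonlinear before deciding which bins to merge or which to use as the reference group:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000
x = rng.uniform(0, 10, n)
# Simulated binary outcome with a quadratic effect on the log-odds
logit = -2 + 0.9 * x - 0.08 * x**2
p = 1 / (1 + np.exp(-logit))
outcome = rng.binomial(1, p)

bins = np.linspace(0, 10, 6)          # 5 equal-width bins
idx = np.digitize(x, bins[1:-1])      # bin index (0..4) per observation
for b in range(5):
    rate = outcome[idx == b].mean()
    log_odds = np.log(rate / (1 - rate))
    print(f"bin {b}: empirical log-odds = {log_odds:.2f}")
```

Here the log-odds rise and then fall across the bins, so the bins nearest the reference group's level would look insignificant in a dummy-coded regression even though the variable clearly matters.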
You could also drop the insignificant dummy variable. Doing so effectively merges that dummy's group into the reference group, which may not be appropriate if merging those two groups doesn't make business sense.