Solved – Why does SAS Enterprise Miner keep all dummy variables for a coded categorical variable in stepwise logistic regression?

categorical data, feature selection, regression, sas

SAS Enterprise Miner conveniently creates coded dummy variables for any categorical variable used in a logistic regression model. However, when the Regression node performs stepwise variable selection, including one of a variable's dummy variables in the model automatically brings in all of its other dummy variables as well, even those that are not found to be predictive of the target.

Here's a snippet of the node Results output after the stepwise selection showing that the dummy variables for some of the levels of the Industry variable are significant in the model, but others are not.

Parameter      DF    Estimate       Error    Chi-Square    Pr > ChiSq        

Intercept       1     -9.2383      1.9222         23.10        <.0001
IMP_REP_Age     1      0.3594      0.0938         14.69        0.0001
IMP_UnionSubs 
     No         1      0.5114      0.1472         12.07        0.0005

Industry
  Agriculture   1      1.4439      0.1871         59.54        <.0001
  Construction  1      1.2982      0.2228         33.97        <.0001
  Finance       1     -0.3826      0.2536          2.28        0.1313
  IT            1     -0.1355      0.2641          0.26        0.6080
  Professional  1      0.3569      0.3469          1.06        0.3037                 
  Public Sector 1     -2.3698      0.3522         45.28        <.0001              
  Retail        1     -1.3766      0.5483          6.30        0.0120                       

Occupation Type
   Casual       1      0.0260      0.2499          0.01        0.9171  
   Employed     1     -0.9068      0.1828         24.61        <.0001  

So, for example, the Industry-Agriculture variable seems predictive of the target, but the Industry-IT variable does not. All seven dummy variables for the seven levels of the Industry variable are included in the final model, however.

It seems to me that in the stepwise selection the dummy variables should be treated as individual variables rather than as a group. Does anyone know why SAS Enterprise Miner does it differently?

Best Answer

Much as I dislike stepwise regression, if you are going to do it, I think EM's behavior is appropriate, for two reasons.

1) A set of dummy variables all go together. In your model, you are saying that industry is a predictor, not that a particular industry is.

2) If you dropped the nonsignificant dummy variables, the estimates for the others would change, because you would then be either controlling for different things or eliminating some subjects from the sample.
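
Both points can be illustrated with a minimal Python sketch (this is not SAS Enterprise Miner code). It assumes reference-level dummy coding and a model with Industry as the only predictor, in which case the maximum-likelihood coefficient of each dummy is simply the log-odds of that level minus the log-odds of the reference level. The counts, the "Other" reference level, and all function names are hypothetical and chosen only for illustration; EM's own coding scheme and test statistics may differ.

```python
import math

# Hypothetical (events, non-events) counts per Industry level; not from the post.
counts = {
    "Agriculture": (60, 40),
    "Finance": (22, 78),
    "IT": (25, 75),
    "Other": (30, 70),  # assumed reference level
}


def log_odds(events, non_events):
    return math.log(events / non_events)


def dummy_coefficients(counts, reference):
    """Closed-form MLE for a logistic model with one categorical predictor:
    each dummy coefficient is that level's log-odds minus the reference's."""
    ref = log_odds(*counts[reference])
    return {
        level: log_odds(e, n) - ref
        for level, (e, n) in counts.items()
        if level != reference
    }


def pool_into_reference(counts, dropped, reference):
    """Dropping a level's dummy (but keeping its rows) pools it with the reference."""
    pooled = {k: v for k, v in counts.items() if k not in dropped}
    e = counts[reference][0] + sum(counts[k][0] for k in dropped)
    n = counts[reference][1] + sum(counts[k][1] for k in dropped)
    pooled[reference] = (e, n)
    return pooled


def log_likelihood(counts):
    """Binomial log-likelihood with each group's probability at its observed rate."""
    total = 0.0
    for e, n in counts.values():
        p = e / (e + n)
        total += e * math.log(p) + n * math.log(1 - p)
    return total


def joint_lr_test(counts):
    """Likelihood-ratio statistic for the whole factor versus intercept only,
    with degrees of freedom equal to the number of dummies."""
    all_e = sum(e for e, _ in counts.values())
    all_n = sum(n for _, n in counts.values())
    null_ll = log_likelihood({"pooled": (all_e, all_n)})
    return 2 * (log_likelihood(counts) - null_ll), len(counts) - 1


full = dummy_coefficients(counts, "Other")
reduced = dummy_coefficients(
    pool_into_reference(counts, ["Finance", "IT"], "Other"), "Other"
)
lr_stat, lr_df = joint_lr_test(counts)

print("Agriculture coefficient, all dummies kept:", round(full["Agriculture"], 4))
print("Agriculture coefficient, Finance/IT dropped:", round(reduced["Agriculture"], 4))
print("Joint LR statistic for Industry:", round(lr_stat, 2), "on", lr_df, "df")
```

In this toy setup, dropping the Finance and IT dummies pools those records with the reference group, so the Agriculture coefficient ends up measuring a contrast against a different baseline. The joint statistic is the kind of multi-degree-of-freedom test that treats Industry as a single effect; the sketch does not claim that EM uses exactly this test in its stepwise procedure.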