(Automated) feature selection in multiple regression with categorical variables

Tags: categorical data, feature selection, multiple regression, regression, stepwise regression

I am looking for general guidance on appropriate approaches to automated feature selection in multiple regression with categorical variables.

In my case, I have several numeric and categorical independent variables. I want to predict a numeric value with multiple regression, including the categorical variables via effect coding (find effect coding ref. here).
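To make the effect coding setup concrete, here is a minimal sketch of how a categorical variable with k levels becomes k-1 columns. The helper name `effect_code`, the example variable `colour`, and the choice of the last category as the reference level are assumptions for illustration only, not part of any particular library.

```python
import numpy as np

def effect_code(levels, categories):
    """Return an (n, k-1) effect-coded matrix.

    The last entry of `categories` is treated as the reference level and
    is coded as -1 in every column (an assumption of this sketch).
    """
    categories = list(categories)
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    out = np.zeros((len(levels), k - 1))
    for row, level in enumerate(levels):
        i = index[level]
        if i == k - 1:
            out[row, :] = -1.0   # reference level: -1 in all columns
        else:
            out[row, i] = 1.0    # non-reference level: 1 in its own column
    return out

# Hypothetical three-level predictor
colour = ["red", "green", "blue", "green"]
X_colour = effect_code(colour, categories=["red", "green", "blue"])
```

Here one categorical variable is spread over two columns, which is why a selection method that judges columns one at a time can end up keeping only part of it.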

My questions are:

  • I am familiar with stepwise feature selection methods, which I have used with logistic regression models. Are they likely to be successful in this case, too?

  • At what point should such automated feature selection methods be applied? If I run them after creating the effect-coded variables, the method may reject only some of the effect-coded columns derived from one categorical variable, so that this variable is no longer fully represented. Is this a problem?

  • What are the most popular automated feature selection methods when dealing with categorical variables?

Best Answer

Stepwise regression does not work well with logistic regression and I expect it to be equally unsuccessful here. What made you think you need feature selection as part of the modeling process?

If you absolutely do need to incorporate feature selection, choose a method that keeps together the multiple parameters describing one predictor, so that all columns derived from a categorical variable enter or leave the model as a group. Examples are $F$ tests with multiple numerator degrees of freedom or other simple translations of partial sums of squares.
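As a rough sketch of what such a grouped test looks like, the code below computes a partial $F$ statistic for dropping all effect-coded columns of one categorical predictor at once, by comparing the residual sums of squares of the full and reduced least-squares fits. The function names `rss` and `partial_f`, the simulated data, and the column layout are assumptions made for illustration. The sketch stops at the statistic and its degrees of freedom and leaves the comparison against an $F$ distribution to whatever statistics package you use.

```python
import numpy as np

def rss(X, y):
    """Residual sum of squares from an ordinary least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def partial_f(X_full, y, group_cols):
    """F statistic for dropping every column in group_cols together."""
    keep = [j for j in range(X_full.shape[1]) if j not in group_cols]
    rss_full = rss(X_full, y)
    rss_reduced = rss(X_full[:, keep], y)
    df_num = len(group_cols)                                   # parameters tested jointly
    df_den = X_full.shape[0] - np.linalg.matrix_rank(X_full)   # residual df of full model
    f_stat = ((rss_reduced - rss_full) / df_num) / (rss_full / df_den)
    return f_stat, df_num, df_den

# Simulated data (hypothetical): one numeric predictor, one three-level factor
rng = np.random.default_rng(0)
n = 120
x = rng.normal(size=n)
g = rng.integers(0, 3, size=n)        # factor levels 0, 1, 2

# Effect coding with level 2 as the reference
E = np.zeros((n, 2))
E[g == 0, 0] = 1.0
E[g == 1, 1] = 1.0
E[g == 2, :] = -1.0

y = 1.0 + 0.5 * x + np.array([2.0, 0.0, -2.0])[g] + rng.normal(size=n)

# Columns: intercept, x, and the two effect-coded columns of the factor
X = np.column_stack([np.ones(n), x, E])
f_stat, df_num, df_den = partial_f(X, y, group_cols=[2, 3])
```

The point of the design is that `group_cols` always lists every column belonging to one predictor, so the factor is either kept whole or dropped whole, with 2 numerator degrees of freedom here rather than two separate 1-df tests.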