Solved – Does it make sense to apply recursive feature elimination on one-hot encoded features

categorical-encoding, feature-selection, pandas, python, scikit-learn

Does it make sense to apply recursive feature elimination on a feature set pre-processed with One-Hot Encoding?

This is my code for feature selection:

from xgboost import XGBClassifier
from sklearn import feature_selection, model_selection

xgb = XGBClassifier(n_estimators=100,
                    objective='multi:softprob',
                    num_class=4,
                    random_state=42)
rfecv = feature_selection.RFECV(estimator=xgb,
                                step=10,  # drop 10 features per iteration
                                cv=model_selection.StratifiedKFold(2),
                                scoring='f1_weighted',
                                n_jobs=-1,
                                verbose=2)
rfecv.fit(X_train, y_train)

The DataFrame X_train contains both continuous and categorical features: the categorical features are one-hot encoded, while the continuous features are passed through MinMaxScaler.
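For reference, the preprocessing looks roughly like this (a minimal sketch with made-up column names and toy data, not my exact code):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

# Toy data standing in for the real training set (illustrative only)
df = pd.DataFrame({
    'age':    [23, 45, 31, 52],
    'income': [40_000, 85_000, 62_000, 91_000],
    'city':   ['NY', 'LA', 'NY', 'SF'],
})

preprocess = ColumnTransformer([
    ('num', MinMaxScaler(), ['age', 'income']),                 # scale continuous features to [0, 1]
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['city']),  # expand categoricals into 0/1 columns
])

X_train = preprocess.fit_transform(df)  # 2 scaled columns + 3 one-hot columns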

I am not sure whether it makes sense to eliminate one-hot encoded columns with RFECV. Should I run RFECV on the continuous features only? Or should I somehow re-apply one-hot encoding at each iteration of RFECV?

Best Answer

No, it does not make sense. If you have a categorical variable Cat with 10 levels A, B, C, ..., J that you one-hot encode, the variable is still Cat: if you want feature selection, you should either keep Cat or omit Cat, that is, all or none of its one-hot-encoded columns. Omitting only some of the columns changes the meaning of the variable, and hence of the model.

More concretely, if you, as usual, drop one of the columns as a reference level, say A, and your feature selection later drops C, the model is forced to assume that levels A and C act identically, which might be wrong. Moreover, had you chosen a different reference level at the outset, you might have gotten very different results.
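If you still want an RFE-style procedure that respects this all-or-nothing constraint, one option is to score and eliminate whole variables rather than individual columns, by summing the model's importances over each variable's one-hot columns. Here is a minimal sketch; group_rfe and the column names are illustrative assumptions, not a scikit-learn facility:

import pandas as pd
from xgboost import XGBClassifier

def group_rfe(X, y, groups, n_keep):
    """Recursively drop whole variables (all of their one-hot columns at once).

    groups maps each original variable name to its list of columns in X;
    a continuous feature is simply a group of size one.
    """
    groups = dict(groups)
    while len(groups) > n_keep:
        cols = [c for g in groups.values() for c in g]
        model = XGBClassifier(n_estimators=100, random_state=42)
        model.fit(X[cols], y)
        imp = pd.Series(model.feature_importances_, index=cols)
        # Score each variable by the summed importance of its columns
        scores = {name: imp[g].sum() for name, g in groups.items()}
        weakest = min(scores, key=scores.get)
        del groups[weakest]  # eliminate the whole variable, not single columns
    return groups

# Hypothetical usage: 'city' owns all of its one-hot columns
# kept = group_rfe(X_train, y_train,
#                  {'age': ['age'], 'city': ['city_NY', 'city_LA', 'city_SF']},
#                  n_keep=1)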

This has already been discussed here; see especially Can I ignore coefficients for non-significant levels of factors in a linear model?, Is it advisable to drop certain levels of a categorical variable?, and Frank Harrell's answer at Can a factor be changed to binomial levels to achieve model validation and extract insignificant variables?

If the problem is that there are very many levels and you want some data-driven way of collapsing them, see Principled way of collapsing categorical variables with many levels?
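As a crude illustration of a data-driven collapse (the linked thread covers more principled approaches; the 1% threshold below is an arbitrary assumption):

import pandas as pd

def collapse_rare_levels(s: pd.Series, min_frac=0.01):
    """Replace levels rarer than min_frac of the data with 'Other'."""
    freq = s.value_counts(normalize=True)
    rare = freq[freq < min_frac].index
    return s.where(~s.isin(rare), 'Other')

# df['city'] = collapse_rare_levels(df['city'])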
