Solved – Recursive feature elimination and one-hot & dummy encoding

categorical-encoding, feature selection, logistic, regression, stepwise regression

When using recursive feature elimination (RFE) with linear regression or logistic regression, should we one-hot encode the features (K levels, K dummy features) or dummy-encode them (K levels, K-1 dummy features, leaving one level out)?

According to a comment by @Matthew Drury on an answer (linked below), one-hot encoding is used for a regularized linear model and dummy encoding for an unregularized linear model. My question is which type of encoding to use with RFE when there are no L1/L2 penalties.
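To make the two encodings concrete, here is a minimal sketch in plain Python. The `encode` helper and the `colour` data are made up for illustration; they are not taken from any library.

```python
def encode(values, drop_first=False):
    """Return (column_names, rows) of 0/1 indicators for a categorical list.

    drop_first=False -> one-hot encoding (K columns for K levels)
    drop_first=True  -> dummy encoding (K-1 columns; the first sorted
                        level is left out and acts as the reference)
    """
    levels = sorted(set(values))
    kept = levels[1:] if drop_first else levels
    rows = [[1 if v == level else 0 for level in kept] for v in values]
    return kept, rows


# Made-up data with K = 4 levels.
colour = ["red", "green", "blue", "green", "yellow"]

onehot_cols, onehot_rows = encode(colour)
dummy_cols, dummy_rows = encode(colour, drop_first=True)

print(onehot_cols)  # one column per level
print(dummy_cols)   # one column fewer; "blue" is the reference level
print(dummy_rows[2])  # the "blue" row is all zeros under dummy encoding
```

Under one-hot encoding every row has exactly one 1; under dummy encoding the reference level is the all-zeros row.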

Problems with one-hot encoding vs. dummy encoding

My understanding is that RFE eliminates some features. So if we dummy-encode a categorical variable with, say, 4 levels, giving 3 features/levels in the model, and RFE eliminates 1 of them, only 2 features/levels are left, and the interpretation of their coefficients would not make sense in the absence of the one level that was left out as the reference.

Whereas if we one-hot encode and RFE considers 2 features important and eliminates the other 2, we can still judge/interpret the coefficients or importance of the 2 features that RFE keeps.

So the question is: which type of encoding should be used when applying RFE with linear and logistic regression?

Best Answer

This is not really about the kind of encoding, as already explained in a comment by @ttnphns. The kind (dialect) of encoding is more of an algorithm/implementation detail. For variable reduction (feature elimination) in logistic regression, see How to reduce predictors the right way for a logistic regression model and this list of answered questions about RFE.

If the question is about dropping or joining some of the levels of the categorical variable, that is a very different question. Mostly the answer is: don't do it, because it changes the definition of the variable. The categorical variable with all of its levels is the variable, and should be dropped or kept as is. See Does it make sense to apply recursive feature elimination on one-hot encoded features?.
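A small sketch of why removing a single level's column changes the definition of the variable. The factor, its levels, and the `dummy_row` helper are hypothetical; the point is only that after a column-level elimination, the dropped level becomes indistinguishable from the reference level.

```python
# Hypothetical factor with 4 levels; "A" is the reference under dummy coding.
levels = ["A", "B", "C", "D"]
dummy_cols = ["B", "C", "D"]


def dummy_row(level, cols):
    """0/1 indicator row for one observation with the given level."""
    return tuple(1 if level == c else 0 for c in cols)


# Full dummy coding: every level has its own distinct row pattern.
full = {lvl: dummy_row(lvl, dummy_cols) for lvl in levels}

# Suppose a column-level elimination removes the "B" column.
reduced_cols = ["C", "D"]
reduced = {lvl: dummy_row(lvl, reduced_cols) for lvl in levels}

# Levels whose rows now look exactly like the reference level's row.
merged = [lvl for lvl in levels if reduced[lvl] == reduced["A"]]

print(full)
print(reduced)
print(merged)  # "B" has silently been merged into the reference "A"
```

The reduced model no longer contains the original 4-level variable but a new 3-level one in which A and B are pooled.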

If the problem is that there are very many levels, see Principled way of collapsing categorical variables with many levels?.

Part of the problem comes from a lack of abstraction. When the linear model is presented in matrix language as $$ Y=X\beta +\epsilon $$ this is not really where modeling starts. Some of the columns of $X$ each represent a single one-degree-of-freedom variable, maybe a continuous variable like age, but others come from multi-df variables: maybe a spline or polynomial in age, maybe a factor. The matrix formulation forgets these relationships: the fact that several columns together represent one logical variable is not represented, which is a loss.

Some modeling languages, like R, preserve this relationship, in R with terms objects. So, if RFE is used with one-hot coded columns, it should be done not at the column level but at the terms level. In R, if you do not do the one-hot encoding "by hand" but instead declare a factor variable and leave the actual coding to R, the built-in functions for stepwise modeling will use the terms structure and so do the right thing.

Whether RFE is a good idea at all is another question; see Are there any circumstances where stepwise regression should be used?
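The term-level idea can be sketched outside R as well. The following is a minimal illustration in Python with NumPy, not scikit-learn's `RFE` and not R's stepwise functions: the `eliminate_terms` helper, the residual-sum-of-squares criterion, and the data are all assumptions made for the example. Each term is a block of columns, and a block is either kept whole or dropped whole.

```python
import numpy as np


def one_hot(values):
    """One-hot encode a list of labels; returns (levels, n x K 0/1 array)."""
    levels = sorted(set(values))
    return levels, np.array([[float(v == lv) for lv in levels] for v in values])


def rss(X, y):
    """Residual sum of squares of a least-squares fit (rank-deficient X is fine)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)


def eliminate_terms(terms, y, n_keep):
    """Backward elimination over terms (column blocks), not single columns.

    terms: dict mapping term name -> 2-D array with all columns of that term.
    At each step, drop the term whose removal raises the RSS the least.
    Returns (kept term names, dropped term names in order of removal).
    """
    intercept = np.ones((len(y), 1))
    kept, dropped = list(terms), []
    while len(kept) > n_keep:
        def rss_without(t):
            blocks = [intercept] + [terms[k] for k in kept if k != t]
            return rss(np.hstack(blocks), y)
        weakest = min(kept, key=rss_without)
        kept.remove(weakest)
        dropped.append(weakest)
    return kept, dropped


# Made-up data: y depends on age and on a 4-level factor, not on "noise".
n = 12
idx = np.arange(n)
age = 20.0 + 4.0 * idx
group = [["a", "b", "c", "d"][i % 4] for i in range(n)]
noise = ((7 * idx) % 5).astype(float)
effect = {"a": 0.0, "b": 5.0, "c": -3.0, "d": 8.0}
y = 2.0 * age + np.array([effect[g] for g in group]) + 0.1 * np.sin(idx)

levels, G = one_hot(group)  # all 4 indicator columns form ONE term
terms = {"age": age[:, None], "group": G, "noise": noise[:, None]}

kept, dropped = eliminate_terms(terms, y, n_keep=2)
print("kept:", kept)
print("dropped:", dropped)
```

Because the four `group` columns travel together, the procedure can never leave behind a factor with some of its levels missing, which is the behaviour R gets for free from its terms structure.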