This depends on the model (and maybe even the software) you want to use. With linear regression, or generalized linear models estimated by maximum likelihood or least squares (in R this means using the functions `lm` or `glm`), you need to leave out one column. Otherwise you will get a message about some columns being "left out because of singularities"$^\dagger$.
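For instance, a minimal sketch with a single three-level factor (the data are simulated and all names are mine, not from the question):

```r
set.seed(1)
f <- factor(sample(c("a", "b", "c"), 100, replace = TRUE))
y <- rnorm(100)

X <- model.matrix(~ f - 1)   # all k = 3 indicator columns by hand: fa, fb, fc
d <- data.frame(y, X)

fit <- lm(y ~ ., data = d)
summary(fit)                 # "(1 not defined because of singularities)": one coefficient is NA

fit2 <- lm(y ~ f)            # declaring a factor instead: R drops one column for you
```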
But if you estimate such models with regularization, for example ridge, the lasso, or the elastic net, then you should not leave out any columns. The regularization takes care of the singularities, and, more importantly, the prediction obtained may depend on which columns you leave out. That will not happen when you do not use regularization$^\ddagger$. See the answer at How to interpret coefficients of a multinomial elastic net (glmnet) regression, which supports this view (with a direct quote from one of the authors of `glmnet`).
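A minimal sketch of keeping the full one-hot matrix in a penalized fit, assuming the `glmnet` package (the simulated data and the `alpha` and `s` values are arbitrary illustrations):

```r
library(glmnet)  # assumed available

set.seed(1)
f <- factor(sample(c("a", "b", "c"), 100, replace = TRUE))
y <- rnorm(100)
X <- model.matrix(~ f - 1)        # keep all k indicator columns, no reference level

fit <- glmnet(X, y, alpha = 0)    # ridge; alpha = 1 would give the lasso
coef(fit, s = 0.1)                # all three level effects, shrunk symmetrically
```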
With other models, use the same principle: if the predictions obtained depend on which columns you leave out, then do not do it; otherwise it is fine.
So far, this answer has only mentioned linear (and some mildly non-linear) models. But what about very non-linear models, like trees and random forests? Ideas about categorical encoding, like one-hot, stem mainly from linear models and their extensions. There is little reason to think that ideas derived from that context should apply without modification to trees and forests! For some ideas, see Random Forest Regression with sparse data in Python.
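In R, for example, tree ensembles can take the factor directly, with no one-hot step at all. A sketch assuming the `randomForest` package (simulated data, illustrative names):

```r
library(randomForest)  # assumed available

set.seed(1)
d <- data.frame(
  f = factor(sample(c("a", "b", "c"), 200, replace = TRUE)),
  x = rnorm(200)
)
d$y <- (d$f == "b") + d$x + rnorm(200, sd = 0.1)

rf <- randomForest(y ~ f + x, data = d)  # splits on the factor directly; no one-hot step
```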
$^\dagger$ But if you use factor variables, R will take care of that for you.
$^\ddagger$ Trying to answer an extra question from the comments: when using regularization, iterative methods are most often used (as with the lasso or the elastic net) which do not need matrix inversion, so the design matrix not having full rank is not a problem. With ridge regularization matrix inversion may be used, but in that case the regularization term added to the matrix before inversion makes it invertible. That is the technical reason; a more profound reason is that removing one column changes the optimization problem: it changes the meaning of the parameters, and it will actually lead to different optimal solutions. As a concrete example, say you have a categorical variable with three levels, 1, 2 and 3. The corresponding parameters are $\beta_1, \beta_2, \beta_3$. Leaving out column 1 corresponds to setting $\beta_1=0$, while the other two parameters change meaning to $\beta_2-\beta_1, \beta_3-\beta_1$. So those two differences are what get shrunk. If you leave out another column, other contrasts in the original parameters will be shrunk. So this changes the criterion function being optimized, and there is no reason to expect equivalent solutions! A simulated example makes this concrete:
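This is a minimal simulated sketch, assuming the `glmnet` package (the data, the penalty value, and all names are arbitrary illustrations, not from the original answer):

```r
library(glmnet)  # assumed available

set.seed(1)
f <- factor(sample(c("1", "2", "3"), 200, replace = TRUE))
y <- 2 * (f == "3") + rnorm(200)

X1 <- model.matrix(~ f)[, -1]                      # reference coding, level 1 left out
X2 <- model.matrix(~ relevel(f, ref = "3"))[, -1]  # level 3 left out instead

fit1 <- glmnet(X1, y, alpha = 0, lambda = 0.5)
fit2 <- glmnet(X2, y, alpha = 0, lambda = 0.5)

# Different contrasts get shrunk, so the fitted values differ:
max(abs(predict(fit1, X1) - predict(fit2, X2)))    # generally nonzero

# The full one-hot matrix treats all levels symmetrically:
Xfull   <- model.matrix(~ f - 1)
fitfull <- glmnet(Xfull, y, alpha = 0, lambda = 0.5)
```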
One-hot encoding would be a preliminary step toward dummy coding, effect coding, or any other parameterization of a categorical variable. I don't know anything about scikit-learn (and questions about code are off topic here), but statistical programs such as SAS, R, and SPSS do this encoding for you. It simply takes a single column of labels and turns it into $k$ columns of 0's and 1's where there are $k$ different labels.
You then have to choose what parameterization you want and which label you would like to use as your reference category. This has been discussed here before and will also be covered in any basic regression book.
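In R, for instance, this parameterization step looks like the following sketch (an illustrative three-level factor; note that by default R picks the first level in alphabetical order as the reference):

```r
f <- factor(c("low", "mid", "high", "mid"))

model.matrix(~ f)        # dummy coding: intercept + k - 1 columns ("high" is the reference here)
model.matrix(~ f - 1)    # one-hot: all k indicator columns
contr.sum(3)             # effect (sum) coding contrasts for a 3-level factor

f2 <- relevel(f, ref = "low")   # choose a different reference category
model.matrix(~ f2)
```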
Best Answer
This is not really about the kind of encoding, as already explained in a comment by @ttnphns. The kind (dialect) of encoding is more of an algorithm/implementation detail. For variable reduction (feature elimination) in logistic regression, see How to reduce predictors the right way for a logistic regression model and this list of answered questions about rfe.
If the question is about dropping/joining some of the levels of the categorical variable, that is a very different question. Mostly the answer is: don't do it, as it changes the definition of the variable. The categorical variable with all of its levels is the variable, and it should be dropped or kept as is. See Does it make sense to apply recursive feature elimination on one-hot encoded features?.
If the problem is very many levels, see Principled way of collapsing categorical variables with many levels?.
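If you do decide to collapse rare levels, one crude base-R sketch is below (the helper name `collapse_rare` and the threshold are hypothetical; the linked thread discusses more principled approaches):

```r
collapse_rare <- function(f, min_count = 20) {
  tab  <- table(f)
  rare <- names(tab)[tab < min_count]
  factor(ifelse(as.character(f) %in% rare, "other", as.character(f)))
}
```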
Part of the problem comes from a lack of abstraction. When the linear model is presented in matrix language as $$ Y=X\beta +\epsilon $$ this is not really where modeling starts. Some of the columns of $X$ each represent one 1D variable, maybe a continuous variable like `age`, but others come from multi-df variables, maybe a spline or a polynomial in `age`, maybe a factor, ... The matrix language forgets about these relations: the multiple columns representing one logical variable are "forgotten", their relationship is unrepresented in the matrix formulation, which is a loss. Some modeling languages, like R, preserve this relationship, in R via `terms` objects. So if RFE is used with one-hot coded columns, it should be done not at the column level, but at the `terms` level. In R, if you do the one-hot encoding not "by hand" yourself, but by declaring a factor variable and leaving the actual coding to R, the built-in functions for stepwise modeling will use the `terms` structure and so do the right thing, as the sketch below shows. Whether RFE is a good idea at all is another question; see Are there any circumstances where stepwise regression should be used?