Scikit-learn's linear regression model allows users to disable the intercept. So for one-hot encoding, should I always set fit_intercept=False? And for dummy encoding, should fit_intercept always be set to True? I do not see any "warning" about this on the website.
For an unregularized linear model with one-hot encoding, yes, you need to set fit_intercept=False, or else you incur perfect collinearity. sklearn also allows a ridge shrinkage penalty, and in that case it is not necessary; in fact, you should include both the intercept and all the levels. For dummy encoding you should include an intercept, unless you have standardized all your variables, in which case the intercept is zero.
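A minimal sketch of the two set-ups in scikit-learn (the data and column names are made up for illustration); the final check shows that the two parameterizations give the same fitted values:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# toy data: one categorical predictor with three levels (made-up example)
df = pd.DataFrame({"color": ["red", "green", "blue", "green", "red", "blue"],
                   "y":     [1.0,   2.0,     3.0,    2.5,     0.5,   3.5]})

# one-hot encoding: keep all three levels, drop the intercept
X_onehot = pd.get_dummies(df["color"], dtype=float)
m_onehot = LinearRegression(fit_intercept=False).fit(X_onehot, df["y"])

# dummy encoding: drop one reference level, keep the intercept
X_dummy = pd.get_dummies(df["color"], drop_first=True, dtype=float)
m_dummy = LinearRegression(fit_intercept=True).fit(X_dummy, df["y"])

# both parameterizations span the same column space, so fitted values agree
print(np.allclose(m_onehot.predict(X_onehot), m_dummy.predict(X_dummy)))  # True
```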
Since one-hot encoding generates more variables, does it have more degrees of freedom than dummy encoding?
The intercept is an additional degree of freedom, so in a well-specified model it all evens out.
For the second one, what if there are $k$ categorical variables? $k$ variables are removed in dummy encoding. Are the degrees of freedom still the same?
You could not fit a model in which you used all the levels of both categorical variables, intercept or not. For, as soon as you have one-hot-encoded all the levels in one variable in the model, say with binary variables $x_1, x_2, \ldots, x_n$, then you have a linear combination of predictors equal to the constant vector
$$ x_1 + x_2 + \cdots + x_n = 1 $$
If you then try to enter all the levels of another categorical $x'$ into the model, you end up with a distinct linear combination equal to a constant vector
$$ x_1' + x_2' + \cdots + x_k' = 1 $$
and so you have created a linear dependency
$$ x_1 + x_2 + \cdots + x_n - x_1' - x_2' - \cdots - x_k' = 0 $$
So you must leave out a level in the second variable, and everything lines up properly.
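A quick numerical check of that dependency (toy data; pandas get_dummies does the encoding): each fully one-hot-encoded block sums row-wise to the all-ones vector, so the difference of the two block sums is identically zero.

```python
import pandas as pd

# two fully one-hot-encoded categorical variables (arbitrary toy data)
df = pd.DataFrame({"a": ["x", "y", "z", "x", "y"],
                   "b": ["p", "q", "p", "q", "p"]})
A = pd.get_dummies(df["a"], dtype=float).to_numpy()   # x_1, ..., x_n
B = pd.get_dummies(df["b"], dtype=float).to_numpy()   # x'_1, ..., x'_k

# each block sums row-wise to 1, so their difference is identically zero
print(A.sum(axis=1))                  # [1. 1. 1. 1. 1.]
print(A.sum(axis=1) - B.sum(axis=1))  # [0. 0. 0. 0. 0.]
```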
Say I have 3 categorical variables, each of which has 4 levels. In dummy encoding, $3 \times 4 - 3 = 9$ variables are built, with one intercept. In one-hot encoding, $3 \times 4 = 12$ variables are built, without an intercept. Am I correct?
The second thing does not actually work. The $3 \times 4 = 12$ column design matrix you create will be singular. You need to remove three columns, one from each of three distinct categorical encodings, to recover non-singularity of your design.
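A quick way to see this is to build such a matrix and check its rank (a sketch with made-up data): the 12-column one-hot design is rank-deficient, while dropping one level per variable and adding an intercept gives a full-column-rank design.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# three categorical variables, four levels each (made-up data)
df = pd.DataFrame({c: rng.choice(list("ABCD"), size=100) for c in ["f1", "f2", "f3"]})

X_full = pd.get_dummies(df).to_numpy(float)                    # 12 columns
X_drop = pd.get_dummies(df, drop_first=True).to_numpy(float)   # 9 columns
X_drop = np.column_stack([np.ones(len(df)), X_drop])           # add intercept

print(X_full.shape[1], np.linalg.matrix_rank(X_full))  # 12 10 -> singular
print(X_drop.shape[1], np.linalg.matrix_rank(X_drop))  # 10 10 -> full column rank
```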
This depends on the models (and maybe even software) you want to use. With linear regression, or generalized linear models estimated by maximum likelihood (or least squares) (in R this means using the functions lm or glm), you need to leave out one column. Otherwise you will get a message about some columns being "left out because of singularities"$^\dagger$.
But if you estimate such models with regularization, for example ridge, lasso or the elastic net, then you should not leave out any columns. The regularization takes care of the singularities, and, more importantly, the prediction obtained may depend on which columns you leave out. That will not happen when you do not use regularization$^\ddagger$. See the answer at How to interpret coefficients of a multinomial elastic net (glmnet) regression, which supports this view (with a direct quote from one of the authors of glmnet).
With other models, use the same principle. If the predictions obtained depend on which columns you leave out, then do not leave any out. Otherwise it is fine.
So far, this answer has only mentioned linear (and some mildly non-linear) models. But what about very non-linear models, like trees and random forests? The ideas about categorical encoding, like one-hot, stem mainly from linear models and their extensions. There is little reason to think that ideas derived from that context should apply without modification to trees and forests! For some ideas, see Random Forest Regression with sparse data in Python.
$^\dagger$ But, using factor variables, R will take care of that for you.
$^\ddagger$ Trying to answer the extra question in the comments: When using regularization, most often iterative methods are used (as with the lasso or elastic net) which do not need matrix inversion, so the fact that the design matrix does not have full rank is not a problem. With ridge regularization, matrix inversion may be used, but in that case the regularization term added to the matrix before inversion makes it invertible. That is the technical reason; a more profound reason is that removing one column changes the optimization problem: it changes the meaning of the parameters, and it will actually lead to different optimal solutions. As a concrete example, say you have a categorical variable with three levels, 1, 2 and 3. The corresponding parameters are $\beta_1, \beta_2, \beta_3$. Leaving out column 1 leads to $\beta_1=0$, while the other two parameters change meaning to $\beta_2-\beta_1, \beta_3-\beta_1$. So those two differences will be shrunk. If you leave out another column, other contrasts in the original parameters will be shrunk. So this changes the criterion function being optimized, and there is no reason to expect equivalent solutions! If this is not clear enough, I can add a simulated example (but not today).
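For illustration, here is a quick simulated sketch of that point using scikit-learn's Ridge (the data, penalty strength and seed are arbitrary): ordinary least squares gives the same predictions whichever reference level is dropped, while ridge does not.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(1)
# one categorical variable with three levels (arbitrary simulated data)
x = rng.choice(["1", "2", "3"], size=200)
y = np.select([x == "1", x == "2", x == "3"], [0.0, 1.0, 3.0]) + rng.normal(0, 1, 200)

D = pd.get_dummies(pd.Series(x), prefix="lvl", dtype=float)  # all three columns
X_drop1 = D.drop(columns="lvl_1").to_numpy()                 # reference level 1
X_drop3 = D.drop(columns="lvl_3").to_numpy()                 # reference level 3

for model in (LinearRegression(), Ridge(alpha=10.0)):
    p1 = model.fit(X_drop1, y).predict(X_drop1)
    p3 = model.fit(X_drop3, y).predict(X_drop3)
    # expected output: LinearRegression True, Ridge False
    print(type(model).__name__, np.allclose(p1, p3))
```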
Best Answer
The issue with representing a categorical variable that has $k$ levels with $k$ variables in regression is that, if the model also has a constant term, then the terms will be linearly dependent and hence the model will be unidentifiable. For example, if the model is $\mu = \beta_0 + \beta_1 X_1 + \beta_2 X_2$ and $X_2 = 1 - X_1$, then any choice $(\beta_0, \beta_1, \beta_2)$ of the parameter vector is indistinguishable from $(\beta_0 + \beta_2,\; \beta_1 - \beta_2,\; 0)$. So although software may be willing to give you estimates for these parameters, they aren't uniquely determined and hence probably won't be very useful.
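To see this explicitly, substitute $X_2 = 1 - X_1$ into the mean function:
$$ \beta_0 + \beta_1 X_1 + \beta_2 X_2 = \beta_0 + \beta_1 X_1 + \beta_2 (1 - X_1) = (\beta_0 + \beta_2) + (\beta_1 - \beta_2) X_1 + 0 \cdot X_2 . $$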
Penalization will make the model identifiable, but redundant coding will still affect the parameter values in weird ways, given the above.
The effect of a redundant coding on a decision tree (or ensemble of trees) will likely be to overweight the feature in question relative to others, since it's represented with an extra redundant variable and therefore will be chosen more often than it otherwise would be for splits.