I created a logistic regression model in R and fit the model using the MuMIn package. I have several categorical variables that were coded as factors, for example season (summer, fall, winter, spring) and color (brown, tan, white). The regression seemed to work fine – I didn't get any warnings or errors – but I recently stumbled across one-hot encoding, and I am wondering if I need to re-code the factors. Is one-hot encoding necessary for all non-ordinal categorical variables? How would one-hot encoding change how the variables are analyzed in the model?
Logistic Regression in R – How to Handle Categorical Variables with One-Hot Encoding
categorical-data, categorical-encoding, logistic, r, regression
Related Solutions
ttnphns is correct.
However, given your additional comments, I would suggest that the reviewer wanted the change merely for interpretation. If you want to stick with ANOVA-style results, just call it an ANOVA. ANCOVA and ANOVA are the same thing, as ttnphns pointed out. The difference is that in an ANCOVA the covariates are not treated as predictors of interest, and you definitely appear to want to do exactly that.
What the reviewer was getting at is that, while you can perform an ANOVA on continuous predictors, it is more typical to perform a regression. One benefit of doing so is that you get estimates of the effects of the continuous variable, and you can even look at interactions between it and the categorical predictor (which aren't included in an ANCOVA but can be in a regression).
You may need some help with interpreting the regression results, because funny things happen to the beta values once interactions enter the model; be careful if you plan to use them to judge the significance of your effects.
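As a rough sketch of the kind of regression the reviewer likely had in mind (the data and variable names here are made up purely for illustration):

```r
# Hypothetical data: a continuous covariate and a categorical factor
set.seed(1)
df <- data.frame(
  y     = rnorm(100),
  x     = rnorm(100),                                   # continuous predictor
  group = factor(sample(c("a", "b", "c"), 100, replace = TRUE))
)

# ANCOVA-style model: the covariate adjusts, no interaction
fit_main <- lm(y ~ x + group, data = df)

# Regression including the covariate-by-factor interaction
fit_int <- lm(y ~ x * group, data = df)

summary(fit_int)  # the interaction terms give separate slopes for x within each group
```

Note that once the interaction is included, the coefficient on x alone is the slope for the reference group only, which is exactly the interpretive trap mentioned above.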
Scikit-learn's linear regression model allows users to disable the intercept. So for one-hot encoding, should I always set fit_intercept=False? And for dummy encoding, should fit_intercept always be set to True? I do not see any warning about this on the website.
For an unregularized linear model with one-hot encoding, yes, you need to set fit_intercept=False, or else you incur perfect collinearity. sklearn also allows for a ridge shrinkage penalty, and in that case this is not necessary; in fact, you should include both the intercept and all the levels. For dummy encoding you should include an intercept, unless you have standardized all your variables, in which case the intercept is zero.
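Since the original question is in R, here is a sketch of the same collinearity issue using R's model.matrix rather than sklearn (the season factor is made up to match the question):

```r
df <- data.frame(season = factor(c("spring", "summer", "fall", "winter")))

# One-hot encoding: all levels, no intercept -- full column rank
X_onehot <- model.matrix(~ 0 + season, data = df)

# Adding an intercept on top of all levels: the level columns sum to the
# intercept column, so the design is perfectly collinear
X_bad <- cbind("(Intercept)" = 1, X_onehot)
qr(X_bad)$rank < ncol(X_bad)       # TRUE: rank-deficient

# Dummy coding: intercept plus n - 1 level indicators -- full column rank again
X_dummy <- model.matrix(~ season, data = df)
qr(X_dummy)$rank == ncol(X_dummy)  # TRUE
```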
Since one-hot encoding generates more variables, does it have more degrees of freedom than dummy encoding?
The intercept is an additional degree of freedom, so in a well-specified model it all equals out.
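A quick, illustrative way to see this in R (the season factor here is made up to match the question): the two codings produce design matrices with the same number of columns.

```r
df <- data.frame(season = factor(c("spring", "summer", "fall", "winter")))

ncol(model.matrix(~ 0 + season, data = df))  # one-hot: 4 columns, no intercept
ncol(model.matrix(~ season, data = df))      # dummy: intercept + 3 indicators = 4 columns
```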
For the second one, what if there are $k$ categorical variables? Then $k$ indicator variables are removed in dummy encoding, one per variable. Are the degrees of freedom still the same?
You could not fit a model in which you used all the levels of both categorical variables, intercept or not. For as soon as you have one-hot encoded all the levels of one variable, say with binary variables $x_1, x_2, \ldots, x_n$, you have a linear combination of predictors equal to the constant vector
$$ x_1 + x_2 + \cdots + x_n = 1 $$
If you then try to enter all the levels of another categorical $x'$ into the model, you end up with a distinct linear combination equal to a constant vector
$$ x_1' + x_2' + \cdots + x_k' = 1 $$
and so you have created a linear dependency
$$ x_1 + x_2 + \cdots + x_n - x_1' - x_2' - \cdots - x_k' = 0 $$
So you must leave out a level in the second variable, and everything lines up properly.
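You can verify this dependency numerically in R; a small sketch with two made-up factors:

```r
set.seed(2)
f <- factor(sample(c("a", "b"), 50, replace = TRUE))
g <- factor(sample(c("u", "v", "w"), 50, replace = TRUE))

X  <- model.matrix(~ 0 + f)  # all levels of the first factor
Xp <- model.matrix(~ 0 + g)  # all levels of the second factor

range(rowSums(X))                # 1 1: x_1 + ... + x_n = 1
range(rowSums(Xp))               # 1 1: x_1' + ... + x_k' = 1
range(rowSums(X) - rowSums(Xp))  # 0 0: the linear dependency above
```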
Say I have 3 categorical variables, each of which has 4 levels. In dummy encoding, $3 \times 4 - 3 = 9$ variables are built, with one intercept. In one-hot encoding, $3 \times 4 = 12$ variables are built without an intercept. Am I correct?
The second thing does not actually work. The $3 \times 4 = 12$ column design matrix you create will be singular. You need to remove three columns, one from each of three distinct categorical encodings, to recover non-singularity of your design.
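Here is a sketch of that 12-column case in R (made-up data): build the one-hot blocks by hand and check the rank.

```r
set.seed(3)
n  <- 200
f1 <- factor(sample(letters[1:4], n, replace = TRUE))
f2 <- factor(sample(letters[1:4], n, replace = TRUE))
f3 <- factor(sample(letters[1:4], n, replace = TRUE))

# All 4 levels of each factor, no intercept: 12 columns
X12 <- cbind(model.matrix(~ 0 + f1),
             model.matrix(~ 0 + f2),
             model.matrix(~ 0 + f3))
ncol(X12)     # 12
qr(X12)$rank  # 10: singular, since each block's columns sum to the same constant vector

# Standard dummy coding (intercept, one dropped level per factor) is full rank
Xd <- model.matrix(~ f1 + f2 + f3)
qr(Xd)$rank == ncol(Xd)  # TRUE (10 columns, rank 10)
```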
Best Answer
Presumably the package you use builds a design matrix using the built-in functions of R. These do dummy coding for factors, which is almost one-hot encoding, except that one class is used as a reference class. This means that for $n$ classes there will be $n - 1$ binary indicator variables. For the reference class all of these are 0; for any other class, exactly one indicator is 1 and the rest are 0.
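You can see the coding R uses with model.matrix; a minimal illustration with the season factor from your question:

```r
df <- data.frame(season = factor(c("fall", "spring", "summer", "winter")))

# R's default "treatment" contrasts: n - 1 indicators, first level as reference
model.matrix(~ season, data = df)
#   (Intercept) seasonspring seasonsummer seasonwinter
# 1           1            0            0            0   <- fall (reference class)
# 2           1            1            0            0
# 3           1            0            1            0
# 4           1            0            0            1
```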
It is not advisable to use, say, the integer values assigned to a factor coding directly in a model. Imagine we have a factor variable for color: yellow is 1, green is 2, red is 3. These numbers imply that red is somehow "2 more than" yellow, which is nonsense. In this sense you need one-hot encoding or something like it to deal with an unordered classification.
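To make the contrast concrete (the data below are made up): treating the integer codes as numeric forces a single slope across colors, while the factor coding estimates a separate effect per level.

```r
df <- data.frame(
  color = factor(c("yellow", "green", "red", "yellow", "red", "green"),
                 levels = c("yellow", "green", "red")),  # yellow = 1, green = 2, red = 3
  y     = c(1.2, 2.1, 0.7, 1.0, 0.9, 2.3)
)

# Misleading: assumes red is "2 more than" yellow on a single numeric scale
lm(y ~ as.integer(color), data = df)

# Appropriate: one indicator per non-reference level, no ordering implied
lm(y ~ color, data = df)
```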