Solved – Label encoding vs Dummy variable/one hot encoding – correctness

categorical-encoding, modeling, regression

I understand that when label encoding is used, the numeric values can be interpreted as having an order, and a model could assume a linear relationship. But shouldn't this only be a problem when there are in fact many levels in a categorical variable, e.g. country?

What about binary variables? For example, instead of gender.male (1, 0), what if I just used Gender (1, 0), where 0 is female and 1 is male? This shouldn't impact the model as much as label encoding a feature with multiple levels, should it?

How would this work for a feature with three levels (-1, 0, 1), where -1 means "not applicable", 0 means "No" and 1 means "Yes", instead of having two columns feature.not_applicable (1, 0) and feature.No (1, 0)? Mathematically, how would models be impacted? The models here would be GLMs, boosting models, random forests, etc.

Is label encoding recommended when a feature has, say, <= 3 to 4 levels, with one hot encoding or $n-1$ dummy variables recommended above that?
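To make the two encodings in the question concrete, here is a minimal stdlib-only sketch of the three-level feature described above (the level names and reference level are just illustrative assumptions):

```python
# Hypothetical three-level feature from the question.
values = ["not_applicable", "no", "yes", "no"]

# Label encoding: a single numeric column; it imposes the
# ordering -1 < 0 < 1, so a linear model treats the step from
# -1 to 0 the same as the step from 0 to 1.
label_map = {"not_applicable": -1, "no": 0, "yes": 1}
label_encoded = [label_map[v] for v in values]

# Dummy (n-1) encoding with "yes" as the reference level:
# two 0/1 columns, no ordering implied between the levels.
def dummy_encode(v):
    return {"feature.not_applicable": int(v == "not_applicable"),
            "feature.no": int(v == "no")}

dummy_encoded = [dummy_encode(v) for v in values]

print(label_encoded)     # [-1, 0, 1, 0]
print(dummy_encoded[0])  # {'feature.not_applicable': 1, 'feature.no': 0}
```

In the dummy version, "yes" is represented by both indicator columns being 0, so each level gets its own coefficient relative to the reference level.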

Best Answer

It seems that "label encoding" just means using numbers for labels in a numeric vector, which is close to what is called a factor in R. Whether you should use such label encoding does not depend on the number of unique levels; it depends on the nature of the variable (and to some extent on the software and the model/method to be used). Coding should be seen as part of the modeling process, not just as some preprocessing!

Similar questions have been asked before, and you can find some good questions and answers here. But in short:

  1. If the levels are ordered, you can use numerical encoding ("label encoding"), but make sure the numbers are assigned in the correct order.

  2. If not ordered, you need dummy variables.

  3. For binary variables, like Sex, it does not matter if you code them as numeric 0/1 or as a factor; in both cases they will be treated the same way in a model.

  4. If one variable has a value "not applicable" (like being pregnant for men), then see How do you deal with "nested" variables in a regression model?

  5. If you have categories with very many levels see Principled way of collapsing categorical variables with many levels?

  6. Most of the theory and practice around categorical variables was developed in the context of linear models, GLMs, or at least models with some linear components. Trees and forests are not in this class, so they might require new/different thinking, and may depend much more on the software. See for instance Dropping one of the columns when using one-hot encoding and Random Forest Regression with sparse data in Python.