Solved – Handling categorical variables in various ML algorithms

machine learning, python, r, random forest, scikit learn

I have read in many places that decision trees and random forests, if deep enough, can handle categorical variables without one-hot encoding.

1) What is special about these algorithms that they can handle categorical variables without one-hot encoding?

2) Are there specific algorithms for which we use dummy encoding (n-1 columns created for a categorical variable) rather than one-hot encoding (n columns created)? If one column in a one-hot encoding carries information that can be derived from the other columns, why would we ever prefer one-hot encoding, and why does the concept even exist?

3) Why does it say "if deep enough"?

Any helpful resources, links, videos or your own explanations are welcome. I want to clear this doubt, once and for all.

I found a similar question, but it doesn't answer everything I want to know:
https://datascience.stackexchange.com/questions/5226/strings-as-features-in-decision-tree-random-forest/19829#19829

AN6U5's answer there says that Random Forest does not require one-hot encoding, and I have read many similar answers saying the same.

Also, what algorithms require one-hot encoding?

Best Answer

  1. If the categorical variable is non-numeric, most software (like R) will convert the one column into n columns for you. So the statements you have read may not refer to anything special about the mathematical algorithm, but simply to the fact that the function or program you are using handles the one-hot encoding automatically.

  2. A reason for converting a categorical variable into n-1 dummy variables is an issue in OLS regression known as perfect multicollinearity: when the model has an intercept, the n one-hot columns sum to the constant column, so the coefficients are not uniquely determined. In OLS regression, the excluded n-th category then serves as the reference level for the n-1 remaining dummy variables. Perfect multicollinearity is NOT an issue in decision tree algorithms, so you can use one-hot encoding there. In fact, it is better to use one-hot encoding, because without the n-th column the tree cannot split on that category directly.

  3. You'll have trouble answering this question, as "depth" has an arbitrary meaning. The only time it makes sense to leave a categorical variable as one column in your data is when the variable is numeric and its values have an order. For instance, 2 is more than 1, 3 is more than 2, and so on. At least, this has been my experience.
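
To make points 2 and 3 concrete, here is a minimal sketch in Python with NumPy. The data, the `encode` helper and the `size_order` mapping are all hypothetical and exist only for illustration. It builds an OLS design matrix (intercept plus encoded feature) both ways and compares the column count with the matrix rank. It then shows an ordered categorical kept as a single integer column.

```python
import numpy as np

# Hypothetical toy data: one categorical feature with n = 3 levels.
colors = ["red", "green", "blue", "green", "red", "blue"]
levels = sorted(set(colors))  # ['blue', 'green', 'red']


def encode(values, levels, drop_first=False):
    """Return a 0/1 indicator matrix with one column per kept level."""
    kept = levels[1:] if drop_first else levels
    return np.array([[1.0 if v == lvl else 0.0 for lvl in kept] for v in values])


one_hot = encode(colors, levels)                 # n columns
dummy = encode(colors, levels, drop_first=True)  # n - 1 columns, 'blue' is the reference

# OLS design matrices: an intercept column of ones plus the encoded feature.
intercept = np.ones((len(colors), 1))
X_one_hot = np.hstack([intercept, one_hot])
X_dummy = np.hstack([intercept, dummy])

# The one-hot columns sum to the intercept column, so X_one_hot has fewer
# independent columns than it has columns (perfect multicollinearity).
rank_one_hot = np.linalg.matrix_rank(X_one_hot)
rank_dummy = np.linalg.matrix_rank(X_dummy)
print(X_one_hot.shape[1], rank_one_hot)  # prints: 4 3
print(X_dummy.shape[1], rank_dummy)      # prints: 3 3

# Point 3: an ordered categorical can stay as a single integer column.
size_order = {"small": 0, "medium": 1, "large": 2}  # assumed ordering
sizes = ["medium", "small", "large"]
size_codes = [size_order[s] for s in sizes]
print(size_codes)  # prints: [1, 0, 2]
```

In practice you would more likely reach for an existing encoder, for example `pandas.get_dummies` with `drop_first=True` for the n-1 variant. The hand-written helper above is only there to keep the sketch self-contained.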