Most algorithms (linear regression, logistic regression, neural networks, support vector machines, etc.) require some sort of encoding for categorical variables. This is because most algorithms only take numerical values as inputs.
Algorithms that do not require an encoding are those that can deal directly with joint discrete distributions (Markov chains, Naive Bayes, Bayesian networks), as well as tree-based methods.
Additional comments:
Firstly, you certainly do not need to add explanatory vectors that are linear combinations of existing explanatory vectors in the model. This leads to identifiability problems, and at best these will be handled by the algorithm ignoring one of your inputs. Thus, when working with a factor variable with $k$ categories, you would either use $k$ indicators and no intercept term, or use $k-1$ indicators with an intercept term. There are some advantages and disadvantages to both methods, depending on what you want to do.
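The linear dependence behind this rule can be made concrete with a minimal sketch. The factor levels and observations below are hypothetical, chosen only for illustration. The point is that the $k$ indicator columns always sum to the intercept column of ones, so a model cannot hold both the full set of indicators and an intercept.

```python
categories = ["a", "b", "c"]               # hypothetical factor levels (k = 3)
observations = ["a", "c", "b", "a", "c"]   # hypothetical data


def indicators(value, levels):
    """Return one 0/1 indicator per level for a single observation."""
    return [1 if value == level else 0 for level in levels]


# Full set of k indicators: every row sums to 1, i.e. the indicator
# columns add up to exactly the intercept column of ones.
full = [indicators(obs, categories) for obs in observations]
row_sums = [sum(row) for row in full]

# Option 1: k indicators and no intercept column.
design_no_intercept = full

# Option 2: an intercept column plus k-1 indicators
# (the first level, "a", is dropped and becomes the baseline).
design_with_intercept = [[1] + row[1:] for row in full]

print(row_sums)
print(design_no_intercept)
print(design_with_intercept)
```

Both design matrices have three columns and carry the same information; they differ only in how the coefficients are read.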
Using $k$ indicators and no intercept term: With this method the coefficients corresponding to each of the indicators in your model are interpreted as absolute effects, and are not relative to any base category. This can be useful if you want to plot all of the estimated coefficients for the categories, and you don't want any of the estimated effects to be forced to a baseline of zero. Sometimes you want to see the estimated "total effect" for a particular category with its associated confidence interval, and this is easiest to obtain if you fit the model with this method. (You can still get it from the other method, but it requires some mucking around.)
The downside of this method is that you have to be very careful when looking at ANOVA outputs and other outputs that compare your model to a null model. Since your specified model has no intercept term, these outputs will generally compare your model to a null model with no intercept term, so that the null model is effectively just white noise with zero mean. This means that your ANOVA outputs and other similar outputs give comparisons to a very poor model, and the success of your model will appear overstated.
As an example, suppose you build a regression model of height using sex as the explanatory variable, and you code it so that there is no intercept, but there are two indicators, for males and females. In this case, plotting your estimated coefficients for the two categories has a simple and natural interpretation -- they each represent estimates of the mean height of that sex. This is a nice aspect of the method. However, if you look at the ANOVA output for the model, it will be made against a null model that postulates heights of all people as white noise (with zero mean) rather than as having a non-zero mean but with no sex difference. This means that your ANOVA outputs will be very misleading.
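A small numerical sketch of this example, using made-up heights (the numbers are hypothetical, not real data). With one indicator per sex and no intercept, the least-squares coefficients are simply the group means. The sketch then compares the fit against both possible null models to show how much the choice matters.

```python
# Hypothetical heights in cm, made up purely for illustration.
heights = {"male": [178, 182, 175, 180], "female": [165, 160, 168, 163]}

# With one indicator per sex and no intercept, the indicator columns never
# overlap, so least squares reduces to a separate mean for each group.
coef = {sex: sum(ys) / len(ys) for sex, ys in heights.items()}

all_y = [y for ys in heights.values() for y in ys]
grand_mean = sum(all_y) / len(all_y)

# Residual sum of squares of the fitted model.
rss = sum((y - coef[sex]) ** 2 for sex, ys in heights.items() for y in ys)

# Null model without intercept: predicts 0 for everyone.
tss_zero_mean = sum(y ** 2 for y in all_y)
# Null model with intercept: predicts the grand mean for everyone.
tss_grand_mean = sum((y - grand_mean) ** 2 for y in all_y)

r2_vs_zero_mean = 1 - rss / tss_zero_mean
r2_vs_grand_mean = 1 - rss / tss_grand_mean

print(coef)
print(r2_vs_zero_mean, r2_vs_grand_mean)
```

On this toy data the comparison against the zero-mean null comes out very close to 1, while the comparison against the grand-mean null is noticeably lower, although the fitted model is the same in both cases.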
Using $k-1$ indicators plus an intercept term: This is the opposite case, so the advantages and disadvantages are reversed. Under this method, the coefficients for categories of your factor variable represent relative effects, which are differences in effect size between the present category and the baseline category. This can be useful if you want to look at relative effects, but it means that you do not have direct access to the absolute effects, and these take a bit of mucking around to obtain.
The main upside of this method is that ANOVA output and other model-comparisons will compare your model to a null model with an intercept term, which is usually the comparison you want to make in these cases. This means that the outputs of these model comparisons will show the success or failure of your model against a baseline model that does not assume a zero mean for the response variable.
Continuing the above example, suppose you now build a regression model of height using sex as the explanatory variable, and you code it so that there is an intercept, and then an indicator for females (with males as the baseline category). In this case, plotting your estimated coefficients has a less simple and natural interpretation -- you get an estimate of the mean height of males (the baseline category), and an estimate of the mean difference in height between females and males. This is probably not the ideal presentation of that information, so this is a sub-optimal aspect of this method. On the other hand, if you look at the ANOVA output for the model, it will be made against a null model that allows a non-zero mean height but with no sex difference. This means that your ANOVA outputs will give a useful comparison to a simple base model.
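The same example can be sketched numerically with an intercept and a single female indicator. The heights are again made up for illustration, and the variable names are my own. In this sketch the intercept comes out as the mean of the baseline category, and the indicator coefficient as the difference between the two group means.

```python
# Hypothetical heights in cm, made up purely for illustration.
male = [178, 182, 175, 180]
female = [165, 160, 168, 163]

y = male + female
x = [0] * len(male) + [1] * len(female)  # indicator: 1 = female, 0 = male (baseline)

n = len(y)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form simple least squares with an intercept.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)

female_effect = s_xy / s_xx                # relative effect: female mean minus male mean
intercept = y_bar - female_effect * x_bar  # mean height of the baseline category (males)

# The "mucking around" needed to recover the absolute effect for females.
female_mean = intercept + female_effect

print(intercept, female_effect, female_mean)
```

Recovering the point estimate of the absolute effect is a single addition here; its confidence interval takes more work, because it also involves the covariance between the two estimated coefficients.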
Best Answer
Imagine you have five different classes, e.g. ['cat', 'dog', 'fish', 'bird', 'ant']. With one-hot-encoding you would represent the presence of 'dog' as a five-dimensional binary vector like [0,1,0,0,0]. With multi-hot-encoding (also commonly called binary encoding) you would first label-encode your classes, so that a single number represents the presence of a class (e.g. 1 for 'dog'), and then convert the numerical labels to binary vectors of size $\lceil\log_2 5\rceil = 3$. Examples: 'cat' = 0 becomes [0,0,0], 'dog' = 1 becomes [0,0,1], 'fish' = 2 becomes [0,1,0], 'bird' = 3 becomes [0,1,1], and 'ant' = 4 becomes [1,0,0].
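The label-then-binary conversion described above can be sketched in a few lines of plain Python. The function name `encode` is my own and not taken from any library; labels are assigned by position in the class list, which is an assumption consistent with 'dog' being 1.

```python
from math import ceil, log2

classes = ['cat', 'dog', 'fish', 'bird', 'ant']

# Step 1: label-encode, assigning each class its position in the list.
label = {name: i for i, name in enumerate(classes)}

# Step 2: number of bits needed to tell all the classes apart.
width = ceil(log2(len(classes)))


def encode(name):
    """Return the class label as a fixed-width list of binary digits."""
    return [int(bit) for bit in format(label[name], f'0{width}b')]


for name in classes:
    print(name, label[name], encode(name))
```

The vector length grows with the logarithm of the number of classes, whereas one-hot encoding grows linearly.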
This representation is basically the middle way between label-encoding, where you introduce false class relationships (0 < 1 < 2 < ... < 4, thus 'cat' < 'dog' < ... < 'ant') but only need a single value to represent class presence, and one-hot-encoding, where you need a vector of size $n$ (which can be huge!) to represent all classes but have no false relationships.
Note: multi-hot-encoding introduces false additive relationships, e.g. [0,0,1] + [0,1,0] = [0,1,1], that is 'dog' + 'fish' = 'bird'. That is the price you pay for the reduced representation.
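The false additive relationship can be checked directly. This is a minimal sketch with helper names of my own choosing (`binary_code`, `one_hot`, `add`); it also shows that the same collision cannot occur with one-hot vectors.

```python
classes = ['cat', 'dog', 'fish', 'bird', 'ant']
width = 3  # ceil(log2(5)) bits are enough for five classes


def binary_code(name):
    """Binary vector of the class's label (its position in `classes`)."""
    return [int(bit) for bit in format(classes.index(name), f'0{width}b')]


def one_hot(name):
    """Vector with a single 1 at the class's position."""
    return [1 if name == c else 0 for c in classes]


def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]


binary_codes = [binary_code(c) for c in classes]
one_hot_codes = [one_hot(c) for c in classes]

# 'dog' + 'fish' lands exactly on the code for 'bird'.
dog_plus_fish = add(binary_code('dog'), binary_code('fish'))
collides_binary = dog_plus_fish in binary_codes

# The sum of two different one-hot vectors contains two ones,
# so it can never equal a valid one-hot code.
collides_one_hot = add(one_hot('dog'), one_hot('fish')) in one_hot_codes

print(dog_plus_fish, collides_binary, collides_one_hot)
```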