A note on terminology:
As far as I am aware (unfortunately, there are a lot of blogs written by people who overlook the subtle differences, and thus misinformation spreads):
What you described, generating a map from each unique value in a string column to an integer, is usually called label (or integer) encoding, not one-hot encoding.
One-hot encoding is making K new columns (in which K is the number of unique values), of which exactly one column per row must be one. Dummy coding, in the strict statistical sense, keeps K - 1 of these columns and treats the omitted level as the reference.
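As a minimal illustration of the difference in pandas (the column name animal and the data are made up for the example):

```python
import pandas as pd

df = pd.DataFrame({"animal": ["dog", "cat", "horse", "cat"]})

# Integer (label) encoding: map each unique value to an integer.
# The resulting order is arbitrary, but downstream models will treat
# it as if it were meaningful.
df["animal_code"] = df["animal"].astype("category").cat.codes

# One-hot encoding: K binary columns, exactly one of which is 1 per row.
one_hot = pd.get_dummies(df["animal"], prefix="animal")
print(pd.concat([df, one_hot], axis=1))
```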
In the "dog, cat, horse" example, when using a decision tree, consider the following example. Perhaps your target variable is "has it ever meowed?". Clearly what you want your decision tree to do is be able to ask the question "is it a cat? (yes/no)".
If you integer-encode, such that dog -> 0, cat -> 1, horse -> 2, the tree can't isolate all of the cats using one question, because decision trees always split on a threshold ("is feature x less than some value t?") and cat sits in the middle of an arbitrary ordering.
If you're using logistic regression, it likewise can't assign higher probabilities of meowing to cats: a single coefficient on the integer code can only push the predicted probability monotonically up or down along that arbitrary ordering.
If you one-hot encode, the tree can explicitly ask the question "is the column which signifies cat greater than 0.5?", thus splitting your data into cats and not-cats.
If you use logistic regression, your optimiser can learn that the coefficient on this column should be positive.
Thus, in my opinion, whenever you have categorical data with no implicit ordinality, always one-hot encode (or dummy-code); never integer-encode.
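Here is a sketch of that behaviour with scikit-learn (data and feature names invented for illustration); printing the fitted trees shows that the integer-coded tree needs two threshold splits to carve the cats out of the middle of the ordering, while the one-hot tree needs a single split on the cat column:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame({"animal": ["dog", "cat", "horse"] * 20})
y = (df["animal"] == "cat").astype(int)  # target: "has it ever meowed?"

# Integer encoding that puts cat in the middle of the arbitrary ordering.
X_int = df["animal"].map({"dog": 0, "cat": 1, "horse": 2}).to_frame("animal_code")
tree_int = DecisionTreeClassifier(random_state=0).fit(X_int, y)
print(export_text(tree_int, feature_names=["animal_code"]))

# One-hot encoding: a single split on the cat column separates the classes.
X_hot = pd.get_dummies(df["animal"], prefix="animal")
tree_hot = DecisionTreeClassifier(random_state=0).fit(X_hot, y)
print(export_text(tree_hot, feature_names=list(X_hot.columns)))
```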
In the case where your data has high cardinality, this can cause problems, especially if the number of examples of each level is tiny. But that is a problem you can't really solve: your information is simply too detailed for the size of your training data, and using it would lead to over-fitting.
Nonetheless, one way to mitigate this is to do some manual clustering (or actual clustering), in which you make a synthetic column that can take fewer values, and many of the unique values of the original column map to the same value in the new column (e.g. dog, cat, horse -> mammal; pigeon, parrot, chicken -> bird). This makes it easier for the algorithm to learn, and if there's enough data, it can split further within each cluster.
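A sketch of that collapsing step in pandas, with a hand-made mapping (the names are taken from the example above):

```python
import pandas as pd

# Hand-made (or cluster-derived) mapping from fine-grained levels
# to a coarser synthetic column.
species_to_group = {
    "dog": "mammal", "cat": "mammal", "horse": "mammal",
    "pigeon": "bird", "parrot": "bird", "chicken": "bird",
}

df = pd.DataFrame({"species": ["cat", "parrot", "horse", "chicken"]})
df["animal_group"] = df["species"].map(species_to_group)

# One-hot encode the coarser column: far fewer columns than encoding
# the original species directly.
print(pd.get_dummies(df["animal_group"], prefix="group"))
```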
No, it does not make sense. If you have a categorical variable Cat with 10 levels A, B, C, ..., J that you one-hot encode, then the variable is Cat, and if you want feature selection, you should choose Cat or omit Cat, with all or none of its one-hot-encoded columns. Omitting just some of the columns will change the meaning of the model/variable.
More concretely, if you, as usual, drop one of the columns as a reference level, say A, and then later your feature extraction drops C, that makes the model assume that levels A and C act identically, and that might be wrong. Also, if you at the outset choose some other reference level, that might lead to very different results.
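A small pandas sketch of why (data invented, using Cat and its levels from above): once A is the reference, the design rows for level A are all zeros, so additionally dropping the C column makes C-rows indistinguishable from A-rows.

```python
import pandas as pd

df = pd.DataFrame({"Cat": ["A", "B", "C", "A", "C"]})

# Usual dummy coding: drop_first=True makes level A the reference.
X = pd.get_dummies(df["Cat"], prefix="Cat", drop_first=True)
print(X)  # columns Cat_B and Cat_C; rows with level A are all zeros

# "Feature selection" that also removes Cat_C: rows with level C now
# have the same all-zero design row as rows with level A, forcing the
# model to treat the two levels identically.
print(X.drop(columns="Cat_C"))
```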
This is already discussed here; see especially Can I ignore coefficients for non-significant levels of factors in a linear model?, Is it advisable to drop certain levels of a categorical variable?, and Frank Harrell's answer to Can a factor be changed to binomial levels to achieve model validation and extract insignificant variables?
If the problem is that there are very many levels and you want some data-driven way of collapsing them, then see Principled way of collapsing categorical variables with many levels?
Best Answer
Some of the answers here seem more complicated or highfalutin than may be needed or indeed justified. For example, in many projects short of, say, Ph.D. level, getting into imputation may be beyond the time available or the skill level expected.
Also, the title says "N/A" which often means "not applicable". In the question itself the OP treats N/A as the researcher's coding for blank answers, in effect "no answer". Some of my answer covers N/A as a deliberate and allowed reply to a question. There is a lot of turbulent water between these interpretations.
It seems to me that the simplest possibilities have not yet been mentioned at all.
1. N/A is just another category. A side-effect of that: the scale is no longer ordinal.
2. Leave the N/As out of the data analysis. That is ethically sensitive as well as statistically sensible. Seriously, if I fill in a questionnaire and am given N/A as an option and then use it, that's my right within the survey I undertake. It should be taken that I meant what I said. Suppose that I really don't play golf or use Twitter or whatever it is, and that is why a question on preference for golf clubs or my Twitter attitudes really doesn't apply. I don't want some fool of a researcher analysing the data to presume or assume that I "really" meant something else and that the rest of my answers somehow are informative on what that might be. This also applies if I leave a question unanswered, given scope to do that.
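Both possibilities are cheap to express; a minimal pandas sketch, with made-up data and a hypothetical column q1:

```python
import pandas as pd

df = pd.DataFrame({"q1": ["agree", "N/A", "neutral", "disagree", "N/A"]})

# Possibility 1: treat N/A as just another category.
print(df["q1"].value_counts())

# Possibility 2: leave the N/As out of the analysis entirely.
print(df.loc[df["q1"] != "N/A", "q1"].value_counts())
```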
There is one alternative that occasionally may make sense, which is to guess that N/A is in some sense equivalent to the neutral category. This has to be argued carefully, but the spirit is, perhaps, that for some questions they may both be flavours of "don't care".
Whatever the research strategy, if N/As are at all common, then a really careful study will include some kind of sensitivity analysis and explore different ways of analysing them, and explain how much difference it makes to the results of the study. Researchers often prefer not to do this, partly because it tends to underline how lousy the data are.