Solved – Combining ordinal and categorical (one-hot encoded) variables in one model

categorical-encoding, feature-engineering

I have a dataset with a combination of ordinal and categorical (strictly discrete) variables. I want to predict another discrete variable (it can probably be restricted to binary, but ideally it is ordinal in general). Let's not be bound to a specific model; say some canonical one from scikit-learn, like tree-based methods or logistic regression.

I encoded the ordinal variables on their natural scale, so they are just integers (with a relatively low maximum of around 50, but usually much smaller, like 0-9). The categorical variables are one-hot encoded (in pandas, something like pd.get_dummies).
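Purely for illustration, a minimal sketch of this encoding (the column names and values are made up):

```python
import pandas as pd

# Hypothetical toy frame: "quality" is ordinal, "color" is nominal.
df = pd.DataFrame({
    "quality": ["low", "medium", "high", "medium"],  # ordinal
    "color":   ["red", "green", "blue", "red"],      # categorical
})

# Map the ordinal variable onto its natural integer scale.
quality_scale = {"low": 0, "medium": 1, "high": 2}
df["quality"] = df["quality"].map(quality_scale)

# One-hot encode the nominal variable; the ordinal column passes through.
X = pd.get_dummies(df, columns=["color"])
print(X)
```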

I am wondering if I can use this mixed-type dataset as input to a model. I would say it depends on the algorithm used. E.g., logistic regression should be able to handle this mixed dataset well: the $i$-th coefficient $\alpha_i$ of a categorical feature is multiplied by either 0 or 1, while the $j$-th coefficient $\alpha_j$ of an ordinal feature is multiplied by a natural number. For decision trees it basically doesn't matter (they may just need fewer splits on the ordinals).
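A quick sketch of what I mean, assuming a made-up design matrix where the first column is ordinal (0-9) and the remaining three are one-hot:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Mixed design matrix: first column ordinal, last three one-hot.
X = np.array([
    [3, 1, 0, 0],
    [7, 0, 1, 0],
    [1, 0, 0, 1],
    [9, 1, 0, 0],
    [4, 0, 1, 0],
    [6, 0, 0, 1],
])
y = np.array([0, 1, 0, 1, 0, 1])

# Both estimators accept the mixed matrix as-is: logistic regression
# fits one coefficient per column, the tree just picks split points.
for model in (LogisticRegression(), DecisionTreeClassifier(max_depth=2)):
    model.fit(X, y)
    print(type(model).__name__, model.predict(X))
```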

The reason I don't want to one-hot encode the ordinal variables is that it loses the ordering information (and increases the dimension). I am not asking whether the model performs better; that is, of course, best checked by trying and evaluating. I am just asking whether there is some standard approach to this and whether my reasoning is correct or somehow fundamentally flawed.

Best Answer

Yes, your reasoning is correct, and you could make both of the mentioned variants work. A few things I'd like to point out briefly:

  • Think about normalizing the data you obtain from whichever encoding you use. This will have a different impact on different models (e.g. more impact on logistic regression than on tree-based models); see the sketch after this list.

  • For the standard approach: there are multiple possibilities in use. As you pointed out, tree-based models can naturally deal with such data, and both plain categorical and one-hot encoded variables are actually used with them (see e.g. Applied Predictive Modeling by Max Kuhn and Kjell Johnson). But if you are dealing with other models, one-hot encoding or something similar might become necessary, as the model might otherwise infer different/wrong information from such features.

  • If you happen to have many features, think about employing feature selection first (e.g. feature filters or feature wrappers). You could also apply it after one-hot encoding the variables, as in the sketch below.
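A minimal sketch of how these points could be combined in a single scikit-learn pipeline (the column names and data are made up, and a simple filter-style selector stands in for whatever selection method you'd actually use):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical columns: "quality" is an already integer-encoded ordinal,
# "color" is a raw nominal variable.
df = pd.DataFrame({
    "quality": [0, 2, 1, 2, 0, 1],
    "color":   ["red", "green", "blue", "red", "green", "blue"],
})
y = [0, 1, 0, 1, 0, 1]

# Scale the ordinal column (matters for logistic regression, not for
# trees) and one-hot encode the nominal one in a single transformer.
preprocess = ColumnTransformer([
    ("ordinal", StandardScaler(), ["quality"]),
    ("nominal", OneHotEncoder(), ["color"]),
])

pipe = Pipeline([
    ("prep", preprocess),
    ("select", SelectKBest(k=2)),  # filter-style feature selection
    ("model", LogisticRegression()),
])
pipe.fit(df, y)
print(pipe.predict(df))
```

Running the selection step after the encoding, as here, lets the filter keep or drop individual one-hot columns rather than whole categorical variables.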
