Solved – the best way to handle ordinal features with numeric values in Python?

categorical-encoding, machine learning, ordinal-data, python

What is the best way to encode an ordinal feature?

  1. Is it by transforming it with OneHotEncoder, so that values ranging from, say, 1 to 7 each become the header of a new feature column?

  2. Or by using StandardScaler() to scale the values between -1 and 1? (Both options are sketched just below.)
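A minimal sketch of both options, assuming scikit-learn and a single ordinal column coded 1 to 7 (the toy values below are made up):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    X = np.array([[1], [3], [7], [2], [5]])  # ordinal codes 1..7

    # Option 1: one-hot encode. Every level becomes its own indicator
    # column, and the ordering of the levels is discarded.
    onehot = OneHotEncoder(handle_unknown="ignore")
    X_onehot = onehot.fit_transform(X).toarray()

    # Option 2: standardize. The codes stay in one numeric column with
    # zero mean and unit variance (not a fixed [-1, 1] range; that would
    # be MinMaxScaler), which implicitly treats 1->2, 2->3, ... as equal steps.
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X.astype(float))

    print(X_onehot)
    print(X_scaled)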

Best Answer

The problem isn't really with how you scale it but with what you do with it once it's scaled (if you scale it at all).

There are two usual methods of dealing with ordinal independent variables: 1) Treat them as continuous (both methods of scaling that you propose seem to do this) or 2) Treat them as categorical and ignore the ordering.

The problem with 1) is that it may not be reasonable. It assumes that you can impose some sort of interval scale on the ordinal data. The two scalings impose different intervals, but both impose an interval. That might be reasonable (e.g. it usually is with Likert-scale data), but it might not be (e.g. if 1 is "10 times a day", 2 is "a few times a day", ... and 7 is "never").
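To make that concrete, here is a small sketch of what "imposing an interval" means; the frequency labels and the numeric scores below are invented for illustration:

    import pandas as pd

    freq = pd.Series([1, 2, 7, 1, 4])  # ordinal codes 1..7

    # Equal spacing: code k is scored as k, so the step from 1 to 2 counts
    # the same as the step from 6 to 7. That may be unreasonable for labels
    # like "10 times a day", "a few times a day", ..., "never".
    equal_scores = freq.astype(float)

    # Domain-chosen spacing: score each code by an (assumed) rate per day,
    # so adjacent codes are no longer equally far apart.
    times_per_day = {1: 10.0, 2: 3.0, 3: 1.0, 4: 0.5, 5: 0.1, 6: 0.02, 7: 0.0}
    custom_scores = freq.map(times_per_day)

    print(pd.DataFrame({"code": freq, "equal": equal_scores, "custom": custom_scores}))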

The problem with 2) is that it ignores the ordering altogether.

If you are doing regression and this is an independent variable, you could look into optimal scaling. I don't know if it's available in Python, but it's in R (optiscale package) and SAS (PROC TRANSREG).
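I don't know of a Python port of optiscale either. As a very rough, assumption-laden substitute, scikit-learn's IsotonicRegression can learn monotone scores for the levels from the data; this is not PROC TRANSREG's optimal scaling, just a monotone rescaling sketch on made-up data:

    import numpy as np
    from sklearn.isotonic import IsotonicRegression

    rng = np.random.default_rng(0)
    codes = rng.integers(1, 8, size=200)             # ordinal predictor, levels 1..7
    y = 2.0 * np.log(codes) + rng.normal(size=200)   # toy response

    # Fit a monotone (non-decreasing) mapping from the codes to the response;
    # the fitted values give data-driven scores for each level.
    iso = IsotonicRegression(increasing=True)
    scores = iso.fit_transform(codes.astype(float), y)

    # One learned score per level, replacing the naive 1..7 spacing.
    for level in range(1, 8):
        print(level, round(scores[codes == level].mean(), 3))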

If you are doing something like chi-square, then you could look into the Jonckheere-Terpstra test, which I think is quite useful for this sort of thing.

And, for any particular ordinal scale, there may be some particular sensible scaling.
