Solved – Removing skew from ordinal variables

Tags: categorical data, categorical-encoding, data transformation, ordinal-data, skewness

I'm working on the Ames housing data set and wondering how to deal with some string-valued variables.

The variable LandSlope can take the values Sev for "severe", Mod for "moderate" and Gtl for "gentle". This indicates that it is actually an ordinal variable, so we should use label encoding to retain the ordering rather than one-hot encoding.

I apply the label encoding:

data['LandSlope']=data['LandSlope'].replace(['Sev', 'Mod', 'Gtl'], [2, 1, 0])
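As a hedged alternative to `replace`, the same encoding can be sketched with an explicit mapping or an ordered categorical, which keeps the level order in one place. The toy data frame and the column names `LandSlope_code` and `LandSlope_cat_code` below are hypothetical stand-ins, not part of the Ames data set.

```python
import pandas as pd

# Hypothetical toy data standing in for the Ames LandSlope column.
data = pd.DataFrame({"LandSlope": ["Gtl", "Gtl", "Mod", "Sev", "Gtl", "Mod"]})

# Explicit mapping: a label missing from the dict becomes NaN,
# so unexpected values are easy to spot.
slope_order = {"Gtl": 0, "Mod": 1, "Sev": 2}
data["LandSlope_code"] = data["LandSlope"].map(slope_order)

# Ordered categorical: the order is declared once and the integer
# codes follow from it.
slope_cat = pd.Categorical(
    data["LandSlope"], categories=["Gtl", "Mod", "Sev"], ordered=True
)
data["LandSlope_cat_code"] = slope_cat.codes

print(data)
```

Both columns hold the same integer codes; the ordered categorical also retains the original labels for plotting and grouping.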

Now I have a new numerical variable that takes on values from the set {2, 1, 0}. Analysing the distribution of this variable I find there is significant right-skewness:

[Figure: distribution and quantile plot of the encoded LandSlope variable]

By transforming the variable with the Box-Cox transformation I could reduce the skewness from 4.9733 to 4.2117 (as depicted above).

However, looking at the quantile plot, I'm wondering if this is actually advisable. Is it recommended to reduce the skewness of variables after applying label encoding?

Best Answer

Partially answered in comments, a short summary:

You have an ordinal predictor variable, and how to represent it depends in part on how you will use it. If you just use it as a numerical variable in a linear regression, with values like $1,2,3$ (or $-1,0,1$), you are assuming that, in its effect on the target variable, Mod lies exactly halfway between Gtl and Sev. But if you model it with a monotone spline (or even with a quadratic term), that assumption is avoided.
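A minimal sketch of this point, using simulated data rather than the Ames set: the level effects, sample size and noise level are assumptions chosen for illustration, and a quadratic term stands in for the monotone spline. With only three levels, the quadratic model is saturated, so it reproduces the per-level means, while the purely linear fit forces equal steps between levels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical simulated data: codes 0=Gtl, 1=Mod, 2=Sev, with level effects
# that are deliberately NOT equally spaced (Mod sits close to Gtl).
x = rng.choice([0, 1, 2], size=600, p=[0.6, 0.25, 0.15]).astype(float)
true_effect = np.array([0.0, 0.2, 3.0])
y = true_effect[x.astype(int)] + rng.normal(scale=0.5, size=x.size)

def fit_predict(design, target, design_at_levels):
    """Ordinary least squares fit, then predictions at the three levels."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return design_at_levels @ coef

levels = np.array([0.0, 1.0, 2.0])
ones = np.ones_like(x)

# Linear coding: one slope, so Mod is forced to lie halfway between Gtl and Sev.
linear_pred = fit_predict(
    np.column_stack([ones, x]), y, np.column_stack([np.ones(3), levels])
)

# Adding a quadratic term frees the middle level.
quad_pred = fit_predict(
    np.column_stack([ones, x, x**2]),
    y,
    np.column_stack([np.ones(3), levels, levels**2]),
)

group_means = np.array([y[x == k].mean() for k in levels])

print("observed level means:", group_means.round(3))
print("linear fit at levels:", linear_pred.round(3))
print("quadratic fit at levels:", quad_pred.round(3))
```

The linear predictions are always equally spaced, whatever the data say; the quadratic predictions match the observed level means.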

But all of this has little to do with skewness, and the use of a Box-Cox transform is difficult to understand. For more detail and opinion see all the excellent comments.
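One way to see why Box-Cox does little here is the sketch below, which uses hypothetical level counts (not the actual Ames counts) and hand-written helper functions `box_cox` and `skewness`. A variable with three distinct values still has three distinct values after any monotone transform; only the spacing between them changes, so the mass piled on the lowest level stays where it is.

```python
import numpy as np

def box_cox(x, lam):
    """Box-Cox transform for strictly positive x: (x**lam - 1) / lam, or log(x) when lam == 0."""
    x = np.asarray(x, dtype=float)
    return np.log(x) if lam == 0 else (x**lam - 1.0) / lam

def skewness(x):
    """Third central moment divided by the cubed (population) standard deviation."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    return (centered**3).mean() / (centered**2).mean() ** 1.5

# Hypothetical counts, chosen only to mimic a variable dominated by its lowest level.
codes = np.repeat([0, 1, 2], [950, 40, 10])
shifted = codes + 1  # Box-Cox needs strictly positive input, so shift 0/1/2 to 1/2/3

results = {}
for lam in (-2.0, -1.0, 0.0, 0.5, 1.0):
    transformed = box_cox(shifted, lam)
    results[lam] = (np.unique(transformed).size, skewness(transformed))
    print(
        f"lambda={lam:+.1f}  distinct values={results[lam][0]}  "
        f"skewness={results[lam][1]:.3f}"
    )
```

For every lambda tried, the transformed variable keeps three distinct values and remains right-skewed, because the transform cannot spread out observations that share the same code.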
