Solved – Removing skew from ordinal variables

categorical datacategorical-encodingdata transformationordinal-dataskewness

I'm working on the ames housing data set and wondering how to deal with some string-valued variables.

The variable LandSlope can take the values Sev for "severe", Mod for "moderate" and Gtl for "gentle. This indicates that it is actually an ordinal value and we should use label encoding to retain this ordinality, rather than using One-Hot-Encoding.

I apply the label encoding

data['LandSlope']=data['LandSlope'].replace(['Sev', 'Mod', 'Gtl'], [2, 1, 0])

Now I have a new numerical variable that takes on values from the set {2, 1, 0}. Analysing the distribution of this variable I find there is significant right-skewness:

By transforming the variable with the Box-Cox transformation I could bring reduce the skewness from from 4.9733 to 4.2117. ( As depicted above).

However looking at the quantile plot, I'm wondering if this is actually advised. Is it recommended to reduce the skewness of variables after applying LabelEncoding?

Best Answer

Partially answered in comments, a short summary:

You have an ordinal predictor variable, and how to represent it in part depends on how you will use it. If you just use it as an numerical variable in a linear regression, using values like $1,2,3$ (or $-1,0,1$), you are assuming that the difference (in effect on the target variable) that Mod is halfway between Gtl and Sev. But if you model with a monotone spline (or even with a quadratic term) such an assumption is avoided.

But all of this has little to do with skewness, and the use of a Box-Cox transform is difficult to understand. For more detail and opinion see all the excellent comments.

Related Solutions

Data Transformation – Choosing c to Remove Skew with log(x + c)

Because the intention is to do OLS, the choice of $c$ should be made in this context.

In general, we ought to fit $c$ simultaneously with the rest of the regression. A quick and dirty way to do this recognizes that the regression $R^2$ is proportional to the log likelihood, so we could seek a value of $c$ that maximizes $R^2$.

This is a special example of the problem of choosing among a parameterized family of transformations $y \to f(y; \theta)$ to achieve the best possible fit of $y$ to explanatory values $x$. This can be solved in R rather simply and directly:

xform <- function(f, theta, x, y, ...) {
  g <- function(theta) -summary(lm(f(y, theta) ~ x))$r.squared
  nlm(g, theta, ...)
}

(I am glossing over a somewhat delicate matter of choosing good starting values for the parameter: it is possible to obtain bad solutions with nlm otherwise. Standard methods of exploratory data analysis will produce decent starting values, but that's a subject for another day.)

As an example of the use of xform, let's generate some highly skewed data for $y$ for which the "started logarithm" $\log(y+c)$ will produce an unskewed distribution:

set.seed(17)
y <- sort(exp(rnorm(32, 4, 1))) + 100

Evidently $\log(y-100)$ is drawn iid from a Normal distribution.

I will apply xform to three choices of $x$:

Values from which $y$ differs by additive, homoscedastic residuals. In this case it would be a mistake to take the logarithm of $y$: it is a grossly incorrect model of the relationship between $x$ and $y$.
Values from which $y$ differs by multiplicative lognormal residuals (more or less). In this case, taking the logarithm of $y$ is a good idea because it leads to a model for which OLS regression is appropriate.
Constant values of $x$, so that in effect we are looking at $y$ outside of the regression context altogether.

In cases (1) and (2) I will plot the histogram of $y$ (to show it is highly skewed), the scatterplot of $y$ against $x$ (to exhibit the data), and the scatterplot of the transformed $y$ against $x$ with the OLS line superimposed, to see the result of the transformation. In the third case those scatterplots are meaningless, so I only report the value of $c$ found by xform.

set.seed(17)
par(mfrow=c(2,3))

y <- sort(exp(rnorm(32, 4, 1))) + 100
x <- y - rnorm(length(y), 0, 50)
hist(y)
plot(x,y)
const <- xform(function(y,c) log(y+c), 1-min(y), x, y)$estimate
plot(x, log(y + const), ylab=sprintf("log(y + %0.1f)", const))
abline(coef(fit1<-lm(log(y+const)~x)), col="Gray")

x <- log((1:length(y) + rnorm(length(y), 10, 3)))
hist(y)
plot(x,y)
const <- xform(function(y,c) log(y+c), 1-min(y), x, y)$estimate
plot(x, log(y + const), ylab=sprintf("log(y + %0.1f)", const))
abline(coef(fit2<-lm(log(y+const)~x)), col="Gray")

x <- rep(1, length(y))
const <- xform(function(y,c) log(y+c), 1-min(y), x, y)$estimate

Plots

The top row is the first case and the second row of plots are for the second case.

Please observe:

The $y$ values are identical in all three instances.
The $y$ values are constructed from a model in which $c=-100$.
The fitted value of $c$ in the first case, $1408392.5$, is essentially infinite. This indicates it's bad to be taking the logarithm at all for these $x$ values. (Adding this huge value of $c$ to $y$ before taking the logarithm basically does not change the shape of the data: that's why the two scatterplots in the top row look the same.)
The fitted value of $c$ in the second case, $-104.0$, is close to the value of $-100$ used to generate the data. (Repeated simulations indicate that the fitted value in the second case will be biased slightly low, averaging around $-105$.)
The fitted value of $c$ in the third case is $-115.8$, still close to the value used to generate the data.

If we were to use the "universal" value of $c$ found in the third case (essentially by ignoring the $x$ values), here is what the scatterplot would look like in conjunction with the $x$ values from case 1:

Scatterplot

For these particular $x$ values, the OLS line fit to the transformed $y$ values is a terrible description of the relationship between $y$ and $x$. Notice how it underestimates most values of $y$ but grossly overestimates a few of them for $x$ between $140$ and $200$.

In summary, if you want to transform the response variable for a regression (to achieve symmetry or linearity), you must account for the regression itself. This is because the regression only "cares" about the residuals, not the raw values of $y$. As the extraordinarily bad value of $c$ in the first case shows, ignoring this advice could produce awful results.

Solved – negative impact from imbalance/skew in predictor variables

Lack of balance is not bad for a saturated model. With two categorical variables this means having $ AB $ predictors ($ A $ is number of categories in the variable, $ B $ the second). If you include the interaction term $ x_1$ and $ x_2$ you should be fine. If this fit is too "noisy" then include it as a random effect in a mixed model - so you get the benefits of partial pooling.

The problem comes in when you are pooling relationships across categories (such as main effects only). If you are polling across two imbalanced groups, the larger group contributes more to the pooled estimate. This causes problems when the groups are actually different, and the imbalance is "unrepresentative" - that is, an artifact of the sampling mechanism that you obtained your data from (eg quota sampling, or "balancing" by subsampling your data).

Best Answer

Related Solutions

Data Transformation – Choosing c to Remove Skew with log(x + c)

Solved – negative impact from imbalance/skew in predictor variables

Related Question