Solved – Why is skewed data not preferred for modelling

modeling, skewness

Most of the time when people talk about variable transformations (for both predictor and response variables), they discuss ways to treat skewness in the data (such as the log transformation, the Box-Cox transformation, etc.). What I am not able to understand is why removing skewness is considered such a common best practice. How does skewness affect the performance of different kinds of models, such as tree-based models, linear models, and non-linear models? Which kinds of models are more affected by skewness, and why?

Best Answer

When removing skewness, transformations attempt to make the dataset follow the Gaussian (normal) distribution. The reason is simply that if the dataset can be transformed to be statistically close enough to Gaussian, then the largest possible set of tools becomes available. Tests such as ANOVA, the $t$-test, the $F$-test, and many others depend on the data having constant variance ($\sigma^2$) or following a Gaussian distribution.¹
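As a rough illustration (not part of the original answer), here is a minimal sketch, assuming NumPy and SciPy are available, that applies a Box-Cox transformation to a deliberately right-skewed sample and then checks how close the result is to normal with a Shapiro-Wilk test. The lognormal toy data and the seed are illustrative choices only.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=500)  # strongly right-skewed toy sample

print("skewness before:", stats.skew(x))

# Box-Cox requires strictly positive data; lambda is chosen by maximum likelihood.
x_bc, lam = stats.boxcox(x)
print("estimated lambda:", lam)
print("skewness after:", stats.skew(x_bc))

# Shapiro-Wilk: a high p-value means normality is not rejected.
print("Shapiro-Wilk p-value (before):", stats.shapiro(x).pvalue)
print("Shapiro-Wilk p-value (after): ", stats.shapiro(x_bc).pvalue)
```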

There are tests and models that are more robust¹ (for example, using Levene's test instead of Bartlett's test), but most tests and models that work well with other distributions require that you know which distribution you are working with, and they are typically appropriate for only that single distribution.
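To make the robustness point concrete, the following sketch (again assuming NumPy and SciPy, with arbitrary toy data) compares Bartlett's and Levene's tests on two heavy-tailed samples that genuinely have equal variances; Bartlett's test assumes normality and tends to reject too often on such data, while Levene's test is more forgiving.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Two groups drawn from the same heavy-tailed (non-Gaussian) distribution,
# so their population variances really are equal.
a = rng.standard_t(df=3, size=200)
b = rng.standard_t(df=3, size=200)

print("Bartlett p-value:", stats.bartlett(a, b).pvalue)
print("Levene p-value:  ", stats.levene(a, b).pvalue)
# Small Bartlett p-values here would be false alarms caused by non-normality,
# not by a real difference in variance.
```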

To quote the NIST Engineering Statistics Handbook:

In regression modeling, we often apply transformations to achieve the following two goals:

  1. to satisfy the homogeneity of variances assumption for the errors.
  2. to linearize the fit as much as possible.

Some care and judgment is required in that these two goals can conflict. We generally try to achieve homogeneous variances first and then address the issue of trying to linearize the fit.

and in another location

A model involving a response variable and a single independent variable has the form:

$$Y_i=f\left(X_i\right)+E_i$$

where $Y$ is the response variable, $X$ is the independent variable, $f$ is the linear or non-linear fit function, and $E$ is the random component. For a good model, the error component should behave like:

  1. random drawings (i.e., independent);
  2. from a fixed distribution;
  3. with fixed location; and
  4. with fixed variation.

In addition, for fitting models it is usually further assumed that the fixed distribution is normal and the fixed location is zero. For a good model the fixed variation should be as small as possible. A necessary component of fitting models is to verify these assumptions for the error component and to assess whether the variation for the error component is sufficiently small. The histogram, lag plot, and normal probability plot are used to verify the fixed distribution, location, and variation assumptions on the error component. The plot of the response variable and the predicted values versus the independent variable is used to assess whether the variation is sufficiently small. The plots of the residuals versus the independent variable and the predicted values are used to assess the independence assumption.

Assessing the validity and quality of the fit in terms of the above assumptions is an absolutely vital part of the model-fitting process. No fit should be considered complete without an adequate model validation step.
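The handbook's residual checks can be sketched in code as well. The example below (assuming NumPy, SciPy, and Matplotlib, with made-up data) fits a straight line and then draws a histogram of the residuals, a normal probability plot, and a residuals-versus-fitted plot, corresponding to the fixed-distribution, fixed-location, and fixed-variation checks described above.

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
x = np.linspace(1, 10, 100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)  # toy data

slope, intercept = np.polyfit(x, y, deg=1)  # simple linear fit
fitted = slope * x + intercept
resid = y - fitted

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
axes[0].hist(resid, bins=20)               # fixed distribution and location
axes[0].set_title("Histogram of residuals")
stats.probplot(resid, plot=axes[1])        # normality of the error component
axes[1].set_title("Normal probability plot")
axes[2].scatter(fitted, resid)             # fixed variation / independence
axes[2].axhline(0.0, color="k", lw=1)
axes[2].set_title("Residuals vs fitted")
plt.tight_layout()
plt.show()
```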


  ¹ Citations for the above claims (abbreviated):
    • Breyfogle III, Forrest W. Implementing Six Sigma
    • Pyzdek, Thomas. The Six Sigma Handbook
    • Montgomery, Douglas C. Introduction to Statistical Quality Control
    • Cubberly, William H. and Bakerjan, Ramon (eds.). Tool and Manufacturing Engineers Handbook: Desktop Edition