Data Transformation – Choosing c to Remove Skew with log(x + c)

Tags: data-transformation, skewness

I have data that I would like to log-transform before running OLS. The data include zeros, so I want to use $\log(x + c)$. I know a traditional choice is $c = 1$. I am wondering whether there is a way to have the data choose $c$ so that there is no longer any skew, perhaps using features like the sample mean or variance. Is there a formula?

Best Answer

Because the intention is to do OLS, the choice of $c$ should be made in this context.

In general, we ought to fit $c$ simultaneously with the rest of the regression. A quick and dirty way to do this recognizes that, for a given response, the regression $R^2$ increases with the log likelihood, so we could seek a value of $c$ that maximizes $R^2$.

This is a special example of the problem of choosing among a parameterized family of transformations $y \to f(y; \theta)$ to achieve the best possible fit of $y$ to explanatory values $x$. This can be solved in R rather simply and directly:

# Fit the transformation parameter theta by maximizing the R^2 of the
# regression of the transformed response f(y, theta) on x.  Extra
# arguments in ... are passed along to nlm.
xform <- function(f, theta, x, y, ...) {
  g <- function(theta) -summary(lm(f(y, theta) ~ x))$r.squared
  nlm(g, theta, ...)
}
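(The value returned by nlm is a list; its estimate component holds the optimized parameter, which is what the examples below extract, and its minimum component holds the attained value of $-R^2$.)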

(I am glossing over a somewhat delicate matter of choosing good starting values for the parameter: it is possible to obtain bad solutions with nlm otherwise. Standard methods of exploratory data analysis will produce decent starting values, but that's a subject for another day.)
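One simple hedge, which is my own addition rather than part of the answer, is to scan a coarse grid of admissible offsets and hand the best one to nlm as the starting value. A minimal sketch (the helper name start.c is hypothetical):

# Sketch of a grid scan for a starting value (my own helper, not from
# the answer): try n offsets that keep y + cc positive and return the
# one giving the largest R^2.
start.c <- function(x, y, n = 50) {
  grid <- seq(-min(y) + diff(range(y))/n, 10 * diff(range(y)), length.out = n)
  r2 <- sapply(grid, function(cc) summary(lm(log(y + cc) ~ x))$r.squared)
  grid[which.max(r2)]
}

It could then be used as xform(function(y,c) log(y+c), start.c(x, y), x, y).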

As an example of the use of xform, let's generate some highly skewed data for $y$ for which the "started logarithm" $\log(y+c)$ will produce an unskewed distribution:

set.seed(17)
y <- sort(exp(rnorm(32, 4, 1))) + 100

Evidently $\log(y-100)$ is drawn iid from a Normal$(4, 1)$ distribution.
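As a quick sanity check (my addition, not part of the original answer), a normal quantile plot of the recovered values should be nearly straight:

# The points should hug the reference line if log(y - 100) is Normal(4, 1).
qqnorm(log(y - 100)); qqline(log(y - 100))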

I will apply xform to three choices of $x$:

  1. Values from which $y$ differs by additive, homoscedastic residuals. In this case it would be a mistake to take the logarithm of $y$: it is a grossly incorrect model of the relationship between $x$ and $y$.

  2. Values from which $y$ differs by multiplicative lognormal residuals (more or less). In this case, taking the logarithm of $y$ is a good idea because it leads to a model for which OLS regression is appropriate.

  3. Constant values of $x$, so that in effect we are looking at $y$ outside of the regression context altogether.

In cases (1) and (2) I will plot the histogram of $y$ (to show it is highly skewed), the scatterplot of $y$ against $x$ (to exhibit the data), and the scatterplot of the transformed $y$ against $x$ with the OLS line superimposed, to see the result of the transformation. In the third case those scatterplots are meaningless, so I only report the value of $c$ found by xform.

set.seed(17)
par(mfrow=c(2,3))

# Case 1: y differs from x by additive, homoscedastic residuals.
y <- sort(exp(rnorm(32, 4, 1))) + 100
x <- y - rnorm(length(y), 0, 50)
hist(y)
plot(x,y)
# Start nlm at c = 1 - min(y), which makes min(y + c) = 1.
const <- xform(function(y,c) log(y+c), 1-min(y), x, y)$estimate
plot(x, log(y + const), ylab=sprintf("log(y + %0.1f)", const))
abline(coef(fit1<-lm(log(y+const)~x)), col="Gray")

# Case 2: y differs from x by (roughly) multiplicative lognormal residuals.
x <- log((1:length(y) + rnorm(length(y), 10, 3)))
hist(y)
plot(x,y)
const <- xform(function(y,c) log(y+c), 1-min(y), x, y)$estimate
plot(x, log(y + const), ylab=sprintf("log(y + %0.1f)", const))
abline(coef(fit2<-lm(log(y+const)~x)), col="Gray")

# Case 3: constant x, ignoring the regression altogether.
x <- rep(1, length(y))
const <- xform(function(y,c) log(y+c), 1-min(y), x, y)$estimate

[Plots: for each of cases 1 and 2, a histogram of y, a scatterplot of y against x, and a scatterplot of the transformed y against x with the OLS line]

The top row of plots is for the first case and the bottom row is for the second.

Please observe:

  • The $y$ values are identical in all three cases.

  • The $y$ values are constructed from a model in which $c=-100$.

  • The fitted value of $c$ in the first case, $1408392.5$, is essentially infinite. This indicates that taking the logarithm at all is a bad idea for these $x$ values. (Adding such a huge $c$ to $y$ before taking the logarithm barely changes the shape of the data, which is why the two scatterplots in the top row look the same.)

  • The fitted value of $c$ in the second case, $-104.0$, is close to the value of $-100$ used to generate the data. (Repeated simulations, sketched after this list, indicate that the fitted value in the second case will be biased slightly low, averaging around $-105$.)

  • The fitted value of $c$ in the third case is $-115.8$, still close to the value used to generate the data.
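A minimal sketch of such a repeated simulation, reusing xform from above (the loop itself is my own, not from the answer):

# Redraw the case-2 data many times and average the fitted offsets;
# per the answer, the mean should land near -105, slightly below -100.
ests <- replicate(200, {
  y <- sort(exp(rnorm(32, 4, 1))) + 100
  x <- log(1:length(y) + rnorm(length(y), 10, 3))
  xform(function(y, c) log(y + c), 1 - min(y), x, y)$estimate
})
mean(ests)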

If we were to use the "universal" value of $c$ found in the third case (essentially by ignoring the $x$ values), here is what the scatterplot would look like in conjunction with the $x$ values from case 1:

[Scatterplot of the case-1 data with the OLS fit based on the case-3 value of c]

For these particular $x$ values, the OLS line fit to the transformed $y$ values is a terrible description of the relationship between $y$ and $x$. Notice how it underestimates most values of $y$ but grossly overestimates a few of them for $x$ between $140$ and $200$.
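A sketch of how such a plot could be produced, assuming the script above was just run (so const holds the case-3 value); note that redrawing $x$ here will not reproduce the figure exactly, since the random-number state has advanced:

# Redraw case-1 style x values (the script overwrote x) and show the
# OLS fit of the transformed y using the case-3 offset.
# (Assumes y + const > 0 for all y, as in the answer's figure.)
x <- y - rnorm(length(y), 0, 50)
plot(x, log(y + const), ylab=sprintf("log(y + %0.1f)", const))
abline(coef(lm(log(y + const) ~ x)), col="Gray")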

In summary, if you want to transform the response variable for a regression (to achieve symmetry or linearity), you must account for the regression itself. This is because the regression only "cares" about the residuals, not the raw values of $y$. As the extraordinarily bad value of $c$ in the first case shows, ignoring this advice could produce awful results.
