Solved – Life after the Box-Cox transformation

data transformation, normal distribution

Suppose we have a set of measurements of some quantity in some units of measurement. We also have a nice model that relies heavily on the properties of the Gaussian distribution. The model is tailored to data in specific units with a physical meaning behind them (watts, ohms, etc.). It turns out that the data do not exactly follow a normal distribution and have some undesirable features (such as skewness). We apply the popular Box-Cox transformation and obtain a more or less normally distributed data set. The problem now is that we have logarithms, powers, etc. of the original measurements, which conflicts with our nice model.

The question is: what can one do in such a situation? Do I need to change the model so that it can handle the new data? And in general, if I have understood everything correctly, why do people want to study transformed data that have lost their physical meaning? At the end of the day, one will probably have to return to the original units of measurement.
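For readers who want the transformation spelled out, here is a minimal R sketch of the one-parameter Box-Cox transform and its inverse, which is what "returning to the original units" amounts to for individual values. The function names `box_cox` and `inv_box_cox` and the sample values are made up for this illustration; the sketch assumes strictly positive data and a known `lambda`.

```r
# One-parameter Box-Cox transform for strictly positive y:
#   (y^lambda - 1) / lambda  if lambda != 0,  log(y)  if lambda == 0
box_cox <- function(y, lambda) {
  stopifnot(all(y > 0))
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Inverse transform: maps transformed values back to the original units
inv_box_cox <- function(z, lambda) {
  if (abs(lambda) < 1e-8) exp(z) else (lambda * z + 1)^(1 / lambda)
}

y <- c(0.5, 1, 2, 4, 8)          # illustrative measurements
z <- box_cox(y, lambda = 0.5)    # values on the transformed scale
y_back <- inv_box_cox(z, lambda = 0.5)
```

Individual values survive the round trip, but summaries such as the mean generally do not: the back-transformed mean of `z` is not the mean of `y`.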

Best Answer

First of all, if you mean a linear regression model, it does not assume that the data are normally distributed; it assumes that the errors, as estimated by the residuals, are normally distributed (in fact, they should be iid $\mathcal{N}(0,\sigma^2)$).

Second, if that assumption is violated and you want to keep your original units, you can use some other form of regression - there are a variety of robust regression models, loess models, spline models, etc.
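To make the first point concrete, here is a small simulated sketch in base R. The data, the seed and the helper `skew` are assumptions made for this illustration only. The response is clearly skewed because the predictor is skewed, yet the residuals of the linear fit look symmetric, and the residuals are the quantity the normality assumption concerns.

```r
set.seed(42)
n <- 500
x <- rexp(n)                 # strongly right-skewed predictor
y <- 1 + 2 * x + rnorm(n)    # errors are iid normal by construction

fit <- lm(y ~ x)
res <- residuals(fit)

# Simple moment-based sample skewness (illustrative helper)
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_y   <- skew(y)      # skewness of the raw response
skew_res <- skew(res)    # skewness of the residuals

# Graphical checks one would normally look at:
# hist(y); qqnorm(res); qqline(res)
```

If the residuals themselves were skewed or heavy-tailed, that would be the point at which to consider the alternatives mentioned above.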

Related Solutions

Data Transformation – How to Express Answers in Terms of Original Units in Box-Cox Transformed Data

If you want inferences specifically about the mean of the original variable, then don't use a Box-Cox transformation. IMO Box-Cox transformations are most useful when the transformed variable has its own interpretation and the transformation simply helps you find the right scale for analysis; this turns out to be the case surprisingly often. Two unexpected exponents that I found this way were 1/3 (when the response variable was bladder volume) and -1 (when the response variable was breaths per minute).

The log-transformation is probably the only exception to this. The mean on the log-scale corresponds to the geometric mean in the original scale, which is at least a well-defined quantity.
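A tiny numerical sketch of this point, with made-up values: back-transforming the mean of the logs gives the geometric mean, which differs from the arithmetic mean of the original data.

```r
x <- c(1, 10, 100)            # illustrative data on the original scale

mean_log   <- mean(log(x))    # mean on the log scale
geo_mean   <- exp(mean_log)   # back-transformed: the geometric mean of x
arith_mean <- mean(x)         # arithmetic mean on the original scale
```

Here the geometric mean is 10 (up to floating-point error) while the arithmetic mean is 37, so exponentiating a log-scale mean should not be reported as "the mean" in the original units.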

Data Transformation – Methods to Increase Kurtosis and Skewness of Normal Random Variables

This can be done using the sinh-arcsinh transformation from

Jones, M. C. and Pewsey, A. (2009). Sinh-arcsinh distributions. Biometrika 96: 761–780.

The transformation is defined as

$$H(x;\epsilon,\delta)=\sinh[\delta\sinh^{-1}(x)-\epsilon], \tag{$\star$}$$

where $\epsilon \in{\mathbb R}$ and $\delta \in {\mathbb R}_+$. When this transformation is applied to the normal CDF $S(x;\epsilon,\delta)=\Phi[H(x;\epsilon,\delta)]$, it produces a unimodal distribution whose parameters $(\epsilon,\delta)$ control skewness and kurtosis, respectively (Jones and Pewsey, 2009), in the sense of van Zwet (1969). In addition, if $\epsilon=0$ and $\delta=1$, we obtain the original normal distribution. See the following R code.

# Density of the sinh-arcsinh distribution: dnorm(H(x)) times dH/dx
fs <- function(x, epsilon, delta) {
  h <- delta * asinh(x) - epsilon
  dnorm(sinh(h)) * delta * cosh(h) / sqrt(1 + x^2)
}

# (i) Vary epsilon (skewness) with delta = 1
vec <- seq(-15, 15, 0.001)
plot(vec, fs(vec, 0, 1), type = "l")
lines(vec, fs(vec, 1, 1), col = "red")
lines(vec, fs(vec, 2, 1), col = "blue")
lines(vec, fs(vec, -1, 1), col = "red")
lines(vec, fs(vec, -2, 1), col = "blue")

# (ii) Vary delta (kurtosis) with epsilon = 0
vec <- seq(-5, 5, 0.001)
plot(vec, fs(vec, 0, 0.5), type = "l", ylim = c(0, 1))
lines(vec, fs(vec, 0, 0.75), col = "red")
lines(vec, fs(vec, 0, 1), col = "blue")
lines(vec, fs(vec, 0, 1.25), col = "red")
lines(vec, fs(vec, 0, 1.5), col = "blue")

Therefore, by choosing an appropriate sequence of parameters $(\epsilon_n,\delta_n)$, you can generate a sequence of distributions/transformations with different levels of skewness and kurtosis and make them look as similar or as different to the normal distribution as you want.
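As a quick sanity check on the density used in the plotting code, the following sketch redefines `fs` so that it is self-contained and verifies numerically that $(\epsilon,\delta)=(0,1)$ recovers the standard normal density and that other parameter choices still integrate to one. The grid and the parameter values are arbitrary choices made for this illustration.

```r
# Sinh-arcsinh density, same formula as in the plotting code
fs <- function(x, epsilon, delta) {
  h <- delta * asinh(x) - epsilon
  dnorm(sinh(h)) * delta * cosh(h) / sqrt(1 + x^2)
}

# epsilon = 0, delta = 1 should reproduce the standard normal density
x <- seq(-3, 3, by = 0.5)
same_as_normal <- isTRUE(all.equal(fs(x, 0, 1), dnorm(x)))

# Total area under the density for a skewed and a heavy-tailed member
area_skewed <- integrate(fs, -Inf, Inf, epsilon = 1, delta = 1)$value
area_heavy  <- integrate(fs, -Inf, Inf, epsilon = 0, delta = 0.5)$value
```

Both areas should come out numerically close to 1, which is what one expects from a change of variables applied to a normal density.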

The following plots show the output of the R code for (i) $\epsilon=(-2,-1,0,1,2)$ and $\delta=1$, and (ii) $\epsilon=0$ and $\delta=(0.5,0.75,1,1.25,1.5)$.

[Plot (i): densities for $\epsilon=(-2,-1,0,1,2)$ with $\delta=1$]

[Plot (ii): densities for $\epsilon=0$ with $\delta=(0.5,0.75,1,1.25,1.5)$]

Simulating from this distribution is straightforward: you just have to transform a normal sample using the inverse of $(\star)$:

$$H^{-1}(x;\epsilon,\delta)=\sinh[\delta^{-1}(\sinh^{-1}(x)+\epsilon)]$$
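A minimal R sketch of that simulation recipe follows. The names `H`, `H_inv` and `rsas`, the seed, and the parameter values are assumptions made for this illustration, not part of any package.

```r
# H from (star) and its inverse
H     <- function(x, epsilon, delta) sinh(delta * asinh(x) - epsilon)
H_inv <- function(x, epsilon, delta) sinh((asinh(x) + epsilon) / delta)

# Hypothetical helper: draw n values by pushing a standard normal
# sample through the inverse transformation
rsas <- function(n, epsilon, delta) H_inv(rnorm(n), epsilon, delta)

set.seed(123)
x_skew <- rsas(10000, epsilon = 1, delta = 1)

# Round trip: applying H after H_inv returns the input
z <- c(-2, -0.5, 0, 0.5, 2)
round_trip <- H(H_inv(z, 1, 0.75), 1, 0.75)

# hist(x_skew, breaks = 100)   # optional visual check
```

Because the transformation is monotone, the median of the simulated variable is $H^{-1}(0;\epsilon,\delta)=\sinh(\epsilon/\delta)$, which gives an easy check on the sample.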
