Box-Cox Transformation – Intuition Behind the Box-Cox Transform

data transformationintuition

For features that are heavily skewed, the Transformation technique is useful to stabilize variance, make the data more normal distribution-like, improve the validity of measures of association.

I am really having trouble understanding the intuition behind Box-Cox transform. I mean how to configure data transform method for both square root and log transform and estimating lambda.

Could anyone explain in simple words (and maybe with an example) what is the Intuition behind Box-Cox transform

Best Answer

The design goals of the family of Box-Cox transformations of non-negative data were these:

The formulas should be simple, straightforward, well understood, and easy to calculate.
They should not change the middle of the data much, but affect the tails more.
The family should be rich enough to induce large changes in the skewness of the data if necessary: this means it should be able to contract or extend one tail of the data while extending or contracting the other, by arbitrary amounts.

Let's consider the implications of each in turn.

1. Simplicity

Linear transformations--those of the form $x\to \alpha x + \beta$ for constants $\alpha$ and $\beta$--merely change the scale and location of data; they cannot change the shape of their distribution. The next simplest formula is to consider power transformations, of the form $x\to x^\lambda$ for (nonzero) constant $\lambda.$

2. Stability

A power transformation enjoys the nice property that rescaling the data results in rescaling their powers. That is, multiplying the data $x$ by some positive scale factor $\alpha$ results in multiplying $x^\lambda$ by $\alpha^\lambda.$ OK, it's not the same scale factor, but it is still just a rescaling.

In light of this, let's always standardize any batch of data $(x_1, x_2, \ldots, x_n)$ by rescaling it to place its center (perhaps its median) at $1.$ Specifically, this replaces each $x_i$ by $x_i$ divided by the middle value of all the $x$'s. This won't change the shape of the data distribution--it really amounts to choosing a suitable unit of measurement for expressing the values. For those who like formulas, let $\mu$ be the median of the batch. We will be studying the transformations

$$x \to \frac{(x/\mu)^\lambda - 1}{\lambda} = \frac{\mu^{-\lambda}}{\lambda}\,x^\lambda + \frac{-1}{\lambda} = \alpha\, x^\lambda + \beta$$

for various $\lambda.$ The effects of $\alpha$ and $\beta$ (which depend on $\lambda$ and $\mu$) on $x^\lambda$ do not change the shape of the distribution of the $x_i^\lambda.$ In this sense, the Box-Cox transformations of the standardized data really are just the power transformations.

Because we have made $1$ the central value of the batch, design crition 2--"stability"--requires that different values of the power $\lambda$ have relatively little effect on values near $1.$

Let's look at this in a little more detail by examining what a power does to numbers near $1.$ According to the Binomial Theorem, if we write $x$ as $x=1+\epsilon$ (for fairly small $\epsilon$), then approximately

$$(1 + \epsilon)^\lambda = 1 + \lambda \epsilon + \text{Something}\times \epsilon^2.$$

Ignoring $\epsilon^2$ as being truly tiny, this tells us that

Taking a power $\lambda$ of a number $x$ near $1$ is a nearly linear function that changes the distance between $x$ and $1$ by a factor $\lambda.$

In light of this, we can match the effects of different possible $\lambda$ by means of a compensating division of the distance by $\lambda.$ That is, we will use

$$\operatorname{BC}_\lambda(x) = \frac{x^\lambda - 1^\lambda}{\lambda} = \frac{x^\lambda - 1}{\lambda}.$$

The numerator is the (signed) distance between the power transform of $x$ and the power transform of the middle of the data ($1$); the denominator adjusts for the expansion of $x-1$ by the factor $\lambda$ when taking the power. $\operatorname{BC}_\lambda$ is the Box-Cox transformation with parameter $\lambda.$

By means of this construction, we guarantee that when $x$ is close to a typical value of its batch of data, $\operatorname{BC}_\lambda(x)$ will approximately be the same value (and close to zero) no matter what $\lambda$ might be (within reason, of course: extreme values of $\lambda$ can do extreme things).

3. Flexibility

We have many possible values of $\lambda$ to choose from. How do they differ?

This can be explored by graphing the Box-Cox transformations for various $\lambda.$ Here is a set of graphs for $\lambda \in \{-1,-1/2, 0, 1/2, 1, 2\}.$ (For the meaning of $\lambda=0,$ see Natural Log Approximation elsewhere on this site.)

The solid black line graphs the Box-Cox transformation for $\lambda=1,$ which is just $x\to x-1.$ It merely shifts the center of the batch to $0$ (as do all the Box-Cox transformations). The upward curving pink graph is for $\lambda=2.$ The downward curving graphs show, in order of increasing curvature, the smaller values of $\lambda$ down to $-1.$

The differing amounts and directions of curvature provide the desired flexibility to change the shape of a batch of data.

For instance, the upward curving graph for $\lambda=2$ exemplifies the effect of all Box-Cox transformations with $\lambda$ exceeding $1:$ values of $x$ above $1$ (that is, greater than the middle of the batch, and therefore out in its upper tail) are pulled further and further away from the new middle (at $0$). Values of $x$ below $1$ (less than the middle of the batch, and therefore out in its lower tail) are pushed closer to the new middle. This "skews" the data to the right, or high values (rather strongly, even for $\lambda=2$).

The downward curving graphs, for $\lambda \lt 1,$ have the opposite effect: they push the higher values in the batch towards the new middle and pull the lower values away from the new middle. This skews the data to the left (or lower values).

The coincidence of all the graphs near the point $(1,0)$ is a result of the previous standardizations: it constitutes visual verification that choice of $\lambda$ makes little difference for values near the middle of the batch.

Finally, let's look at what different Box-Cox transformations do to a small batch of data.

Transformed values are indicated by the horizontal positions. (The original data look just like the black dots, shown at $\lambda=1,$ but are located $+1$ units to the right.) The colors correspond to the ones used in the first figure. The underlying gray lines show what happens to the transformed values when $\lambda$ is smoothly varied from $-1$ to $+2.$ It's another way of appreciating the effects of these transformations in the tails of the data. (It also shows why the value of $\lambda=0$ makes sense: it corresponds to taking values of $\lambda$ arbitrarily close to $0.$)

Best Answer

1. Simplicity

2. Stability

3. Flexibility

Related Solutions

Solved – T-test / ANOVA on Box-Cox transformed non-normal data

Solved – Confusion related to which transformation to use

Related Question