Solved – Confusion related to which transformation to use

data mining, data transformation

I'm confused about which transformation to use on my data. The histogram of my original data looks like this:

[Image: histogram of the original data]

I have seen in many places that a log transformation is recommended when the data are positively skewed. But when I take a log transformation I get something like this, which is negatively skewed and not what I want:

[Image: histogram of the log-transformed data]

If I take a square root or a cube root transformation instead, I get this:

[Image: histogram of the square-root-transformed data]
[Image: histogram of the cube-root-transformed data]

Now the data are pretty close to normal. But I don't understand the intuition behind this. Why didn't the log transformation work, when so many sources say it is useful for positively skewed data? Here the square root and cube root worked.

I want to know under which conditions we should use a log transformation and under which a cube root transformation. Suggestions?

Best Answer

It's not clear from your question why you need to transform at all. (What are you trying to achieve and why?)

As for why logs make the appearance more symmetric in some cases and not others: not all distributions are the same. While a log transformation may sometimes make skewed data nearly symmetric, there's no guarantee that it always will.

Often other transformations do much better.

For example, logs work very nicely on lognormal distributions, while cube roots do better on gamma distributions. Below, $a$ is simulated from a lognormal distribution and $b$ from a gamma distribution. They look vaguely similar, but the log transform makes $a$ symmetric (in fact, normal) while making $b$ left-skewed. On the other hand, a cube root transformation leaves $a$ still somewhat right-skewed but makes $b$ very nearly symmetric (and pretty close to normal):

[Figure: log vs cube root, lognormal vs gamma]
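
If you want to check this numerically rather than by eye, here is a minimal sketch using only the Python standard library. The parameters (lognormal with sigma 0.5, gamma with shape 2), the seed, and the sample size are my assumptions for illustration, not the settings behind the plot, and `skewness` is a hypothetical helper computing the usual moment-based sample skewness.

```python
import math
import random


def skewness(xs):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


rng = random.Random(0)  # arbitrary seed, only for reproducibility
n = 20000

# Assumed parameters; the settings of the original simulation are not given.
a = [rng.lognormvariate(0.0, 0.5) for _ in range(n)]  # lognormal sample
b = [rng.gammavariate(2.0, 1.0) for _ in range(n)]    # gamma sample, shape 2, scale 1

results = {}
for name, xs in (("a", a), ("b", b)):
    results[name] = {
        "raw": skewness(xs),
        "log": skewness([math.log(x) for x in xs]),
        "cube root": skewness([x ** (1 / 3) for x in xs]),
    }

for name, row in results.items():
    print(name, {k: round(v, 2) for k, v in row.items()})
```

Positive values indicate right skew and negative values left skew. Under these assumptions the log should bring $a$ close to zero skewness while pushing $b$ clearly negative, and the cube root should do roughly the opposite: $a$ stays positive and $b$ lands near zero.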

Other times there's simply no monotonic transformation that achieves approximate symmetry. For example, if your distribution is discrete and sufficiently skew, like a geometric(0.5) or a Poisson(0.5), no monotonic transformation can make it reasonably normal: wherever the transformation puts the spikes, the leftmost one will always be taller than the next.
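
To see the discrete case concretely, the sketch below computes the first few probabilities for a Poisson(0.5) and a geometric(0.5). I am assuming the geometric counts failures before the first success (support 0, 1, 2, ...), and I use a square root as the example transformation because the log is undefined at 0. The function names are mine.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def geometric_pmf(k, p):
    """P(X = k) failures before the first success (assumed parameterization)."""
    return p * (1 - p) ** k


ks = list(range(6))
poisson = [poisson_pmf(k, 0.5) for k in ks]
geometric = [geometric_pmf(k, 0.5) for k in ks]

# A strictly increasing transformation relocates each spike but keeps the
# order of the spikes and leaves every probability unchanged.
positions = [math.sqrt(k) for k in ks]

for k, x, pp, pg in zip(ks, positions, poisson, geometric):
    print(f"k={k}  sqrt(k)={x:.3f}  Poisson={pp:.4f}  geometric={pg:.4f}")
```

The transformation only changes the `positions` column. The probabilities attached to each point stay the same, so the largest mass remains on the leftmost point whatever monotonic function you pick.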

Incidentally, you might want to use more bars on your histograms, and maybe consider using other displays as well, to get a handle on the distributional shape. See my cautionary tale.
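
As a rough illustration of how much the number of bars matters, this sketch bins the same simulated sample with a few bars and with many. The gamma sample, the bin counts (5 and 30), and the helper `bin_counts` are all assumptions for illustration and have nothing to do with your actual data.

```python
import random


def bin_counts(xs, bins):
    """Count observations in `bins` equal-width bins spanning min(xs) to max(xs)."""
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in xs:
        i = min(int((x - lo) / width), bins - 1)  # the maximum goes in the last bin
        counts[i] += 1
    return counts


rng = random.Random(1)  # arbitrary seed
sample = [rng.gammavariate(2.0, 1.0) for _ in range(2000)]  # assumed example data

coarse = bin_counts(sample, 5)
fine = bin_counts(sample, 30)

for label, counts in (("5 bins", coarse), ("30 bins", fine)):
    print(label)
    for c in counts:
        print("  " + "#" * (c // 10))  # crude text bar, one mark per 10 observations
```

With only a handful of bars, details such as where the mode sits are mostly hidden. The finer binning shows more of the shape, at the cost of some noise in the individual bars.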