Transforming Data: All Variables or Just the Non-Normal Ones?

Tags: data-transformation, normal-distribution

In Discovering Statistics Using SPSS, Andy Field states that all variables have to be transformed.

However, in the publication "Examining spatially varying relationships between land use and water quality using geographically weighted regression I: Model design and evaluation", the authors specifically state that only the non-normal variables were transformed.

Is this analysis-specific? For instance, in a comparison of means, comparing logs to raw data would obviously yield a significant difference, whereas when using something like regression to investigate the relationship between variables it becomes less important.

Edit: here is the full-text page from the "Data Transformation" section:

And here is the link to the paper:
http://www.sciencedirect.com/science/article/pii/S0048969708009121

Best Answer

You quote several pieces of advice, all of which are no doubt intended helpfully, but it is difficult to find much merit in any of them.

In each case I rely totally on what you cite as a summary. In the authors' defence I would like to believe that they add appropriate qualifications in surrounding or other material. (Full bibliographic references in usual name(s), date, title, (publisher, place) or (journal title, volume, pages) format would enhance the question.)

Field

This advice is intended helpfully, but is at best vastly oversimplified. Field's advice seems to be intended generally, although the reference to Levene's test implies some temporary focus on analysis of variance.

For example, suppose I have one predictor that on various grounds should be logged and another that is an indicator variable taking values $(1, 0)$. The latter (a) cannot be logged, as $\log 0$ is undefined, and (b) should not be logged. (Indeed, transforming an indicator variable to any other two distinct values has no important effect.)

More generally, it is common -- in many fields the usual situation -- that some predictors should be transformed and the rest left as is.
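
To make the point concrete, here is a minimal sketch in Python (hypothetical simulated data; assuming numpy and statsmodels are available) of a model in which a skewed predictor is logged while an indicator predictor is left exactly as is:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
n = 200
income = rng.lognormal(mean=10.0, sigma=0.5, size=n)  # skewed, positive predictor
urban = rng.integers(0, 2, size=n)                    # (1, 0) indicator

# Hypothetical data-generating process: linear in log(income) and in the indicator.
y = 2.0 * np.log(income) + 1.5 * urban + rng.normal(0.0, 1.0, size=n)

# Log the skewed predictor; leave the indicator untouched.
X = sm.add_constant(np.column_stack([np.log(income), urban]))
print(sm.OLS(y, X).fit().params)  # roughly (0, 2.0, 1.5)
```

The mix of transformations here is not arbitrary: each predictor is handled on its own merits.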

It's true that encountering in a paper or dissertation a mixture of transformations applied differently to different predictors (including as a special case, identity transformation, or leaving as is) is often a matter of concern for a reader. Is the mix a well thought out set of choices, or was it arbitrary and capricious?

Furthermore, in a series of studies, consistency of approach (always applying logarithms to a response, or never doing so) aids enormously in comparing results, and a differing approach makes comparison more difficult.

But that's not to say there could never be reasons for a mix of transformations.

I don't see that most of the section you cite has much bearing on the key advice you highlight in yellow. This in itself is a matter of concern: it's a strange business to announce an absolute rule and then not really to explain it. Conversely, the injunction "Remember" suggests that Field's grounds were supplied earlier in the book.

Anonymous paper

The context here is regression models. As so often, talking of OLS strangely emphasises the estimation method rather than the model, but we can understand what is intended. GWR I construe as geographically weighted regression.

The argument here is that you should transform non-normal predictors and leave the others as is. Again, this raises a question about what you can and should do with indicator variables, which cannot be normally distributed (which as above can be answered by pointing out that non-normality in that case is not a problem). But the injunction has it backwards in implying that it's non-normality of predictors that is the problem. Not so; it's no part of regression modelling to assume anything about marginal distributions of the predictors.
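
A minimal sketch (hypothetical simulated data again; assuming numpy, scipy and statsmodels) makes the point: a predictor can be as non-normal as you like, and so long as the functional form is right the regression is perfectly well behaved:

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)
x = rng.exponential(scale=2.0, size=500)            # strongly non-normal predictor
y = 3.0 + 0.5 * x + rng.normal(0.0, 1.0, size=500)  # but the relationship is linear

fit = sm.OLS(y, sm.add_constant(x)).fit()
w_x, p_x = stats.shapiro(x)          # tiny p-value: the predictor is far from normal
w_r, p_r = stats.shapiro(fit.resid)  # unremarkable: the residuals show no trouble
print(p_x, p_r)
```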

In practice, if you make predictors more nearly normal, then you will often be applying transformations that make the functional form $X\beta$ more nearly right for the data, which I would assert to be the major reason for transformation, despite the enormous emphasis on error structure in many texts. In other words, logging predictors to get them closer to normality can be doing the right thing for the wrong reason if you get closer to linearity in the transformed space.
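
Here is a sketch of that situation (hypothetical simulated data): the response is linear in $\log x$, so logging the predictor repairs the functional form, whatever it does for normality:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.uniform(1.0, 100.0, size=300)
y = 4.0 + 2.0 * np.log(x) + rng.normal(0.0, 0.5, size=300)

raw = sm.OLS(y, sm.add_constant(x)).fit()
logged = sm.OLS(y, sm.add_constant(np.log(x))).fit()
print(raw.rsquared, logged.rsquared)  # the logged form fits markedly better
```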

There is so much extraordinarily good advice on transformations in this forum that I have focused on discussing what you cite.

P.S. You add a statement starting "For instance, in a comparison of means, comparing logs to raw data would obviously yield a significant difference." I am not clear what you have in mind, but comparing values for one group with logarithms of values for another group would just be nonsensical. I don't understand the rest of your statement at all.
