Data Transformation – Techniques in Regression

*Tags: data-transformation, logistic, r, regression*

When transforming variables, do you have to use all of the same transformation? For example, can I pick and choose differently transformed variables, as in:

Let $x_1, x_2, x_3, x_4$ be age, length of employment, length of residence, and income.

$$Y = \beta_1\sqrt{x_1} + \beta_2\left(-\frac{1}{x_2}\right) + \beta_3\log(x_3)$$

Or, must you be consistent with your transforms and use all of the same? As in:

$$Y = \beta_1\log(x_1) + \beta_2\log(x_2) + \beta_3\log(x_3)$$

My understanding is that the goal of transformation is to address the problem of non-normality. Looking at histograms of each variable, we can see that they have very different distributions, which leads me to believe that the required transformations will differ on a variable-by-variable basis.

## R Code
```r
library(foreign)  # provides read.spss()
df <- read.spss(file = "http://www.bertelsen.ca/R/logistic-regression.sav",
                use.value.labels = TRUE, to.data.frame = TRUE)
par(mfrow = c(3, 3))  # base hist() takes one vector at a time
for (nm in names(df)[1:7]) hist(df[[nm]], main = nm)
```

*(Figure: histograms of the seven variables.)*

Lastly, how valid is it to transform variables using $\log(x_n + 1)$ when $x_n$ contains $0$ values? Does this transformation need to be applied consistently across all variables, or can it be used ad hoc, even for variables that contain no $0$s?

## R Code 
```r
# Scatterplot matrix of the first seven variables
plot(df[1:7])
```

*(Figure: scatterplot matrix of the seven variables.)*

## Best Answer

One transforms the dependent variable to achieve approximate symmetry and homoscedasticity of the residuals. Transformations of the independent variables have a different purpose: after all, in this regression all the independent values are taken as fixed, not random, so "normality" is inapplicable. The main objective of these transformations is to achieve linear relationships with the dependent variable (or, really, with its logit). (This objective overrides auxiliary ones such as reducing excess leverage or achieving a simple interpretation of the coefficients.)

These relationships are a property of the data and the phenomena that produced them, so you need the flexibility to choose appropriate re-expressions of each of the variables separately from the others. Specifically, not only is it not a problem to use a log, a root, and a reciprocal in the same model, it's rather common. The principle is that there is (usually) nothing special about how the data are originally expressed, so you should let the data suggest re-expressions that lead to effective, accurate, useful, and (if possible) theoretically justified models.
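To make this concrete, here is a minimal sketch of how mixed re-expressions can enter a single logistic regression in R; the names `y`, `x1`, `x2`, `x3`, and `df` are hypothetical, not from the original post:

```r
# Sketch: each predictor gets its own re-expression within one model.
# y, x1, x2, x3, and df are hypothetical names.
fit <- glm(y ~ sqrt(x1) + I(-1/x2) + log(x3),
           family = binomial, data = df)
summary(fit)
```

The `I()` wrapper tells the formula interface to treat `-1/x2` as arithmetic rather than as formula syntax.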

The histograms--which reflect the univariate distributions--often hint at an initial transformation, but are not dispositive. Accompany them with scatterplot matrices so you can examine the relationships among all the variables.
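For example, one can draw the scatterplot matrix before and after a candidate re-expression to see whether the relationships straighten out. A sketch follows; the column choices and transformations are illustrative assumptions only:

```r
# Sketch: compare relationships before and after re-expression.
# The columns and transformations here are illustrative, not prescriptive.
pairs(df[1:3], main = "Raw variables")
pairs(data.frame(sqrt_x1     = sqrt(df[[1]]),
                 neg_recip_x2 = -1 / df[[2]],
                 log_x3      = log(df[[3]])),
      main = "Re-expressed variables")
```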


Transformations like $\log(x + c)$ where $c$ is a positive constant "start value" can work--and can be indicated even when no value of $x$ is zero--but sometimes they destroy linear relationships. When this occurs, a good solution is to create two variables. One of them equals $\log(x)$ when $x$ is nonzero and otherwise is anything; it's convenient to let it default to zero. The other, let's call it $z_x$, is an indicator of whether $x$ is zero: it equals 1 when $x = 0$ and is 0 otherwise. These terms contribute a sum

$$\beta \log(x) + \beta_0 z_x$$

to the estimate. When $x \gt 0$, $z_x = 0$ so the second term drops out leaving just $\beta \log(x)$. When $x = 0$, "$\log(x)$" has been set to zero while $z_x = 1$, leaving just the value $\beta_0$. Thus, $\beta_0$ estimates the effect when $x = 0$ and otherwise $\beta$ is the coefficient of $\log(x)$.
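In R, this device might look like the following sketch, where `x`, `y`, and `df` are hypothetical names:

```r
# Sketch of the two-variable device for a predictor x that contains zeros.
# x, y, and df are hypothetical names, not from the original post.
df$log_x <- ifelse(df$x > 0, log(df$x), 0)  # log(x) where defined; the 0 default is arbitrary
df$z_x   <- as.integer(df$x == 0)           # indicator: 1 exactly when x == 0
fit <- glm(y ~ log_x + z_x, family = binomial, data = df)
coef(fit)  # the log_x coefficient estimates beta; the z_x coefficient estimates beta_0
```

Because $z_x$ absorbs the $x = 0$ cases, changing the arbitrary default in `log_x` merely shifts the estimate of $\beta_0$; the fitted values are unchanged.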
