The order of Data Centering and Data Transformation

centering, data transformation, regression

Edit: I just read a related post (How to include $x$ and $x^2$ into regression, and whether to center them?) which mentions that centering a variable creates a new variable.

However, as the comments point out, taking the logarithm of negative values doesn't make sense (stupid me for not thinking this through) so I changed the first option.


I'm working with a multiple regression in which log-transforming a few of my predictors drastically improves how well the model assumptions are met. However, this improvement is for uncentered data, and data centered on the mean would be much more interpretable.

I understand that centering data does not change the shape of the distribution (it only shifts its location), and would like to ask when I should center my data. Is there any general rule of thumb?

1] Do I center the predictor about its mean first and then search for a different transformation which improves model assumptions should they be violated?

2] Do I perform the log transformation first, then center by the mean of these log transformed values? How would this change model interpretation compared to option 1]?

Best Answer

If logarithms of predictors, generically $x$, are helpful, and centring variables on their mean is helpful, would it help to centre before transforming?

Once you have subtracted the mean from a variable, at least one value is necessarily negative (unless all values are identical, in which case they are all zero), and logarithms can't (usefully) be calculated (setting aside complex analysis).
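A minimal sketch of that problem, using only the Python standard library. The values of `x` are made up for illustration, and `safe_log` is a hypothetical helper, not part of any library:

```python
import math
from statistics import mean

# Hypothetical strictly positive predictor values (invented for illustration).
x = [1.0, 2.0, 4.0, 8.0, 16.0]

x_bar = mean(x)
centred = [xi - x_bar for xi in x]

# Every value below the mean becomes negative after centring.
n_nonpositive = sum(1 for c in centred if c <= 0)

def safe_log(value):
    """Return log(value), or None where the real logarithm is undefined."""
    try:
        return math.log(value)
    except ValueError:  # raised by math.log for zero or negative input
        return None

logs_of_centred = [safe_log(c) for c in centred]

print(centred)
print(logs_of_centred)
```

Here three of the five centred values are negative, so their logarithms come back as `None`; only the values above the mean survive the transformation.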

Even if you discard the specific suggestion of $\log(x - \text{mean of } x)$ on those grounds, the more general idea of transforming $x - \text{mean of } x$ still

  1. requires a transformation that will work with positive, zero and negative values; there are some (cube root, asinh, ...) but they won't usually help you in any situation in which logarithms are being contemplated seriously

  2. implies that the mean of untransformed data is in some sense a natural or even a convenient origin for the transformed scale, which I think is usually not the case. So it's no go generally for your [1] in my view.
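Both points can be checked on a toy example. This is only a sketch with made-up values of `x`, not a recommendation of any particular transformation: it shows that asinh and the cube root accept centred values of either sign, and that the arithmetic mean of $x$ does not land on the mean of $\log x$.

```python
import math
from statistics import mean

# Hypothetical strictly positive predictor values (invented for illustration).
x = [1.0, 2.0, 4.0, 8.0, 16.0]
x_bar = mean(x)
centred = [xi - x_bar for xi in x]

# Point 1: asinh and the cube root are defined for negative, zero and
# positive inputs, so they can be applied to centred values.
asinh_vals = [math.asinh(c) for c in centred]
cube_roots = [math.copysign(abs(c) ** (1.0 / 3.0), c) for c in centred]

# Point 2: the mean of the untransformed data is not the centre of the
# log scale. The mean of the logs corresponds to the geometric mean.
log_of_mean = math.log(x_bar)
mean_of_logs = mean([math.log(xi) for xi in x])
geometric_mean = math.exp(mean_of_logs)

print(log_of_mean, mean_of_logs, geometric_mean)
```

For these values the arithmetic mean is 6.2 while the geometric mean is 4, so centring on the raw mean and centring on the log scale pick different origins.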

By all means, centre variables, transformed or not, in presenting regression results; it's the same regression and it's a matter of convenience how you explain it. So on your [2] I don't think it changes model interpretation at all; it's just convenience whether you write about centred results.
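To see that it is the same regression, here is a small sketch with a single log-transformed predictor (a simplification of the multiple regression in the question). The data and the helper `ols` are made up for illustration:

```python
import math
from statistics import mean

def ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical data (invented for illustration).
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [2.1, 2.9, 4.2, 4.8, 6.1]

# Transform first, then centre on the mean of the transformed values.
log_x = [math.log(xi) for xi in x]
log_x_mean = mean(log_x)
log_x_centred = [v - log_x_mean for v in log_x]

a_raw, b_raw = ols(log_x, y)          # uncentred log predictor
a_cen, b_cen = ols(log_x_centred, y)  # centred log predictor

fitted_raw = [a_raw + b_raw * v for v in log_x]
fitted_cen = [a_cen + b_cen * v for v in log_x_centred]

print(b_raw, b_cen)   # slopes agree up to rounding
print(a_raw, a_cen)   # intercepts differ; the centred one equals mean(y)
```

The slope and the fitted values are unchanged by centring; only the intercept moves, becoming the predicted response at the mean of $\log x$ rather than at $\log x = 0$.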

By the way, there is no "of course" about using $\log(x+1)$ even if $x \ge 0$. That's an ad hoc fudge that some people use, especially it seems in some branches of biology. But there is no standard or accepted logic to it.