Solved – The order of Data Centering and Data Transformation

centering, data transformation, regression

Edit: I just read a related post (How to include $x$ and $x^2$ into regression, and whether to center them?) which mentions that centering a variable creates a new variable.

However, as the comments point out, taking the logarithm of negative values doesn't make sense (stupid me for not thinking this through) so I changed the first option.

I'm working with a multiple regression where log transforming a few of my predictors drastically improves the model assumptions. However, this improvement is for un-centered data, and data centered on the mean would be much more interpretable.

I understand that centering data does not affect the distribution (it only shifts the mean), and would like to ask when I should center my data. Is there any general rule of thumb?

1] Do I center the predictor about its mean first and then search for a different transformation which improves model assumptions should they be violated?

2] Do I perform the log transformation first, then center by the mean of these log transformed values? How would this change model interpretation compared to option 1]?

Best Answer

If logarithms of predictors, generically $x$, are helpful, and centring variables on their mean is helpful, would it help to centre before transforming?

Once you have subtracted the mean from a variable, then necessarily at least one value is now negative and logarithms can't (usefully) be calculated (setting aside complex analysis).
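A small numerical illustration of that point, as a sketch only: the array `x` below is made up for the purpose and is not data from the question. Any non-constant variable behaves the same way.

```python
import numpy as np

# Hypothetical, strictly positive predictor values (illustration only).
x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])

# Centre on the mean: values below the mean become negative.
centred = x - x.mean()
has_negative = bool((centred < 0).any())

# The logarithm of a negative number is undefined in the reals; NumPy returns NaN.
with np.errstate(invalid="ignore", divide="ignore"):
    logged = np.log(centred)
n_undefined = int(np.isnan(logged).sum())

print(centred)
print(has_negative, n_undefined)
```

Here every value below the mean of `x` ends up without a usable logarithm.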

Even if you discard the specific suggestion of $\log(x - \text{mean of } x)$ on those grounds, the more general idea of transforming $(x - \text{mean of } x)$ still

(a) requires a transformation that will work with positive, zero and negative values; there are some (cube root, asinh, ...) but they won't usually help you in any situation in which logarithms are being contemplated seriously;

(b) implies that the mean of untransformed data is in some sense a natural or even a convenient origin for the transformed scale, which I think is usually not the case.

So it's no go generally for your [1] in my view.

By all means, centre variables, transformed or not, in presenting regression results; it's the same regression and it's a matter of convenience how you explain it. So on your [2] I don't think it changes model interpretation at all; it's just convenience whether you write about centred results.
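To make "it's the same regression" concrete, here is a minimal sketch on simulated data (the data-generating values and the helper `ols` are assumptions for illustration). It fits $y$ on $\log x$ and on mean-centred $\log x$ and compares the two fits.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated positive predictor and a response that is linear in log(x).
x = rng.lognormal(mean=1.0, sigma=0.8, size=200)
y = 2.0 + 1.5 * np.log(x) + rng.normal(scale=0.3, size=200)

def ols(predictor, response):
    """Least-squares fit with an intercept; returns (coefficients, fitted values)."""
    X = np.column_stack([np.ones_like(predictor), predictor])
    coef, *_ = np.linalg.lstsq(X, response, rcond=None)
    return coef, X @ coef

log_x = np.log(x)
coef_raw, fit_raw = ols(log_x, y)                 # transform only
coef_cen, fit_cen = ols(log_x - log_x.mean(), y)  # transform, then centre

print("slopes:    ", coef_raw[1], coef_cen[1])
print("intercepts:", coef_raw[0], coef_cen[0])
```

The slopes and fitted values agree; only the intercept moves, and after centring it is the predicted $y$ at the mean of $\log x$ (which, with an intercept in the model, equals the mean of $y$).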

By the way, there is no "of course" about using $\log(x+1)$ even if $x \ge 0$. That's an ad hoc fudge that some people use, especially it seems in some branches of biology. But there is no standard or accepted logic to it.

Related Solutions

Solved – Why could centering independent variables change the main effects with moderation

In models with no interaction terms (that is, with no terms that are constructed as the product of other terms), each variable's regression coefficient is the slope of the regression surface in the direction of that variable. It is constant, regardless of the values of the variables, and therefore can be said to measure the overall effect of that variable.

In models with interactions, this interpretation can be made without further qualification only for those variables that are not involved in any interactions. For a variable that is involved in interactions, the "main-effect" regression coefficient -- that is, the regression coefficient of the variable by itself -- is the slope of the regression surface in the direction of that variable when all other variables that interact with that variable have values of zero, and the significance test of the coefficient refers to the slope of the regression surface only in that region of the predictor space. Since there is no requirement that there actually be data in that region of the space, the main-effect coefficient may bear little resemblance to the slope of the regression surface in the region of the predictor space where data were actually observed.

In anova terms, the main-effect coefficient is analogous to a simple main effect, not an overall main effect. Moreover, it may refer to what in an anova design would be empty cells in which the data were supplied by extrapolating from cells with data.

For a measure of the overall effect of the variable that is analogous to an overall main effect in anova and does not extrapolate beyond the region in which data were observed, we must look at the average slope of the regression surface in the direction of the variable, where the averaging is over the N cases that were actually observed. This average slope can be expressed as a weighted sum of the regression coefficients of all the terms in the model that involve the variable in question.

The weights are awkward to describe but easy to get. A variable's main-effect coefficient always gets a weight of 1. For each other coefficient of a term involving that variable, the weight is the mean of the product of the other variables in that term. For example, if we have five "raw" variables x1, x2, x3, x4, x5, plus four two-way interactions (x1,x2), (x1,x3), (x2,x3), (x4,x5), and one three-way interaction (x1,x2,x3), then the model is

y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4 + b5*x5 +
    b12*x1*x2 + b13*x1*x3 + b23*x2*x3 + b45*x4*x5 +
    b123*x1*x2*x3 + e

and the overall main effects are

B1 = b1 + b12*M[x2] + b13*M[x3] + b123*M[x2*x3],

B2 = b2 + b12*M[x1] + b23*M[x3] + b123*M[x1*x3],

B3 = b3 + b13*M[x1] + b23*M[x2] + b123*M[x1*x2],

B4 = b4 + b45*M[x5],

B5 = b5 + b45*M[x4],

where M[.] denotes the sample mean of the quantity inside the brackets. All the product terms inside the brackets are among those that were constructed in order to do the regression, so a regression program should already know about them and should be able to print their means on request.
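As a rough sketch of the B1 formula, the snippet below simulates three predictors (the sample size, means and true coefficients are arbitrary assumptions), fits the sub-model involving x1, x2, x3 with all their interactions, and checks that B1 equals the slope in the x1 direction averaged over the observed cases.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Simulated predictors with non-zero means (illustration only).
x1, x2, x3 = rng.normal(loc=[2.0, -1.0, 3.0], scale=1.0, size=(n, 3)).T
y = (1.0 + 0.5 * x1 - 0.3 * x2 + 0.8 * x3
     + 0.4 * x1 * x2 - 0.2 * x1 * x3 + 0.1 * x2 * x3
     + 0.05 * x1 * x2 * x3 + rng.normal(scale=0.5, size=n))

# Design matrix with every product term built explicitly.
names = ["b0", "b1", "b2", "b3", "b12", "b13", "b23", "b123"]
X = np.column_stack([np.ones(n), x1, x2, x3,
                     x1 * x2, x1 * x3, x2 * x3, x1 * x2 * x3])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
b = dict(zip(names, coef))

# Overall effect of x1: weight 1 on b1, means of the "other" factors elsewhere.
B1 = (b["b1"] + b["b12"] * x2.mean() + b["b13"] * x3.mean()
      + b["b123"] * (x2 * x3).mean())

# Slope of the fitted surface in the x1 direction, case by case, then averaged.
slope_per_case = b["b1"] + b["b12"] * x2 + b["b13"] * x3 + b["b123"] * x2 * x3
avg_slope = slope_per_case.mean()

print(B1, avg_slope)
```

The two numbers coincide because the per-case slope is linear in the quantities being averaged.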

In models that have only main effects and two-way interactions, there is a simpler way to get the overall effects: center[1] the raw variables at their means. This is to be done prior to computing the product terms, and is not to be done to the products. Then all the M[.] expressions will become 0, and the regression coefficients will be interpretable as overall effects. The values of the b's will change; the values of the B's will not. Only the variables that are involved in interactions need to be centered, but there is usually no harm in centering other measured variables. The general effect of centering a variable is that, in addition to changing the intercept, it changes only the coefficients of other variables that interact with the centered variable. In particular, it does not change the coefficients of any terms that involve the centered variable. In the example given above, centering x1 would change b0, b2, b3, and b23.
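A minimal sketch of the two-way case, again on simulated data with arbitrary assumed coefficients: centring x1 and x2 before forming the product makes the fitted main-effect coefficients equal to the B's computed from the uncentred fit, while the interaction coefficient is unchanged.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 400

# Simulated predictors whose means are far from zero (illustration only).
x1 = rng.normal(loc=5.0, scale=2.0, size=n)
x2 = rng.normal(loc=-3.0, scale=1.0, size=n)
y = 1.0 + 0.7 * x1 - 0.4 * x2 + 0.3 * x1 * x2 + rng.normal(scale=0.5, size=n)

def fit(u, v, response):
    """Fit response ~ 1 + u + v + u*v; the product is formed from u and v as given."""
    X = np.column_stack([np.ones_like(u), u, v, u * v])
    coef, *_ = np.linalg.lstsq(X, response, rcond=None)
    return coef

# Uncentred fit, then the overall effects from the weighted-sum formula.
b0, b1, b2, b12 = fit(x1, x2, y)
B1 = b1 + b12 * x2.mean()
B2 = b2 + b12 * x1.mean()

# Centre first, then build the product inside fit().
c0, c1, c2, c12 = fit(x1 - x1.mean(), x2 - x2.mean(), y)

print("B1 vs centred b1:", B1, c1)
print("B2 vs centred b2:", B2, c2)
print("b12 before/after:", b12, c12)
```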

[1 -- "Centering" is used by different people in ways that differ just enough to cause confusion. As used here, "centering a variable at #" means subtracting # from all the scores on the variable, converting the original scores to deviations from #.]

So why not always center at the means, routinely? Three reasons. First, the main-effect coefficients of the uncentered variables may themselves be of interest. Centering in such cases would be counter-productive, since it changes the main-effect coefficients of other variables.

Second, centering will make all the M[.] expressions 0, and thus convert simple effects to overall effects, only in models with no three-way or higher interactions. If the model contains such interactions then the b -> B computations must still be done, even if all the variables are centered at their means.
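The reason is that the mean of a product of two centred variables is their sample covariance, which is generally not zero. A sketch on simulated data (coefficients and the correlation between x2 and x3 are assumptions chosen to make the effect visible):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500

# x3 is deliberately correlated with x2 (illustration only).
x1 = rng.normal(loc=2.0, size=n)
x2 = rng.normal(loc=-1.0, size=n)
x3 = 3.0 + 0.6 * x2 + rng.normal(scale=0.8, size=n)
y = (1.0 + 0.5 * x1 - 0.3 * x2 + 0.8 * x3
     + 0.4 * x1 * x2 - 0.2 * x1 * x3 + 0.1 * x2 * x3
     + 0.5 * x1 * x2 * x3 + rng.normal(scale=0.5, size=n))

def fit(a, b, c, response):
    """Full three-way model; coefficient order: b0, b1, b2, b3, b12, b13, b23, b123."""
    X = np.column_stack([np.ones_like(a), a, b, c,
                         a * b, a * c, b * c, a * b * c])
    coef, *_ = np.linalg.lstsq(X, response, rcond=None)
    return coef

def overall_effect_1(coef, b, c):
    """B1 = b1 + b12*M[b] + b13*M[c] + b123*M[b*c]."""
    return (coef[1] + coef[4] * b.mean() + coef[5] * c.mean()
            + coef[7] * (b * c).mean())

raw = fit(x1, x2, x3, y)
c1, c2, c3 = x1 - x1.mean(), x2 - x2.mean(), x3 - x3.mean()
cen = fit(c1, c2, c3, y)

m23 = (c2 * c3).mean()                 # sample covariance of x2 and x3 (divisor n)
B1_raw = overall_effect_1(raw, x2, x3)
B1_cen = overall_effect_1(cen, c2, c3)

print("M[c2*c3]:", m23)
print("B1 (raw fit), B1 (centred fit):", B1_raw, B1_cen)
print("centred b1 alone:", cen[1])
```

B1 is the same from either fit, but the centred b1 on its own differs from it by b123 times the covariance of x2 and x3.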

Third, centering at a value such as the mean, that is defined by the distribution of the predictors as opposed to being chosen rationally, means that all coefficients that are affected by centering will be specific to your particular sample. If you center at the mean then someone attempting to replicate your study must center at your mean, not their own mean, if they want to get the same coefficients that you got. The solution to this problem is to center each variable at a rationally chosen central value of that variable that depends on the meaning of the scores and does not depend on the distribution of the scores. However, the b -> B computations still remain necessary.

The significance of the overall effects may be tested by the usual procedures for testing linear combinations of regression coefficients. However, the results must be interpreted with care because the overall effects are not structural parameters but are design-dependent. The structural parameters -- the regression coefficients (uncentered, or with rational centering) and the error variance -- may be expected to remain invariant under changes in the distribution of the predictors, but the overall effects will generally change. The overall effects are specific to the particular sample and should not be expected to carry over to other samples with different distributions on the predictors. If an overall effect is significant in one study and not in another, it may reflect nothing more than a difference in the distribution of the predictors. In particular, it should not be taken as evidence that the relation of the dependent variable to the predictors is different in the two studies.
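One way to sketch the "usual procedure" for a linear combination, under the assumption that the weights (the sample means) are treated as fixed constants: with weight vector w, the estimate is w'b and its variance is w' Cov(b) w. The data and names below are simulated and hypothetical.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 300

# Simulated two-way interaction model (illustration only).
x1 = rng.normal(loc=5.0, scale=2.0, size=n)
x2 = rng.normal(loc=-3.0, scale=1.0, size=n)
y = 1.0 + 0.7 * x1 - 0.4 * x2 + 0.3 * x1 * x2 + rng.normal(scale=0.5, size=n)

X = np.column_stack([np.ones(n), x1, x2, x1 * x2])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Classical OLS covariance of the coefficients: sigma^2 * (X'X)^{-1}.
resid = y - X @ coef
dof = n - X.shape[1]
sigma2 = resid @ resid / dof
cov_b = sigma2 * np.linalg.inv(X.T @ X)

# B1 = b1 + b12 * M[x2], written as a linear combination w'b.
w = np.array([0.0, 1.0, 0.0, x2.mean()])
B1 = w @ coef
se_B1 = np.sqrt(w @ cov_b @ w)
t_stat = B1 / se_B1   # compare with a t distribution on `dof` degrees of freedom

print(B1, se_B1, t_stat)
```

In this two-way model the same standard error is what a regression program would report for the x1 coefficient after centring x2 at its mean, which is a convenient cross-check.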

Solved – Centering data in multiple regression

With continuous dependent variables, you can center these too if you want. Just don't forget that your predicted values have had the mean subtracted from them; otherwise, you should be able to interpret the results normally. If you're not sure whether you want to center in a case like this, or want to consider other issues, you might find this question useful: When conducting multiple regression, when should you center your predictor variables & when should you standardize them?

With categorical variables, the mean may not be appropriate to use for centering, and the data may not be appropriate for fitting a multiple regression model with ordinary least squares. When averaging a reasonably large number of Likert scale responses (say, across five or more items) with a reasonably wide set of options (five options might be enough), you might be okay in using the mean, but you should probably check whether your response frequencies for each item seem to be approximating a normal distribution (i.e., not a distribution with strong skew, excess kurtosis, a bimodal shape, etc.). When you average them across your set of items, check again to make sure these scores seem roughly normal.

If they're not, you might need to explore other methods for handling ordinal data in regression. Item response theory models like the rating scale model might be more suitable. You could also try fitting a structural equation model that relates the latent factors represented by your Likert rated items to your dependent variables using a polychoric correlation matrix. You might find my answer to a related question useful for this.
