Solved – Mean centering – before regression or observations that enter regression

centering, data transformation, interaction

I am using Stata 13 to estimate a simple model with interaction terms. To give the coefficients a meaningful interpretation at zero, and to avoid multicollinearity, I am mean centering variables.

I am wondering when to do this: before estimating the regression, using all observations, or only over the observations that actually enter the regression? The question stems from the missing-data structure of my data: the mean of the centered variable is not zero when calculated over the observations that actually entered the regression.

Maybe an example helps in making the point:

clear
set more off

sysuse auto.dta

*Randomly replace weight with missings 
gen tomis = ceil(10*runiform())
replace weight=. if tomis==1

*Center mpg
sum mpg, meanonly
gen cmpg = mpg-r(mean)

*Regression 
qui reg price cmpg weight foreign
qui gen sample = 1 if e(sample)

*Center mpg when in sample
sum mpg if sample==1, meanonly
gen cmpgs= mpg-r(mean)

*Sums
sum mpg cmpg cmpgs
sum mpg cmpg cmpgs if sample==1

In the example above I mean-center mpg to create cmpg, so the mean of cmpg is (close to) zero over all observations. However, the mean of cmpg is 0.278 over the observations that entered the regression. Does that make sense, or should I center based on the observations that enter the regression, as I do when generating cmpgs?

Best Answer

The p-value for the two versions of cmpg will be the same, so whether you center before or after restricting to the regression sample is only a matter of choice. You don't need a poll to decide; as long as you explain the choice clearly in the Methods section, you're all set.
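To see why the choice of centering constant does not matter for the slope, here is a short derivation. The symbols are my own notation, not the asker's: $x$ is mpg, $c_1$ is its mean over all observations, $c_2$ is its mean over the estimation sample, and $z$ collects the remaining regressors.

```latex
\begin{aligned}
y &= \beta_0 + \beta_1 (x - c_1) + \gamma^\top z + \varepsilon \\
  &= \beta_0 + \beta_1 \bigl[(x - c_2) + (c_2 - c_1)\bigr] + \gamma^\top z + \varepsilon \\
  &= \underbrace{\bigl[\beta_0 + \beta_1 (c_2 - c_1)\bigr]}_{\text{new intercept}}
     + \beta_1 (x - c_2) + \gamma^\top z + \varepsilon
\end{aligned}
```

The two centered variables differ only by the constant $c_2 - c_1$, which is absorbed into the intercept, so $\beta_1$, its standard error, and its p-value are unchanged. One caveat, since the question mentions interaction terms: if the model also contains a product such as $(x - c)\,z$ with coefficient $\beta_3$, the coefficients on the centered variable and on the product stay the same, but the coefficient on $z$ shifts by $\beta_3 (c_2 - c_1)$, because it describes the effect of $z$ at $x = c$.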

Practically, I would slightly favor centering the variables using only the cases that will be in the model (that is, after listwise deletion). The reason is that it is much more natural to read:

Cases with missing values were excluded from this analysis. Continuous independent variables were then centered at their means before the regression analysis.

than:

Continuous independent variables were centered at their means. Cases with missing values were excluded from the analysis.

Both will give the same slope and p-value for cmpg, but the second is more likely to confuse readers who know enough about the technique to notice the order of the steps, but not enough to realize that the two methods give nearly the same results.
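This is easy to check on the auto data from the question. The following is only a sketch: the seed, the 10% missingness rate, the marker variable touse, and the stored-estimate names are arbitrary choices of mine, not part of the original example.

```stata
* Sketch only: seed and missingness rate are arbitrary assumptions
sysuse auto, clear
set seed 12345
replace weight = . if runiform() < 0.1

* Version 1: center mpg on all observations
summarize mpg, meanonly
generate cmpg = mpg - r(mean)

* Version 2: center mpg on complete cases only (after listwise deletion)
mark touse
markout touse price mpg weight foreign
summarize mpg if touse, meanonly
generate cmpgs = mpg - r(mean)

* Fit both models and show coefficients, standard errors and p-values
regress price cmpg weight foreign
estimates store full_center
regress price cmpgs weight foreign
estimates store sample_center
estimates table full_center sample_center, se p
```

In the resulting table, the cmpg row of the first model and the cmpgs row of the second should agree, as should the weight and foreign rows; only _cons should differ. Flagging complete cases with mark and markout before centering also avoids having to run the regression once just to recover e(sample).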

However, given the missing structure in my data I hardly know if the sample is in any way representative.

Not knowing the nature of the missingness is actually a much bigger issue here, although it is not the focus of the question. Without that knowledge, or at least an explicit assumption about it, you cannot judge how large the resulting bias might be, and it could be very large.