I am using Stata 13 to estimate a simple model with interaction terms. To give the coefficients a meaningful interpretation at zero, and to reduce multicollinearity, I am mean-centering variables.
I am wondering when to do this: before estimating the regression, or only over the observations that enter the regression? The question stems from the missingness structure of my data, because the mean of the centered variable is not zero when calculated over the observations that actually entered the regression.
Maybe an example helps in making the point:
clear
set more off
sysuse auto.dta
*Randomly replace weight with missings
gen tomis = ceil(10*runiform())
replace weight = . if tomis==1
*Center mpg using all observations
sum mpg, meanonly
gen cmpg = mpg - r(mean)
*Regression
qui reg price cmpg weight foreign
gen byte sample = e(sample)
*Center mpg using only the estimation sample
sum mpg if sample==1, meanonly
gen cmpgs = mpg - r(mean)
*Sums
sum mpg cmpg cmpgs
sum mpg cmpg cmpgs if sample==1
In the example above I mean-center mpg to create cmpg, so the mean of cmpg is (close to) zero over all observations. However, the mean of cmpg is 0.278 over the observations that entered the regression. Does that make sense, or should I center based on the observations that enter the regression, as I do when generating cmpgs?
Best Answer
The slope and p-value of the two versions of cmpg will be the same, so whether you center before or after fitting the regression is a matter of choice. You don't need any poll to decide: as long as you explain your choice clearly in the Methods section, you're all set. Practically, I would slightly favor centering using only the cases that will be in the model (i.e., after list-wise deletion). The reason is that it's a lot more natural to read the output of

reg price cmpgs weight foreign

than that of

reg price cmpg weight foreign
Both will give the same slope and p-value for the centered variable, but the second is more likely to confuse readers who understand enough about this technique to notice the nonzero in-sample mean, but not enough to realize the two versions are equivalent apart from the intercept. Not knowing the nature of the missingness is actually a much bigger issue here, although it is not the focus of the question. Without that knowledge, or at least an explicit assumption about it, potentially very large biases can go unrecognized.