Solved – Centering data in multiple regression

centering, multiple regression

In a multiple regression analysis (with 4 continuous predictors and 2 categorical factors), we mean-centered each continuous variable because of the multicollinearity that arises when the interaction terms are included.

My question is whether I can center the response variable too.

More specifically, the response variable and the 4 continuous predictors are all averaged survey responses (on scales of 1 to 5). I originally thought that if I were centering the explanatory variables, I might as well center the response, but I realize that most references to centering seem to apply only to the predictor variables.

Any help is much appreciated.

Update: as I mentioned in an earlier comment, I questioned the validity of centering my response variable because the ANOVA results differed between the centered and non-centered response. My interpretation is that the linear model for a non-centered response (using lm() in R) uses the mean of the 'reference level' of my two factors to compute the intercept. When I centered the response variable, the grand mean of this variable was subtracted from it, rather than the mean of the reference level. I have now verified this by 'centering' with the reference-level mean, which does indeed yield the same p-values as the non-centered response model. I hope I am interpreting this issue correctly. If someone could further confirm or clarify this, I would really appreciate it.

Best Answer

With continuous dependent variables, you can center these too if you want. Just don't forget that your predicted values have had the mean subtracted from them; otherwise, you should be able to interpret the results normally. If you're not sure whether you want to center in a case like this, or want to consider other issues, you might find this question useful: When conducting multiple regression, when should you center your predictor variables & when should you standardize them?
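To illustrate the point about centering the response, here is a minimal sketch in Python with NumPy (not the asker's R code). It assumes simulated data, one two-level factor with treatment (dummy) coding, and made-up variable names such as `x1`, `x2`, and `group`. It fits the same model to the raw and the mean-centered response and compares the coefficients and residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Hypothetical data: two continuous predictors and one two-level factor.
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
group = rng.integers(0, 2, size=n)  # 0 = reference level (treatment coding)
y = 3.0 + 0.5 * x1 - 0.3 * x2 + 0.8 * group + rng.normal(scale=0.5, size=n)

# Design matrix: intercept, centered x1, centered x2, dummy for the factor.
X = np.column_stack([np.ones(n), x1 - x1.mean(), x2 - x2.mean(), group])


def ols(X, y):
    """Return least-squares coefficients and residuals."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta, y - X @ beta


beta_raw, resid_raw = ols(X, y)             # non-centered response
beta_ctr, resid_ctr = ols(X, y - y.mean())  # response centered at its grand mean

# Subtracting a constant from y is absorbed entirely by the intercept.
intercept_shift = beta_raw[0] - beta_ctr[0]
slopes_match = np.allclose(beta_raw[1:], beta_ctr[1:])
residuals_match = np.allclose(resid_raw, resid_ctr)

print("intercept shift:", intercept_shift, "grand mean of y:", y.mean())
print("other coefficients identical:", slopes_match)
print("residuals identical:", residuals_match)
```

Because the residuals are the same in both fits, the standard errors are the same too, so the tests for the slopes and factor effects should not change. Only the test of whether the intercept equals zero differs, since the intercept itself has been shifted by the subtracted constant. To get predictions on the original scale, add the subtracted mean back.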

With categorical variables, the mean may not be appropriate to use for centering, and the data may not be appropriate for fitting a multiple regression model with ordinary least squares. When averaging a reasonably large number of Likert scale responses (say, across five or more items) with a reasonably wide set of options (five options might be enough), you might be okay in using the mean, but you should probably check whether your response frequencies for each item seem to approximate a normal distribution (i.e., not a distribution with strong skew, excess kurtosis, a bimodal shape, etc.). When you average them across your set of items, check again to make sure these scores seem roughly normal.
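The checks described above could be sketched roughly as follows, again in Python with NumPy. The survey here is simulated (300 hypothetical respondents, 6 items scored 1 to 5), and the helpers `skewness` and `excess_kurtosis` are illustrative moment-based definitions, not functions from any particular package.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical survey: 300 respondents, 6 Likert items, each scored 1 to 5.
items = rng.integers(1, 6, size=(300, 6))
scale_scores = items.mean(axis=1)  # averaged score per respondent


def skewness(x):
    """Moment-based sample skewness (0 for a symmetric distribution)."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))


def excess_kurtosis(x):
    """Moment-based kurtosis minus 3 (0 for a normal distribution)."""
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)


# Step 1: response frequencies for each item (counts for options 1..5).
item_counts = [np.bincount(items[:, j], minlength=6)[1:] for j in range(items.shape[1])]
for j, counts in enumerate(item_counts, start=1):
    print(f"item {j}: counts for options 1-5 = {counts.tolist()}")

# Step 2: shape of the averaged scores.
print("skewness of averaged scores:", skewness(scale_scores))
print("excess kurtosis of averaged scores:", excess_kurtosis(scale_scores))
```

Values of skewness and excess kurtosis near zero are consistent with a roughly normal shape, but they will not reveal bimodality on their own, so a histogram of the averaged scores is still worth looking at.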

If they're not, you might need to explore other methods for handling ordinal data in regression. Item response theory models like the rating scale model might be more suitable. You could also try fitting a structural equation model that relates the latent factors represented by your Likert rated items to your dependent variables using a polychoric correlation matrix. You might find my answer to a related question useful for this.
