Solved – p-values change after mean centering with interaction terms. How to test for significance

centering, interaction, linear model, multiple regression, statistical significance

I assumed the following interaction model:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_2 x_3$$

And then applied mean centering:

$$y = \beta_0 + \beta_1(x_1 - \bar{x}_1) + \beta_2(x_2 - \bar{x}_2) + \beta_3(x_3 - \bar{x}_3) + \beta_4(x_2 - \bar{x}_2)(x_3 - \bar{x}_3)$$

I ran a linear regression analysis with the statsmodels library in Python. The following is the result I obtained:

[statsmodels OLS summary output for the original and the mean-centered model]
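Roughly, the setup looked like this (a simplified sketch; the data loading and column names are only illustrative):

    import pandas as pd
    import statsmodels.formula.api as smf

    # df holds the predictors x1, x2, x3 and the response y (illustrative names)
    df = pd.read_csv("data.csv")  # placeholder for however the data is loaded

    # Original model with an x2*x3 interaction
    m_raw = smf.ols("y ~ x1 + x2 + x3 + x2:x3", data=df).fit()

    # Mean-centered model: center each predictor, then refit the same formula
    centered = df.copy()
    for col in ["x1", "x2", "x3"]:
        centered[col] = df[col] - df[col].mean()
    m_centered = smf.ols("y ~ x1 + x2 + x3 + x2:x3", data=centered).fit()

    print(m_raw.summary())
    print(m_centered.summary())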

In the original model, the output indicates that both $x_2$ and $x_3$ are statistically insignificant, while in the mean-centered model everything is significant.

Let's say my goal is to find out which features have a meaningful impact on predicting $y$. Which p-values should I use to test the significance of the features?

This answer says that:

The reported p-values for the coefficient for z will differ between the uncentered and x-centered models. That might seem troubling at first, but that's OK. The correct test for significance of a predictor involved in an interaction must involve both its individual coefficient and its interaction coefficient, and the result of that test is unchanged by centering.

But I do not understand what it means by "correct test for significance". Can someone explain what he's referring to?

Best Answer

But I do not understand what it means by "correct test for significance". Can someone explain what he's referring to?

If I were you I would post a comment to that answer by @EdM; otherwise, unless they actually see this question and answer it themselves, we can only make an informed guess. Having said that, what I think is meant by that statement is that the model must include both the main effect and the interaction in order to make correct inferences, and that the significance of a predictor involved in an interaction should be assessed by testing its main-effect coefficient and its interaction coefficient jointly, since that joint test is unaffected by centering. There could be some rare cases where it is not necessary to include the main effect, but as a good general rule, you should.
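A minimal sketch of such a joint test in statsmodels (simulated data and illustrative column names; the product term is given an explicit name so it can be referenced in the hypothesis string):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Simulated data purely for illustration; replace df with your own data
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["x1", "x2", "x3"])
    df["y"] = (1 + 0.5 * df.x1 + 0.8 * df.x2 + 0.3 * df.x3
               + 0.6 * df.x2 * df.x3 + rng.normal(size=200))

    # Name the product explicitly so it can be referenced in the constraints
    df["x2x3"] = df.x2 * df.x3
    fit = smf.ols("y ~ x1 + x2 + x3 + x2x3", data=df).fit()

    # Joint test of the x2 main effect and its interaction with x3;
    # the result of this F-test does not change if the predictors are centered
    print(fit.f_test("x2 = 0, x2x3 = 0"))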

Now, looking at the output from your two models, the first thing I notice is:

The condition number is large, 2.17e+03. This might indicate that there are strong multicollinearity or other numerical problems.

and also note that this warning is absent from the centered model.

One consequence of multicollinearity is that it can inflate standard errors, which increases p-values. Your model contains an interaction, which is a product of two other variables. Depending on the scale, there may be a high correlation between the interaction and the variables themselves, and this could inflate the p-values. Centering variables often reduces the correlation between them when nonlinear terms (such as an interaction) are included. Without access to the data itself it is hard to say whether this is what is actually happening, but it is my best informed guess. Your first port of call should be a correlation matrix of all the predictors, which will give you a big hint as to whether this is actually the cause.
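As a rough illustration of that point (simulated predictors with non-zero means, not your data), the correlation between a product term and its components typically drops sharply after centering:

    import numpy as np
    import pandas as pd

    # Simulated predictors with non-zero means, purely for illustration
    rng = np.random.default_rng(1)
    df = pd.DataFrame({
        "x2": rng.normal(loc=10, scale=2, size=500),
        "x3": rng.normal(loc=5, scale=1, size=500),
    })

    def predictor_corr(d):
        """Correlation matrix of the predictors plus their product term."""
        d = d.copy()
        d["x2x3"] = d["x2"] * d["x3"]
        return d.corr().round(2)

    print("Uncentered predictors:")
    print(predictor_corr(df))
    print("Mean-centered predictors:")
    print(predictor_corr(df - df.mean()))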

However, further inspection of the output reveals that the R-squared for both models is 1. This indicates that there is a problem somewhere. Without access to the data it is very difficult to see where that might be.

As to why the estimates and p-values for the main effects change after centering: first, note that in a model without an interaction term, mean-centering the variables changes only the intercept. The coefficients and their standard errors for the other variables are unchanged.

In the presence of an interaction, however, the main effects no longer have the same interpretation. Each is interpreted as the change in the outcome for a one-unit change in the variable in question when the other variable it is interacted with is at zero (or, for a categorical variable, at its reference level). This implies that, after centering, the estimates and standard errors for the main effects involved in an interaction will change (and hence their p-values too), because zero has a different meaning after centering, while the estimate and standard error for the interaction itself remain unchanged. In other words, the tests are different. Looking at the output, this is exactly what has happened.

Edit: To provide a better understanding:

To understand the last point more fully we can write out the equations for two simple models, one without centering, and one with centering, with two predictors, $x_1$ and $x_2$ along with their interaction.

Firstly, the original (uncentered) model is:

$$\mathbb{E}[Y] = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1x_2$$

Denote the centered variables by $z_1$ and $z_2$, such that

$$ \begin{align} z_1 &= x_1 - \mu_1 \text{ and} \\ z_2 &= x_2 - \mu_2 \end{align} $$ where $\mu_1$ and $\mu_2$ are the means of $x_1$ and $x_2$ respectively. We can now write the model with centering in terms of the centered variables and the means of the uncentered variables:

$$\mathbb{E}[Y] = \beta_0 + \beta_1 (z_1 + \mu_1) + \beta_2 (z_2 + \mu_2) + \beta_3 (z_1 + \mu_1) (z_2 + \mu_2)$$

Expanding:

$$\mathbb{E}[Y] = \beta_0 + \beta_1 z_1 + \beta_1 \mu_1 + \beta_2 z_2 + \beta_2\mu_2 + \beta_3 z_1 z_2 +\beta_3 z_1 \mu_2 +\beta_3 z_2 \mu_1 + \beta_3 \mu_1 \mu_2 $$

Now, note that $\beta_1 \mu_1$, $\beta_2\mu_2$ and $\beta_3 \mu_1 \mu_2$ are all constant so these can be subsumed into a new intercept, $\gamma_0$, giving:

$$\mathbb{E}[Y] = \gamma_0 + \beta_1 z_1 + \beta_2 z_2 + \beta_3 z_1 z_2 +\beta_3 z_1 \mu_2 +\beta_3 z_2 \mu_1 $$

Rearranging by collecting the terms in $z_1$, $z_2$ and $z_1 z_2$, we arrive at:

$$\mathbb{E}[Y] = \gamma_0 + z_1 (\beta_1 + \beta_3 \mu_2 ) + z_2 (\beta_2 + \beta_3 \mu_1) + z_1 z_2 \beta_3 $$

So, this is the simplified form of the regression model using the centered variables. We can immediately note that:

  • the intercept will be different from the uncentered model, since it is now equal to $ \gamma_0 = \beta_0 + \beta_1 \mu_1 +\beta_2\mu_2 +\beta_3 \mu_1 \mu_2$

  • The test for $z_1$ is comparing $\beta_1 + \beta_3 \mu_2$ to zero, or equivalently testing whether $\beta_1 = -\beta_3 \mu_2$. This is the same as the test for $\beta_1$ in the uncentered model only if $\mu_2$ is zero, which it obviously isn't, otherwise you wouldn't be centering $x_2$ in the first place.

  • Similarly, the test for $z_2$ is comparing $\beta_2 + \beta_3 \mu_1$ to zero, which will only be the same as the test for $\beta_2$ in the uncentered model if $\mu_1$ is zero.

  • The test for $z_1 z_2$ is comparing $\beta_3$ to zero, which is the same as in the uncentered model.

Again, inspecting the output of both models, this is exactly what is happening.
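To check this algebra numerically, here is a minimal sketch with simulated data for the two-predictor model above (column names and coefficient values are purely illustrative):

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Simulated data for the two-predictor model used in the derivation
    rng = np.random.default_rng(2)
    df = pd.DataFrame({
        "x1": rng.normal(loc=4, scale=1.5, size=300),
        "x2": rng.normal(loc=7, scale=2.0, size=300),
    })
    df["y"] = (2 + 1.0 * df.x1 - 0.5 * df.x2
               + 0.3 * df.x1 * df.x2 + rng.normal(size=300))

    raw = smf.ols("y ~ x1 + x2 + x1:x2", data=df).fit()

    # Center only the predictors, then refit the same formula
    cdf = df.copy()
    cdf[["x1", "x2"]] = cdf[["x1", "x2"]] - cdf[["x1", "x2"]].mean()
    ctr = smf.ols("y ~ x1 + x2 + x1:x2", data=cdf).fit()

    mu1, mu2 = df["x1"].mean(), df["x2"].mean()
    b = raw.params  # uncentered estimates of beta_0 ... beta_3

    # Centered main effects equal beta_1 + beta_3*mu_2 and beta_2 + beta_3*mu_1
    print(ctr.params["x1"], b["x1"] + b["x1:x2"] * mu2)
    print(ctr.params["x2"], b["x2"] + b["x1:x2"] * mu1)

    # The interaction estimate and its standard error are unchanged
    print(ctr.params["x1:x2"], b["x1:x2"])
    print(ctr.bse["x1:x2"], raw.bse["x1:x2"])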

To sum up, although the two models are the same, i.e. the centered model is just a re-parameterization of the uncentered model, the p-values for the tests of the estimated coefficients of the main effects of the centered variables that are involved in the interaction, and of the intercept, will be different, because they are testing different things. The p-values for the tests of the estimated coefficients of any main effect not involved in an interaction, along with that for the interaction itself, will be unchanged. These are general results. In addition, in your particular data there could also be issues due to multicollinearity, and the fact that R-squared is reported as 1 is also suspicious.
