Solved – Should I re-center variables when looking at moderator effect in men and women separately

centering, group-differences, interaction, regression

I want to see if an interaction variable in a multiple regression is significant for the whole sample, and then just for men and just for women. When I created the interaction variable for the whole sample, I centered the interaction components by subtracting the mean for the whole sample.

Now, when I want to look at men and women separately, should I recalculate male- and female-specific centered and interaction variables, centering the interaction components at the respective male and female sample means?

Best Answer

Centring: Centring does not change the significance of the r-square change of your interaction effect. It also will not change the values you get for a simple slopes analysis.

Thus, for most purposes it does not matter whether you centre or not. This applies both to the general analysis, and to the subgroup analysis.
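To make the simple-slopes point concrete, here is a minimal sketch on simulated data. The variable names, sample size and coefficients are made up for illustration and are not from the question. It computes the slope of iv1 at one standard deviation above the mean of iv2, once from the uncentered fit and once from the mean-centered fit.

```r
# Simulated data; all names and numbers here are illustrative assumptions
set.seed(1)
n   <- 200
iv1 <- rnorm(n, 10, 2)
iv2 <- rnorm(n, 5, 1)
dv  <- 1 + 0.5 * iv1 + 0.3 * iv2 + 0.2 * iv1 * iv2 + rnorm(n)

# Mean-centered copies of the predictors
c1 <- iv1 - mean(iv1)
c2 <- iv2 - mean(iv2)

raw <- lm(dv ~ iv1 * iv2)  # uncentered model
cen <- lm(dv ~ c1 * c2)    # centered model

# Simple slope of iv1 at iv2 = mean + 1 SD, expressed on each model's own scale
v         <- mean(iv2) + sd(iv2)
slope_raw <- coef(raw)[["iv1"]] + coef(raw)[["iv1:iv2"]] * v
slope_cen <- coef(cen)[["c1"]]  + coef(cen)[["c1:c2"]]  * (v - mean(iv2))

print(c(raw = slope_raw, centered = slope_cen))
```

The two slopes agree up to floating-point error because the centered model is only a reparameterization of the uncentered one.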

The main benefit of centring is that it can make the interpretation of the regression coefficients a little easier. If you want to compare the absolute size of these coefficients across males and females, then you should only centre once.

Prefer integrated models: A better suggestion is to include gender in your overall multiple regression. For example, if you have DV, IV1, IV2 and gender, and you are interested in the IV1 * IV2 interaction for each gender, I'd examine various models such as:

DV ~ IV1 + IV2 + gender
DV ~ IV1 * IV2 + gender
DV ~ IV1 * IV2 + gender * IV1 + gender*IV2
DV ~ IV1 * IV2 * gender

If you get a significant gender by something interaction, then you may wish to further explore this using separate analyses, but I'd start with the overall integrated model.
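As a rough sketch of how that sequence of models could be compared, the code below fits the four formulas to the `survey` data from MASS (the same data set used in the illustration further down) and runs sequential F-tests. The column-to-role mapping (Wr.Hnd as iv1, NW.Hnd as iv2, Pulse as dv) is purely illustrative, and here missing values are dropped only on the four columns used, which is an assumption of this sketch.

```r
library(MASS)  # provides the `survey` data set

# Keep only the columns needed and drop incomplete rows (illustrative choice)
d <- na.omit(survey[, c("Sex", "Wr.Hnd", "NW.Hnd", "Pulse")])
names(d) <- c("gender", "iv1", "iv2", "dv")

m1 <- lm(dv ~ iv1 + iv2 + gender, data = d)                        # main effects only
m2 <- lm(dv ~ iv1 * iv2 + gender, data = d)                        # adds iv1:iv2
m3 <- lm(dv ~ iv1 * iv2 + gender * iv1 + gender * iv2, data = d)   # adds gender two-ways
m4 <- lm(dv ~ iv1 * iv2 * gender, data = d)                        # adds the three-way term

# Each row tests the terms added relative to the previous model
model_comparison <- anova(m1, m2, m3, m4)
print(model_comparison)
```

The last row is the one that speaks to whether the IV1 * IV2 interaction differs by gender.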

Illustrating points about centered predictors:

The following code returns the p-value of the r-square change and the final r-square for both an uncentered and three centred versions (global, female centred, male centred) of an interaction effect model.

library(MASS)
survey <- na.omit(survey)
head(survey)

x <- survey[, c('Sex', 'Wr.Hnd', 'NW.Hnd', 'Pulse')]
names(x) <- c('gender', 'iv1', 'iv2', 'dv')
x$scaled_iv1 <- scale(x$iv1, scale=FALSE)
x$scaled_iv2 <- scale(x$iv2, scale=FALSE)
x$female_scaled_iv1 <- scale(x$iv1, center=mean(x[x$gender == "Female", 'iv1']), scale=FALSE)
x$female_scaled_iv2 <- scale(x$iv2, center=mean(x[x$gender == "Female", 'iv2']), scale=FALSE)
x$male_scaled_iv1 <- scale(x$iv1, center=mean(x[x$gender == "Male", 'iv1']), scale=FALSE)
x$male_scaled_iv2 <- scale(x$iv2, center=mean(x[x$gender == "Male", 'iv2']), scale=FALSE)

compare_fits <- function(x) {
    fit1 <- lm(dv ~ iv1+iv2, x)
    fit2 <- lm(dv ~ iv1*iv2, x)
    fit3 <- lm(dv ~ scaled_iv1*scaled_iv2, x)
    fit4 <- lm(dv ~ male_scaled_iv1*male_scaled_iv2, x)
    fit5 <- lm(dv ~ female_scaled_iv1*female_scaled_iv2, x)
    results <- list()
    results$p_normal <- anova(fit1, fit2)[2,6]
    results$p_centered <- anova(fit1, fit3)[2,6]
    results$p_centered_male <- anova(fit1, fit4)[2,6]
    results$p_centered_female <- anova(fit1, fit5)[2,6]
    results$rsq_normal <- summary(fit2)$r.squared
    results$rsq_centered <- summary(fit3)$r.squared
    results$rsq_centered_male <- summary(fit4)$r.squared
    results$rsq_centered_female <- summary(fit5)$r.squared
    unlist(results)
}

# The following results report p-values and rsq for final model
# using normal (i.e., uncentered) and centered predictors
compare_fits(x)
compare_fits(x[x$gender=='Male', ])
compare_fits(x[x$gender=='Female', ])

The results show that the p-values and r-squares do not vary across uncentered and centered analyses.

> compare_fits(x)
           p_normal          p_centered     p_centered_male   p_centered_female          rsq_normal 
        0.241816265         0.241816265         0.241816265         0.241816265         0.009982317 
       rsq_centered   rsq_centered_male rsq_centered_female 
        0.009982317         0.009982317         0.009982317 
> compare_fits(x[x$gender=='Male', ])
               p_normal          p_centered     p_centered_male   p_centered_female          rsq_normal 
             0.14034102          0.14034102          0.14034102          0.14034102          0.03055692 
           rsq_centered   rsq_centered_male rsq_centered_female 
             0.03055692          0.03055692          0.03055692 
> compare_fits(x[x$gender=='Female', ])
           p_normal          p_centered     p_centered_male   p_centered_female          rsq_normal 
          0.5196788           0.5196788           0.5196788           0.5196788           0.0128802 
       rsq_centered   rsq_centered_male rsq_centered_female 
          0.0128802           0.0128802           0.0128802

Related Solutions

Solved – Interaction term using centered variables hierarchical regression analysis? What variables should we center

You should center the terms involved in the interaction to reduce collinearity, e.g.:

set.seed(10204)
x1 <- rnorm(1000, 10, 1)
x2 <- rnorm(1000, 10, 1)
y <- x1 + rnorm(1000, 5, 5)  + x2*rnorm(1000) + x1*x2*rnorm(1000) 

x1cent <- x1 - mean(x1)
x2cent <- x2 - mean(x2)
x1x2cent <- x1cent*x2cent

m1 <- lm(y ~ x1 + x2 + x1*x2)
m2 <- lm(y ~ x1cent + x2cent + x1cent*x2cent)

summary(m1)
summary(m2)

Output:

> summary(m1)

Call:
lm(formula = y ~ x1 + x2 + x1 * x2)

Residuals:
    Min      1Q  Median      3Q     Max 
-344.62  -66.29   -1.44   66.05  392.22 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  193.333    335.281   0.577    0.564
x1           -15.830     33.719  -0.469    0.639
x2           -14.065     33.567  -0.419    0.675
x1:x2          1.179      3.375   0.349    0.727

Residual standard error: 101.3 on 996 degrees of freedom
Multiple R-squared:  0.002363,  Adjusted R-squared:  -0.0006416 
F-statistic: 0.7865 on 3 and 996 DF,  p-value: 0.5015

> summary(m2)

Call:
lm(formula = y ~ x1cent + x2cent + x1cent * x2cent)

Residuals:
    Min      1Q  Median      3Q     Max 
-344.62  -66.29   -1.44   66.05  392.22 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)     12.513      3.203   3.907 9.99e-05 ***
x1cent          -4.106      3.186  -1.289    0.198    
x2cent          -2.291      3.198  -0.716    0.474    
x1cent:x2cent    1.179      3.375   0.349    0.727    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 101.3 on 996 degrees of freedom
Multiple R-squared:  0.002363,  Adjusted R-squared:  -0.0006416 
F-statistic: 0.7865 on 3 and 996 DF,  p-value: 0.5015


library(perturb)
colldiag(m1)
colldiag(m2)
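For a quick look at the collinearity point without the perturb package, a rough base-R sketch is to compare the correlation between a predictor and the product term before and after centering. This is a simpler diagnostic than the condition indexes that colldiag reports, and the data are re-simulated here with an arbitrary seed so the block runs on its own.

```r
# Re-simulated predictors on the same scale as above; the seed is arbitrary
set.seed(123)
x1 <- rnorm(1000, 10, 1)
x2 <- rnorm(1000, 10, 1)

x1cent <- x1 - mean(x1)
x2cent <- x2 - mean(x2)

# Correlation of a predictor with the product term, raw vs. centered
cor_raw  <- cor(x1, x1 * x2)
cor_cent <- cor(x1cent, x1cent * x2cent)

print(c(raw = cor_raw, centered = cor_cent))
```

With means far from zero the raw product term is strongly correlated with its components; after centering that correlation is close to zero.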

Whether you center other variables is up to you; centering (as opposed to standardizing) a variable that is not involved in an interaction will change the meaning of the intercept, but not other things, e.g.:

x1 <- rnorm(1000, 10, 1)
x2 <- x1 - mean(x1)
y <- x1 + rnorm(1000, 5, 5) 
m1 <- lm(y ~ x1)
m2 <- lm(y ~ x2)

summary(m1)
summary(m2)

Output:

> summary(m1)

Call:
lm(formula = y ~ x1)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.5288  -3.3348   0.0946   3.4293  14.0678 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   6.5412     1.6003   4.087 4.71e-05 ***
x1            0.8548     0.1591   5.373 9.63e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.082 on 998 degrees of freedom
Multiple R-squared:  0.02812,   Adjusted R-squared:  0.02714 
F-statistic: 28.87 on 1 and 998 DF,  p-value: 9.629e-08

> summary(m2)

Call:
lm(formula = y ~ x2)

Residuals:
     Min       1Q   Median       3Q      Max 
-16.5288  -3.3348   0.0946   3.4293  14.0678 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  15.0965     0.1607  93.931  < 2e-16 ***
x2            0.8548     0.1591   5.373 9.63e-08 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.082 on 998 degrees of freedom
Multiple R-squared:  0.02812,   Adjusted R-squared:  0.02714 
F-statistic: 28.87 on 1 and 998 DF,  p-value: 9.629e-08

But you should take logs of variables because it makes sense to do so or because the residuals from the model indicate that you should, not because they have a lot of variability. Regression does not make assumptions about the distribution of the variables; it makes assumptions about the distribution of the residuals.

Solved – Why could centering independent variables change the main effects with moderation

In models with no interaction terms (that is, with no terms that are constructed as the product of other terms), each variable's regression coefficient is the slope of the regression surface in the direction of that variable. It is constant, regardless of the values of the variables, and therefore can be said to measure the overall effect of that variable.

In models with interactions, this interpretation can be made without further qualification only for those variables that are not involved in any interactions. For a variable that is involved in interactions, the "main-effect" regression coefficient -- that is, the regression coefficient of the variable by itself -- is the slope of the regression surface in the direction of that variable when all other variables that interact with that variable have values of zero, and the significance test of the coefficient refers to the slope of the regression surface only in that region of the predictor space. Since there is no requirement that there actually be data in that region of the space, the main-effect coefficient may bear little resemblance to the slope of the regression surface in the region of the predictor space where data were actually observed.

In anova terms, the main-effect coefficient is analogous to a simple main effect, not an overall main effect. Moreover, it may refer to what in an anova design would be empty cells in which the data were supplied by extrapolating from cells with data.

For a measure of the overall effect of the variable that is analogous to an overall main effect in anova and does not extrapolate beyond the region in which data were observed, we must look at the average slope of the regression surface in the direction of the variable, where the averaging is over the N cases that were actually observed. This average slope can be expressed as a weighted sum of the regression coefficients of all the terms in the model that involve the variable in question.

The weights are awkward to describe but easy to get. A variable's main-effect coefficient always gets a weight of 1. For each other coefficient of a term involving that variable, the weight is the mean of the product of the other variables in that term. For example, if we have five "raw" variables x1, x2, x3, x4, x5, plus four two-way interactions (x1,x2), (x1,x3), (x2,x3), (x4,x5), and one three-way interaction (x1,x2,x3), then the model is

y = b0 + b1*x1 + b2*x2 + b3*x3 + b4*x4 + b5*x5 +
    b12*x1*x2 + b13*x1*x3 + b23*x2*x3 + b45*x4*x5 +
    b123*x1*x2*x3 + e

and the overall main effects are

B1 = b1 + b12*M[x2] + b13*M[x3] + b123*M[x2*x3],

B2 = b2 + b12*M[x1] + b23*M[x3] + b123*M[x1*x3],

B3 = b3 + b13*M[x1] + b23*M[x2] + b123*M[x1*x2],

B4 = b4 + b45*M[x5],

B5 = b5 + b45*M[x4],

where M[.] denotes the sample mean of the quantity inside the brackets. All the product terms inside the brackets are among those that were constructed in order to do the regression, so a regression program should already know about them and should be able to print their means on request.
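The B1 formula above can be checked numerically. The sketch below is only an illustration on simulated data (three variables rather than five, with made-up coefficients): it builds B1 from the fitted coefficients and the sample means, then compares it with the average slope obtained by shifting x1 by one unit for every observed case.

```r
# Simulated data; names, means and coefficients are illustrative assumptions
set.seed(42)
n <- 500
d <- data.frame(x1 = rnorm(n, 3), x2 = rnorm(n, 2), x3 = rnorm(n, 1))
d$y <- with(d, 1 + x1 + x2 + x3 + 0.5 * x1 * x2 + 0.2 * x1 * x2 * x3 + rnorm(n))

fit <- lm(y ~ x1 * x2 * x3, data = d)
b   <- coef(fit)

# B1 = b1 + b12*M[x2] + b13*M[x3] + b123*M[x2*x3]
B1 <- b[["x1"]] +
  b[["x1:x2"]]    * mean(d$x2) +
  b[["x1:x3"]]    * mean(d$x3) +
  b[["x1:x2:x3"]] * mean(d$x2 * d$x3)

# Average slope in the x1 direction over the observed cases:
# the model is linear in x1, so a one-unit shift gives the slope exactly
shifted   <- transform(d, x1 = x1 + 1)
avg_slope <- mean(predict(fit, shifted) - predict(fit, d))

print(c(B1 = B1, avg_slope = avg_slope))
```

The two numbers agree up to rounding, which is just the statement that B1 is the slope averaged over the N observed cases.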

In models that have only main effects and two-way interactions, there is a simpler way to get the overall effects: center[1] the raw variables at their means. This is to be done prior to computing the product terms, and is not to be done to the products. Then all the M[.] expressions will become 0, and the regression coefficients will be interpretable as overall effects. The values of the b's will change; the values of the B's will not. Only the variables that are involved in interactions need to be centered, but there is usually no harm in centering other measured variables. The general effect of centering a variable is that, in addition to changing the intercept, it changes only the coefficients of other variables that interact with the centered variable. In particular, it does not change the coefficients of any terms that involve the centered variable. In the example given above, centering x1 would change b0, b2, b3, and b23.

[1 -- "Centering" is used by different people in ways that differ just enough to cause confusion. As used here, "centering a variable at #" means subtracting # from all the scores on the variable, converting the original scores to deviations from #.]
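A small sketch of the two-way case, again on simulated data with made-up names and coefficients: after centering both variables at their means, the coefficient of the centered x1 equals the B1 computed from the uncentered fit, while the interaction coefficient is unchanged.

```r
# Simulated data; all values are illustrative assumptions
set.seed(7)
n  <- 300
x1 <- rnorm(n, 10)
x2 <- rnorm(n, 4)
y  <- 2 + x1 + 0.5 * x2 + 0.3 * x1 * x2 + rnorm(n)

# Uncentered fit and the overall effect B1 = b1 + b12*M[x2]
raw <- lm(y ~ x1 * x2)
B1  <- coef(raw)[["x1"]] + coef(raw)[["x1:x2"]] * mean(x2)

# Center the raw variables first; lm then forms the product from the centered variables
x1c <- x1 - mean(x1)
x2c <- x2 - mean(x2)
cen <- lm(y ~ x1c * x2c)

b1_centered <- coef(cen)[["x1c"]]
print(c(B1_from_raw = B1, b1_centered = b1_centered))

# The highest-order coefficient does not change under centering
print(c(raw = coef(raw)[["x1:x2"]], centered = coef(cen)[["x1c:x2c"]]))
```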

So why not always center at the means, routinely? Three reasons. First, the main-effect coefficients of the uncentered variables may themselves be of interest. Centering in such cases would be counter-productive, since it changes the main-effect coefficients of other variables.

Second, centering will make all the M[.] expressions 0, and thus convert simple effects to overall effects, only in models with no three-way or higher interactions. If the model contains such interactions then the b -> B computations must still be done, even if all the variables are centered at their means.
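To see why, note that for mean-centered variables M[x2*x3] is (up to the divisor) the sample covariance of x2 and x3, which is generally not zero. The sketch below uses simulated data in which x2 and x3 are deliberately correlated (an assumption made to make the gap visible) and shows that the centered main-effect coefficient still differs from B1.

```r
# Simulated data with correlated x2 and x3; all values are illustrative
set.seed(11)
n  <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
x3 <- 0.6 * x2 + rnorm(n)
y  <- 1 + x1 + x2 + x3 + 0.4 * x1 * x2 * x3 + rnorm(n)

# Center every variable at its sample mean
x1c <- x1 - mean(x1)
x2c <- x2 - mean(x2)
x3c <- x3 - mean(x3)

fit <- lm(y ~ x1c * x2c * x3c)
b   <- coef(fit)

# M[x2c] and M[x3c] are zero, but M[x2c*x3c] is not
m23 <- mean(x2c * x3c)

B1 <- b[["x1c"]] +
  b[["x1c:x2c"]]     * mean(x2c) +
  b[["x1c:x3c"]]     * mean(x3c) +
  b[["x1c:x2c:x3c"]] * m23

gap <- B1 - b[["x1c"]]
print(c(mean_x2c_x3c = m23, b1 = b[["x1c"]], B1 = B1, gap = gap))
```

The gap is the three-way coefficient times M[x2c*x3c], so it disappears only when there is no three-way term or when x2 and x3 happen to be uncorrelated in the sample.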

Third, centering at a value such as the mean, that is defined by the distribution of the predictors as opposed to being chosen rationally, means that all coefficients that are affected by centering will be specific to your particular sample. If you center at the mean then someone attempting to replicate your study must center at your mean, not their own mean, if they want to get the same coefficients that you got. The solution to this problem is to center each variable at a rationally chosen central value of that variable that depends on the meaning of the scores and does not depend on the distribution of the scores. However, the b -> B computations still remain necessary.

The significance of the overall effects may be tested by the usual procedures for testing linear combinations of regression coefficients. However, the results must be interpreted with care because the overall effects are not structural parameters but are design-dependent. The structural parameters -- the regression coefficients (uncentered, or with rational centering) and the error variance -- may be expected to remain invariant under changes in the distribution of the predictors, but the overall effects will generally change. The overall effects are specific to the particular sample and should not be expected to carry over to other samples with different distributions on the predictors. If an overall effect is significant in one study and not in another, it may reflect nothing more than a difference in the distribution of the predictors. In particular, it should not be taken as evidence that the relation of the dependent variable to the predictors is different in the two studies.
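One way to carry out such a test in R is sketched below on simulated data (names and coefficients are made up). It writes B1 as a weighted sum w'b, takes the standard error from the coefficient covariance matrix, and forms a t statistic. Treating the sample mean of x2 as a fixed weight is an assumption of this sketch, in line with the design-dependent reading described above.

```r
# Simulated data; all values are illustrative assumptions
set.seed(3)
n  <- 300
x1 <- rnorm(n, 10)
x2 <- rnorm(n, 4)
y  <- 2 + 0.2 * x1 + 0.5 * x2 + 0.3 * x1 * x2 + rnorm(n)

fit <- lm(y ~ x1 * x2)

# Weights for B1 = b1 + b12*M[x2], in the same order as coef(fit)
w <- c("(Intercept)" = 0, x1 = 1, x2 = 0, "x1:x2" = mean(x2))

B1    <- sum(w * coef(fit))
se_B1 <- sqrt(drop(t(w) %*% vcov(fit) %*% w))  # SE of a linear combination
t_B1  <- B1 / se_B1
p_B1  <- 2 * pt(abs(t_B1), df = df.residual(fit), lower.tail = FALSE)

print(c(estimate = B1, se = se_B1, t = t_B1, p = p_B1))
```

The same estimate and standard error appear as the x1 row of the coefficient table if x2 is centered at its sample mean before fitting, which is a convenient cross-check.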
