Solved – Should I use mean centering or not

centering

I am using a logistic regression model. I want to see interaction effect of a continuous Independent variable on the relationship of another binary independent variable and the dependent variable(DV is also binary). Should I use the direct product of the continuous and the binary variable or I should first do mean-centering for the continuous variable in my case?

Just to add here, I have tried both techniques and getting better results with mean-centered DV but I want to know if I should also do mean-centering for binary variables or not?

Best Answer

Don't center the binary variable. That just makes your interpretation more complicated, and the only reason to center a variable is (nowadays) to help the interpretation. It used to be that centering made it easier for computers to estimate the model, but algorithms have improved sufficiently to make that no longer an issue for most models, and even then centering a binary variable would not have helped.

I claimed that the difference in the models is only a matter of interpretation, but otherwise the two models are completely equivalent. This is best discussed using an example. I use Stata, because that is the package I am most familiar with, but this is about interpretation of results, so the discussion applies to any package. First I open some example data, and do some preliminary preparations. In particular I prepared a centered version of the variable grade. Grade is the level of education attained by the respondents measured in years.

In this case I chose not to center at the mean, but at the value 12. This is US data, so 12 years of education corresponds to having finished highschool. Centering at meaningful values within the range of the data is usually preferable over centering at the mean. First, it is clearer to your audience who you are talking about when saying "someone who finished highschool" than "someone with mean level of education". Second, it makes it easier to replicate your results with different data, as the mean will change (a bit) from dataset to dataset, but 12 will remain 12.

. // open example data
. sysuse nlsw88, clear
(NLSW, 1988 extract)

.
. // prepare the data
. gen byte highoc = occupation < 3 if !missing(occupation)
(9 missing values generated)

. label variable highoc "high occupation"

. label define highoc 1 "higher" ///
>                     0 "lower"

. label value highoc highoc

.
. label define south 0 "non-South" 1 "South"

. label value south south
.
. // create a centered version of grade
. gen grade_c = grade - 12
(2 missing values generated)

. label var grade_c "current grade completed (centered at 12)"

Next I estimated a logit model with the original grade variable (without centering). So the constant refers to the odds of being a union member for someone with all 0s on the explanatory variables, so a single person not from the South, with 0 years of education in a lower occupation. For such persons we expect 0.09 union members for every non-union member. For people with 0 years of education this odds increases by a factor 1.39 (or $[1.39-1]\times 100\%=39\%$) if one moves to the South. We needed to add "for people with 0 years of education", because we included the interaction term between south and grade. For every year additional information this effect of south decreases by a factor 0.92 (or $[0.92-1]\times100\%=-8\%$)

. logit union i.south##c.grade i.highoc i.married, or base nolog

Logistic regression                             Number of obs     =      1,867
                                                LR chi2(5)        =     102.23
                                                Prob > chi2       =     0.0000
Log likelihood = -990.16645                     Pseudo R2         =     0.0491

-------------------------------------------------------------------------------
        union | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
        south |
   non-South  |          1  (base)
       South  |   1.386257   .8400857     0.54   0.590     .4226752    4.546534
              |
        grade |   1.155084   .0340266     4.89   0.000     1.090282    1.223738
              |
south#c.grade |
       South  |    .922549   .0414072    -1.80   0.072     .8448596    1.007382
              |
       highoc |
       lower  |          1  (base)
      higher  |   .4300277    .060926    -5.96   0.000     .3257608    .5676675
              |
      married |
      single  |          1  (base)
     married  |   .7222853   .0826679    -2.84   0.004     .5771464    .9039232
              |
        _cons |   .0946328     .03824    -5.83   0.000     .0428628    .2089308
-------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

Someone with 0 years of education is pretty extreme in the US, so what would be the effect of south for someone who finished highschool (12 years of education)? Here is how you could compute that in Stata. You see that the odds of being a union member is a factor .53 smaller in the South (or the odds changes by -47%).

. lincom 1.south + 12*1.south#c.grade, or

 ( 1)  [union]1.south + 12*[union]1.south#c.grade = 0

------------------------------------------------------------------------------
       union | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |   .5268865   .0686721    -4.92   0.000      .408108    .6802351
------------------------------------------------------------------------------

Similarly, we could compute the baseline odds for a single person from the non-South, a lower occupation, and who finished highschool instead of having 0 years of education. We can see that for such a person we expect to find .53 union members for every non-union member.

. lincom _cons + 12*grade, or

 ( 1)  12*[union]grade + [union]_cons = 0

------------------------------------------------------------------------------
       union | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         (1) |    .533833   .0598071    -5.60   0.000     .4285905    .6649184
------------------------------------------------------------------------------

This way of getting more meaningful results is tedious and errors can easily be made. A simple way of avoiding that is to use the centered version of the grade variable. Notice that the log-likelihood, and all coefficients are the same as in the previous model except for the coefficients for the constant and the main effect of grade. Moreover, these two are exactly the same as the coefficients we computed afterwards. So the two models are equivalent, but the one with the centered version of grade is easier to interpret.

. logit union i.south##c.grade_c i.highoc i.married, or base nolog

Logistic regression                             Number of obs     =      1,867
                                                LR chi2(5)        =     102.23
                                                Prob > chi2       =     0.0000
Log likelihood = -990.16645                     Pseudo R2         =     0.0491

---------------------------------------------------------------------------------
          union | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
----------------+----------------------------------------------------------------
          south |
     non-South  |          1  (base)
         South  |   .5268865   .0686721    -4.92   0.000      .408108    .6802351
                |
        grade_c |   1.155084   .0340266     4.89   0.000     1.090282    1.223738
                |
south#c.grade_c |
         South  |    .922549   .0414072    -1.80   0.072     .8448596    1.007382
                |
         highoc |
         lower  |          1  (base)
        higher  |   .4300277    .060926    -5.96   0.000     .3257608    .5676675
                |
        married |
        single  |          1  (base)
       married  |   .7222853   .0826679    -2.84   0.004     .5771464    .9039232
                |
          _cons |    .533833   .0598071    -5.60   0.000     .4285905    .6649184
---------------------------------------------------------------------------------
Note: _cons estimates baseline odds.

Related Solutions

Solved – Should I re-center variables when looking at moderator effect in men and women separately

Centring: Centring does not change the significance of the r-square change of your interaction effect. It also will not change the values you get for a simple slopes analysis.

Thus, for most purposes it does not matter whether you centre or not. This applies both to the general analysis, and to the subgroup analysis.

The main benefit of centring is that it can make the interpretation of the regression coefficients a little easier. If you want to compare these absolute size of these coefficients across males and females, then you should only centre once.

Prefer integrated models: A better suggestion is to include gender in your overall multiple regression. For example, if you have DV, IV1, IV2 and gender and you are interested in the IV1 * IV2 interaction for each gender. I'd examine various models such as:

DV ~ IV1 + IV2 + gender
DV ~ IV1 * IV2 + gender
DV ~ IV1 * IV2 + gender * IV1 + gender*IV2
DV ~ IV1 * IV2 * gender

If you get a significant gender by something interaction, then you may wish to further explore this using separate analyses, but I'd start with the overall integrated model.

Illustrating points about centered predictors:

The following code returns the p-value of the r-square change and the final r-square for both an uncentered and three centred versions (global, female centred, male centred) of an interaction effect model.

library(MASS)
survey <- na.omit(survey)
head(survey)

x <- survey[, c('Sex', 'Wr.Hnd', 'NW.Hnd', 'Pulse')]
names(x) <- c('gender', 'iv1', 'iv2', 'dv')
x$scaled_iv1 <- scale(x$iv1, scale=FALSE)
x$scaled_iv2 <- scale(x$iv2, scale=FALSE)
x$female_scaled_iv1 <- scale(x$iv1, center=mean(x[x$gender == "Female", 'iv1']), scale=FALSE)
    x$female_scaled_iv2 <- scale(x$iv2, center=mean(x[x$gender == "Female", 'iv2']), scale=FALSE)
x$male_scaled_iv1 <- scale(x$iv1, center=mean(x[x$gender == "Male", 'iv1']), scale=FALSE)
    x$male_scaled_iv2 <- scale(x$iv2, center=mean(x[x$gender == "Male", 'iv2']), scale=FALSE)

compare_fits <- function(x) {
    fit1 <- lm(dv ~ iv1+iv2, x)
    fit2 <- lm(dv ~ iv1*iv2, x)
    fit3 <- lm(dv ~ scaled_iv1*scaled_iv2, x)
    fit4 <- lm(dv ~ male_scaled_iv1*male_scaled_iv2, x)
    fit5 <- lm(dv ~ female_scaled_iv1*female_scaled_iv2, x)
    results <- list()
    results$p_normal <-  anova(fit1, fit2)[2,6]
        results$p_centered <- anova(fit1, fit3)[2,6]
    results$p_centered_male <- anova(fit1, fit4)[2,6]
        results$p_centered_female <- anova(fit1, fit5)[2,6]
    results$rsq_normal <- summary(fit2)$r.squared
    results$rsq_centered <- summary(fit3)$r.squared
    results$rsq_centered_male <- summary(fit4)$r.squared
    results$rsq_centered_female <- summary(fit5)$r.squared
    unlist(results)
}

# The following results report p-values and rsq for final model
# using normal (i.e., uncentered) and centered predictors
compare_fits(x)
compare_fits(x[x$gender=='Male', ])
    compare_fits(x[x$gender=='Female', ])

The results show how the values do not vary across uncentered and centered analyses.

> compare_fits(x)
           p_normal          p_centered     p_centered_male   p_centered_female          rsq_normal 
        0.241816265         0.241816265         0.241816265         0.241816265         0.009982317 
       rsq_centered   rsq_centered_male rsq_centered_female 
        0.009982317         0.009982317         0.009982317 
> compare_fits(x[x$gender=='Male', ])
               p_normal          p_centered     p_centered_male   p_centered_female          rsq_normal 
             0.14034102          0.14034102          0.14034102          0.14034102          0.03055692 
           rsq_centered   rsq_centered_male rsq_centered_female 
             0.03055692          0.03055692          0.03055692 
    > compare_fits(x[x$gender=='Female', ])
           p_normal          p_centered     p_centered_male   p_centered_female          rsq_normal 
          0.5196788           0.5196788           0.5196788           0.5196788           0.0128802 
       rsq_centered   rsq_centered_male rsq_centered_female 
          0.0128802           0.0128802           0.0128802

Solved – Centering data in multiple regression

With continuous dependent variables, you can center these too if you want. Just don't forget that your predicted values have had the mean subtracted from them; otherwise, you should be able to interpret the results normally. If you're not sure whether you want to center in a case like this, or want to consider other issues, you might find this question useful: When conducting multiple regression, when should you center your predictor variables & when should you standardize them?

With categorical variables, the mean may not be appropriate to use for centering, and the data may not be appropriate for fitting a multiple regression model with ordinary least squares. When averaging a reasonably large number of Likert scale responses (say, across five or more items) with a reasonably wide set of options (five options might be enough), you might be okay in using the mean, but you should probably check whether your response frequencies for each item seem to be approximating a normal distribution (i.e., not a distribution with strong skew, excess kurtosis, a bimodal shape, etc.). When you average them across your set of items, check again to make sure these scores seems roughly normal.

If they're not, you might need to explore other methods for handling ordinal data in regression. Item response theory models like the rating scale model might be more suitable. You could also try fitting a structural equation model that relates the latent factors represented by your Likert rated items to your dependent variables using a polychoric correlation matrix. You might find my answer to a related question useful for this.

Best Answer

Related Solutions

Solved – Should I re-center variables when looking at moderator effect in men and women separately

Solved – Centering data in multiple regression

Related Question