R Multiple Regression – Using the scale() Function for Standardized Coefficients in Linear Models

Tags: multiple-regression, r, standardization

I have one question regarding standardized coefficients (beta) in linear models. I have already asked a related question here. From the answers, I gather that I should use R's scale() function on the dependent variable as well as on all independent variables (IVs) to estimate the standardized coefficients for the model. But when I use the scale() function on an IV that belongs to the factor class, I get the following error message:

Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

To illustrate my problem here is a MWE:

First the linear model with unstandardized coefficients:

> data(ChickWeight)
> aa <- lm(weight ~ Time + Diet, data=ChickWeight)
> summary(aa)

Call: 
lm(formula = weight ~ Time + Diet, data = ChickWeight)

Residuals:
     Min       1Q   Median       3Q      Max 
-136.851  -17.151   -2.595   15.033  141.816 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  10.9244     3.3607   3.251  0.00122 ** 
Time          8.7505     0.2218  39.451  < 2e-16 ***
Diet2        16.1661     4.0858   3.957 8.56e-05 ***
Diet3        36.4994     4.0858   8.933  < 2e-16 ***
Diet4        30.2335     4.1075   7.361 6.39e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 35.99 on 573 degrees of freedom
Multiple R-squared:  0.7453,    Adjusted R-squared:  0.7435 
F-statistic: 419.2 on 4 and 573 DF,  p-value: < 2.2e-16

Now I want to estimate the standardized coefficients using the scale() function, which results in the following error message:

> bb <- lm(scale(weight) ~ scale(Time) + scale(Diet), data=ChickWeight)
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
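
A quick way to see where the error comes from is to check the class of Diet directly (a small check on the same ChickWeight data):

> class(ChickWeight$Diet)
[1] "factor"
> # scale(ChickWeight$Diet)  # would raise the same "'x' must be numeric" error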

As I figured out (see the quick check above), the error message appears because Diet belongs to the factor class and is not a numeric variable, as required by the scale() function. As an alternative, I tried including the Diet variable without scale():

> cc <- lm(scale(weight) ~ scale(Time) + Diet, data=ChickWeight)
> summary(cc)

Call:
lm(formula = scale(weight) ~ scale(Time) + Diet, data = ChickWeight)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.92552 -0.24132 -0.03652  0.21151  1.99538 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -0.24069    0.03415  -7.048 5.25e-12 ***
scale(Time)  0.83210    0.02109  39.451  < 2e-16 ***
Diet2        0.22746    0.05749   3.957 8.56e-05 ***
Diet3        0.51356    0.05749   8.933  < 2e-16 ***
Diet4        0.42539    0.05779   7.361 6.39e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5064 on 573 degrees of freedom
Multiple R-squared:  0.7453,    Adjusted R-squared:  0.7435 
F-statistic: 419.2 on 4 and 573 DF,  p-value: < 2.2e-16

My question now is whether this is the right way to estimate the standardized coefficients for a model with both numeric and factor variables.

Thank you very much in advance for an answer.

Regards,

Magnus

Best Answer

I'm not sure that standardized coefficients make much sense when you have dummy variables.

The idea of a standardized coefficient is that it puts the units of the predictor variable into a form we understand. I have a sense of what a standard deviation is, whereas if I don't know what Time is (or whether it's measured in seconds, minutes, hours, days, or weeks), I can't interpret the units.

If you've got a factor, your measures are dummy coded: you either have Diet 2 or you don't (relative to the reference level, Diet 1). I already know what that scale means. Andrew Gelman suggests that instead of dividing by one SD, we divide by 2 SDs; this makes the coefficient of a continuous variable roughly comparable to the coefficient of a dummy-coded variable. Paper here: http://www.stat.columbia.edu/~gelman/research/published/standardizing7.pdf , blog entry here: http://andrewgelman.com/2006/06/21/standardizing_r/
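
For what it's worth, here is a minimal sketch of that 2-SD rescaling done by hand for the continuous predictor (the arm package also has a standardize() function along these lines, if I remember correctly); the Time2sd name is just something I made up for illustration:

# Rescale Time by centering and dividing by 2 SDs (Gelman's suggestion)
ChickWeight$Time2sd <- (ChickWeight$Time - mean(ChickWeight$Time)) / (2 * sd(ChickWeight$Time))
dd <- lm(weight ~ Time2sd + Diet, data = ChickWeight)
summary(dd)  # the Time2sd coefficient is now roughly comparable to the Diet dummies

The Diet dummies stay on their natural 0/1 scale, which is the whole point of the 2-SD rescaling.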

Anyway, what you did is not quite the "right" way, because you don't want standardized coefficients for dummy (factor) variables. But as long as you describe them appropriately, it's OK.

If you really want to, you can standardize the dummy variables before you run the analysis and get fully standardized coefficients. These will be rather hard to interpret, though:

# Standardize each dummy indicator by hand, then fit the fully standardized model
ChickWeight$d2 <- scale(ChickWeight$Diet == 2)
ChickWeight$d3 <- scale(ChickWeight$Diet == 3)
ChickWeight$d4 <- scale(ChickWeight$Diet == 4)
bb <- lm(scale(weight) ~ scale(Time) + d2 + d3 + d4, data=ChickWeight)
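
As a quick sanity check (a sketch, reusing the aa model from the question): a standardized slope for a numeric predictor is just the raw slope rescaled by sd(x)/sd(y), so you can recover the scale(Time) coefficient without refitting:

aa <- lm(weight ~ Time + Diet, data = ChickWeight)
coef(aa)["Time"] * sd(ChickWeight$Time) / sd(ChickWeight$weight)
# should be about 0.832, matching the scale(Time) estimate shown above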