I have on question regarding standardized coefficients (beta) in linear models. I have already asked one question here. From the answers I assume that I should use R's scale()
function on the dependent variable as well as on all independent variables (IV), to estimate the standardized coefficients for the model. But when I used the scale()
function on an IV, which belongs to the factor class I get following error message:
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
To illustrate my problem here is a MWE:
First the linear model with unstandardized coefficients:
> data(ChickWeight)
> aa <- lm(weight ~ Time + Diet, data=ChickWeight)
> summary(aa)
Call:
lm(formula = weight ~ Time + Diet, data = ChickWeight)
Residuals:
Min 1Q Median 3Q Max
-136.851 -17.151 -2.595 15.033 141.816
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 10.9244 3.3607 3.251 0.00122 **
Time 8.7505 0.2218 39.451 < 2e-16 ***
Diet2 16.1661 4.0858 3.957 8.56e-05 ***
Diet3 36.4994 4.0858 8.933 < 2e-16 ***
Diet4 30.2335 4.1075 7.361 6.39e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 35.99 on 573 degrees of freedom
Multiple R-squared: 0.7453, Adjusted R-squared: 0.7435
F-statistic: 419.2 on 4 and 573 DF, p-value: < 2.2e-16
Now I want to estimate the standardized coefficients using the scale
function, which results in following error message:
> bb <- lm(scale(weight) ~ scale(Time) + scale(Diet), data=ChickWeight)
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
As I figured out by myself the error message appears, because Diet
belongs to the factor class and is not a numeric variable as required from the scale()
function. I tried the following alternatively by including the Diet
variable without scale()
:
> cc <- lm(scale(weight) ~ scale(Time) + Diet, data=ChickWeight)
> summary(cc)
Call:
lm(formula = scale(weight) ~ scale(Time) + Diet, data = ChickWeight)
Residuals:
Min 1Q Median 3Q Max
-1.92552 -0.24132 -0.03652 0.21151 1.99538
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.24069 0.03415 -7.048 5.25e-12 ***
scale(Time) 0.83210 0.02109 39.451 < 2e-16 ***
Diet2 0.22746 0.05749 3.957 8.56e-05 ***
Diet3 0.51356 0.05749 8.933 < 2e-16 ***
Diet4 0.42539 0.05779 7.361 6.39e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5064 on 573 degrees of freedom
Multiple R-squared: 0.7453, Adjusted R-squared: 0.7435
F-statistic: 419.2 on 4 and 573 DF, p-value: < 2.2e-16
My question now is, if this is the right way to estimate the standardized coefficients for a model with both numeric and factor variables?
Thank you very much in advance for an answer.
Regards,
Magnus
Best Answer
I'm not sure that standardized coefficients make much sense when you have dummy variables.
The idea of a standardized coefficient is that it puts the units of the predictor variable into a form that we understand. I have a sense of what a standard deviation is, whereas if I don't know what time is (or know if its measured in seconds, minutes, hours, days, weeks), I can't interpret the units.
If you've got a factor, your measures are dummy coded - you have diet2, or you have diet1. I know what that scale means. Andrew Gelman suggest that instead of dividing by the SD, we divide by 2 SDs, and this makes the effect of a continuous variable comparable to the effect of a dummy coded variable. Paper here: http://www.stat.columbia.edu/~gelman/research/published/standardizing7.pdf , blog entry here: http://andrewgelman.com/2006/06/21/standardizing_r/
Anyway, what you do is not quite the right way, because you don't want standardized coefficients for dummy (factor) variables. But as long as you describe them appropriately, it's OK.
If you really want to, you can standardized the variables before you do the analysis, and get truly standardized coefficients. These will be kind of meaningless though: