Huge difference in regression standard error after log transformation of dependent variable

Tags: data transformation, multiple regression, r, regression, residuals

I am trying to learn which transformations improve a model, so I am comparing models that I have built. The first model is:

Call:
lm(formula = log(medv) ~ log(crim) + zn + log(indus) + chas + 
log(nox) + log(rm) + log(age) + log(dis) + log(rad) + log(tax) + 
log(ptratio) + log(black) + log(lstat), data = Boston)

Residuals:
Min       1Q   Median       3Q      Max 
-0.95001 -0.10118 -0.00198  0.10961  0.82680 

Coefficients:
           Estimate Std. Error t value Pr(>|t|)    
(Intercept)   5.3504375  0.4336744  12.337  < 2e-16 ***
log(crim)    -0.0314413  0.0111790  -2.813 0.005112 ** 
zn           -0.0011481  0.0005828  -1.970 0.049410 *  
log(indus)    0.0037637  0.0224508   0.168 0.866935    
chas          0.1011952  0.0362298   2.793 0.005423 ** 
log(nox)     -0.3659159  0.1074552  -3.405 0.000715 ***
log(rm)       0.3843709  0.1094673   3.511 0.000487 ***
log(age)      0.0410625  0.0223547   1.837 0.066833 .  
log(dis)     -0.1438053  0.0356083  -4.039 6.24e-05 ***
log(rad)      0.0949062  0.0220954   4.295 2.10e-05 ***
log(tax)     -0.1759806  0.0477668  -3.684 0.000255 ***
log(ptratio) -0.5895440  0.0912645  -6.460 2.52e-10 ***
log(black)    0.0532854  0.0126549   4.211 3.03e-05 ***
log(lstat)   -0.4186032  0.0258019 -16.224  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1988 on 492 degrees of freedom
Multiple R-squared:  0.7697,    Adjusted R-squared:  0.7636 
F-statistic: 126.5 on 13 and 492 DF,  p-value: < 2.2e-16

The second model is:

Call:
lm(formula = medv ~ log(crim) + zn + log(indus) + chas + log(nox) + 
log(rm) + log(age) + log(dis) + log(rad) + log(tax) + log(ptratio) + 
log(black) + log(lstat), data = Boston)

Residuals:
Min       1Q   Median       3Q      Max 
-13.3551  -2.5733  -0.2924   2.0704  22.8158 

Coefficients:
           Estimate Std. Error t value Pr(>|t|)    
(Intercept)   7.449e+01  9.307e+00   8.004 8.74e-15 ***
log(crim)     7.002e-02  2.399e-01   0.292 0.770524    
zn           -1.257e-04  1.251e-02  -0.010 0.991983    
log(indus)   -8.557e-01  4.818e-01  -1.776 0.076366 .  
chas          2.480e+00  7.775e-01   3.190 0.001514 ** 
log(nox)     -1.160e+01  2.306e+00  -5.030 6.90e-07 ***
log(rm)       1.374e+01  2.349e+00   5.850 8.98e-09 ***
log(age)      8.034e-01  4.798e-01   1.675 0.094658 .  
log(dis)     -6.327e+00  7.642e-01  -8.280 1.17e-15 ***
log(rad)      1.972e+00  4.742e-01   4.158 3.78e-05 ***
log(tax)     -4.277e+00  1.025e+00  -4.172 3.57e-05 ***
log(ptratio) -1.357e+01  1.959e+00  -6.927 1.35e-11 ***
log(black)    1.005e+00  2.716e-01   3.701 0.000239 ***
log(lstat)   -9.654e+00  5.537e-01 -17.433  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.266 on 492 degrees of freedom
Multiple R-squared:  0.7904,    Adjusted R-squared:  0.7849 
F-statistic: 142.7 on 13 and 492 DF,  p-value: < 2.2e-16

The only difference between the models is the log transformation of the dependent variable. When I compare them, I see that the residual standard error is much higher in the second model, but R-squared is also higher in the second model. I do not understand which model is better. Is the large reduction in the residual standard error due to the log transformation of the dependent variable?

Best Answer

@Tim is right that the log transformation changes the residual standard error and that comparing the two residual standard errors is meaningless: they are measured in different units (log units in the first model, the original units of medv in the second). Why is this so? Consider a much simpler case. Suppose the DV is income (in dollars) and a model predicts Joe's income to be \$100,000 when his real income is \$90,000. The error is \$10,000. Take logs (base 10) and, even if everything else stays the same, you get a predicted value of 5, an actual value of about 4.95, and an error of about 0.05. (This isn't exactly what's going on, but I think it gives you a feel for why things change.)
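To make the arithmetic concrete, here is a minimal R sketch of the Joe example. The variable names are mine, not from the question. The last line re-expresses the first model's residual standard error (0.1988, on the natural-log scale, since R's log() is the natural log) as a multiplicative factor, which is only a rough way to read it.

```r
# Joe's income example: the same prediction error on two scales.
predicted <- 100000   # predicted income in dollars
actual    <- 90000    # actual income in dollars

err_dollars <- predicted - actual                 # error in dollars
err_log10   <- log10(predicted) - log10(actual)   # error in log10 units

err_dollars
err_log10

# An error on the log scale is the log of a ratio, so undoing the log
# gives the ratio predicted / actual rather than a dollar amount.
ratio <- 10^err_log10
ratio

# Rough reading of the first model's residual standard error (0.1988 on
# the natural-log scale) as a multiplicative factor on medv.
rse_factor <- exp(0.1988)
rse_factor
```

The dollar error and the log error describe the same miss in different units, which is why 4.266 and 0.1988 cannot be compared directly.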

Whether you should transform your DV should depend on substantive reasons more than statistical ones. You didn't say what MEDV and the other variables are, but it looks like MEDV is a median value, presumably the median price of a house or something like that.

When the DV is a dollar amount, taking logs often makes sense because we often think of these amounts on a multiplicative scale. That is, the difference between a \$100,000 house and a \$200,000 house is huge (the price doubles). The difference between a \$1,000,000 house and a \$1,100,000 house is much smaller (a 10% increase), even though both differences are \$100,000 in absolute terms.
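A small R sketch of the house-price comparison above, with illustrative variable names of my own: on the raw scale the two gaps are identical, while on the log scale they reflect the ratio of the prices.

```r
# Two pairs of house prices with the same absolute gap.
cheap_pair     <- c(100000, 200000)
expensive_pair <- c(1000000, 1100000)

# Raw scale: both differences are 100000 dollars.
gap_cheap     <- diff(cheap_pair)
gap_expensive <- diff(expensive_pair)
gap_cheap
gap_expensive

# Log scale: a difference of logs is the log of the price ratio,
# so doubling (ratio 2) is far larger than a 10% increase (ratio 1.1).
log_gap_cheap     <- diff(log(cheap_pair))       # equals log(2)
log_gap_expensive <- diff(log(expensive_pair))   # equals log(1.1)
log_gap_cheap
log_gap_expensive
```

If that multiplicative view matches how you think about house values, it is a substantive argument for modelling log(medv).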