I am trying to learn which transformations improve a model, and I am comparing two models that I built. The first model is:
Call:
lm(formula = log(medv) ~ log(crim) + zn + log(indus) + chas +
log(nox) + log(rm) + log(age) + log(dis) + log(rad) + log(tax) +
log(ptratio) + log(black) + log(lstat), data = Boston)
Residuals:
Min 1Q Median 3Q Max
-0.95001 -0.10118 -0.00198 0.10961 0.82680
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.3504375 0.4336744 12.337 < 2e-16 ***
log(crim) -0.0314413 0.0111790 -2.813 0.005112 **
zn -0.0011481 0.0005828 -1.970 0.049410 *
log(indus) 0.0037637 0.0224508 0.168 0.866935
chas 0.1011952 0.0362298 2.793 0.005423 **
log(nox) -0.3659159 0.1074552 -3.405 0.000715 ***
log(rm) 0.3843709 0.1094673 3.511 0.000487 ***
log(age) 0.0410625 0.0223547 1.837 0.066833 .
log(dis) -0.1438053 0.0356083 -4.039 6.24e-05 ***
log(rad) 0.0949062 0.0220954 4.295 2.10e-05 ***
log(tax) -0.1759806 0.0477668 -3.684 0.000255 ***
log(ptratio) -0.5895440 0.0912645 -6.460 2.52e-10 ***
log(black) 0.0532854 0.0126549 4.211 3.03e-05 ***
log(lstat) -0.4186032 0.0258019 -16.224 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1988 on 492 degrees of freedom
Multiple R-squared: 0.7697, Adjusted R-squared: 0.7636
F-statistic: 126.5 on 13 and 492 DF, p-value: < 2.2e-16
The second model is:
Call:
lm(formula = medv ~ log(crim) + zn + log(indus) + chas + log(nox) +
log(rm) + log(age) + log(dis) + log(rad) + log(tax) + log(ptratio) +
log(black) + log(lstat), data = Boston)
Residuals:
Min 1Q Median 3Q Max
-13.3551 -2.5733 -0.2924 2.0704 22.8158
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.449e+01 9.307e+00 8.004 8.74e-15 ***
log(crim) 7.002e-02 2.399e-01 0.292 0.770524
zn -1.257e-04 1.251e-02 -0.010 0.991983
log(indus) -8.557e-01 4.818e-01 -1.776 0.076366 .
chas 2.480e+00 7.775e-01 3.190 0.001514 **
log(nox) -1.160e+01 2.306e+00 -5.030 6.90e-07 ***
log(rm) 1.374e+01 2.349e+00 5.850 8.98e-09 ***
log(age) 8.034e-01 4.798e-01 1.675 0.094658 .
log(dis) -6.327e+00 7.642e-01 -8.280 1.17e-15 ***
log(rad) 1.972e+00 4.742e-01 4.158 3.78e-05 ***
log(tax) -4.277e+00 1.025e+00 -4.172 3.57e-05 ***
log(ptratio) -1.357e+01 1.959e+00 -6.927 1.35e-11 ***
log(black) 1.005e+00 2.716e-01 3.701 0.000239 ***
log(lstat) -9.654e+00 5.537e-01 -17.433 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.266 on 492 degrees of freedom
Multiple R-squared: 0.7904, Adjusted R-squared: 0.7849
F-statistic: 142.7 on 13 and 492 DF, p-value: < 2.2e-16
The only difference between the models is the log transformation of the dependent variable. When I compare them, I see that the residual standard error is much higher in the second model, but the R-squared is also higher in the second model. I do not understand which model is better. Is the large reduction in residual standard error in the first model due to the log transformation of the dependent variable, or not?
Best Answer
@Tim is right that the log transformation changes the residual standard error and that comparing this statistic across the two models is meaningless. Why is this so? Consider a much simpler case: suppose the DV is income (in dollars) and a model predicts Joe's income to be \$100,000 when his real income is \$90,000. The error is \$10,000. Take the log (base 10) and you get a predicted value (even if everything else stays the same) of 5, an actual value of about 4.95, and an error of about 0.05. (This isn't exactly what's going on, but I think it gives you a feel for why things change.)
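The arithmetic of that toy example is easy to check directly. A quick sketch (in Python for convenience; the \$100,000 and \$90,000 figures are just the hypothetical values above):

```python
import math

# Hypothetical prediction vs. actual income from the example above
predicted, actual = 100_000, 90_000

error_dollars = predicted - actual                        # error on the raw scale
error_log10 = math.log10(predicted) - math.log10(actual)  # error on the log scale

print(error_dollars)          # 10000
print(round(error_log10, 3))  # 0.046
```

A \$10,000 error on the raw scale shrinks to roughly 0.05 on the log scale, so the two models' residual standard errors live in entirely different units and cannot be compared directly.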
Whether you should transform your DV should depend on substantive reasons more than statistical ones. You didn't say what medv and the other variables are, but it looks like medv is a median value, and the model is predicting the cost of a house or something like that.
When the DV is a dollar amount, taking logs often makes sense because we tend to think of these amounts on a multiplicative scale. That is, the difference between a \$100,000 house and a \$200,000 house is huge, while the difference between a \$1,000,000 house and a \$1,100,000 house is much smaller.
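If you still want to compare the two models' predictive accuracy head to head, put their predictions on the same scale first: back-transform the log model's fitted values to dollars, then compute the errors of both models in dollars. A minimal sketch with synthetic data (the simulated prices and coefficients are illustrative, not the Boston fit; a naive `exp()` back-transform is used, which ignores retransformation bias — a smearing correction would address that):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 500
x = rng.uniform(1.0, 10.0, n)
# Simulated prices with multiplicative (lognormal) noise
y = np.exp(1.0 + 0.3 * np.log(x) + rng.normal(0.0, 0.2, n))

X = np.column_stack([np.ones(n), np.log(x)])

# Model 1: log(y) ~ log(x); back-transform fitted values to the dollar scale
b_log, *_ = np.linalg.lstsq(X, np.log(y), rcond=None)
pred_log_model = np.exp(X @ b_log)

# Model 2: y ~ log(x), fitted directly on the dollar scale
b_raw, *_ = np.linalg.lstsq(X, y, rcond=None)
pred_raw_model = X @ b_raw

def rmse(pred):
    return float(np.sqrt(np.mean((y - pred) ** 2)))

# Both RMSEs are now in the same (dollar) units and directly comparable,
# unlike the residual standard errors printed by the two summaries above.
print(rmse(pred_log_model), rmse(pred_raw_model))
```

The same idea carries over to the R models in the question: `exp(fitted(model1))` (ideally with a bias correction) versus `fitted(model2)`, with errors measured against `Boston$medv` in both cases.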