Solved – Regression RMSE when dependent variable is log transformed

back-transformation, data transformation, machine learning, regression

I want to predict the duration a trip would take. To do this, I log-transformed my dependent variable (trip time in seconds).

When I run a regression on this variable along with some other features, I get this:

The score on held out data is: 0.08395386395024673
 Hyper-Parameters for Best Score : {'l1_ratio': 0.15, 'alpha': 0.01}

The R2 Score of sgd_regressor on test data is: 0.0864573982691922

The mse of sgd_regressor on test data is: 0.5503753581
The mean absolute error of sgd_regressor on test data is: 0.566328128068

This is the code which does the above calculation:

from sklearn.metrics import mean_squared_error, mean_absolute_error

    # Evaluate the best estimator from the grid search on the held-out test set.
    print("The R2 Score of " + name + " on test data is: {}\n".format(
        self.g_cv.best_estimator_.score(self.test_X, self.test_Y)))

    print("The mse of " + name + " on test data is:",
          mean_squared_error(self.test_Y, self.g_cv.best_estimator_.predict(self.test_X)))

    print("The mean absolute error of " + name + " on test data is:",
          mean_absolute_error(self.test_Y, self.g_cv.best_estimator_.predict(self.test_X)))

The problem is that the R², as you can see, is very bad: 0.08. Yet the RMSE and mean absolute error seem very low. If I look at the mean absolute error, it's just 0.56 sec, which would mean that on average my predicted time is only about half a second off from the true time.

Something doesn't look right. Do I need to convert the predicted and original time variables back to the linear scale from the log scale before I calculate the above metrics (RMSE and MAE)?

Best Answer

Once you take logs, your response is not in seconds. In effect it's unit free.

When you calculate mean absolute error on the log scale, it, too, is not a measurement in seconds.

It's (roughly-speaking) telling you something about the typical size of percentage error on the original scale.

An MAE(-of-the-logs) of 0.01 would tell you that typically your original values deviate by about 1% from the geometric mean.

Let $z_i=\log(y_i)$. Then an MAE of 0.01 in the logs means that $\frac{1}{n}\sum_i |z_i-\bar{z}|=0.01$. Now on the original scale $\exp(\bar{z})$ is the geometric mean of the $y$-values, $\text{GM}(y)$.

Now consider observations sitting as far away from the mean as the MAE: $z_i=\bar{z}+ 0.01$ and $z_j = \bar{z}- 0.01$. Then

$y_i=\exp(z_i) = \exp(\bar{z}) \times \exp(0.01) = 1.01005\, \text{GM}(y)\approx 1.01\, \text{GM}(y)$

or about 1% above the geometric mean. Similarly

$y_j=\exp(z_j) = \exp(\bar{z}) \times \exp(-0.01) = 0.99005\, \text{GM}(y) \approx 0.99\, \text{GM}(y)$

or about 1% below the geometric mean.

Similarly an MAE (log scale) of 0.10 would tell you that typically your original values deviate by about 10.5% from the geometric mean. As you move further away (as MAE gets bigger) this convenient approximate-percentage relationship changes.
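To make that last point concrete, here is a quick numerical check (a sketch added here, not part of the original answer): it prints the percentage deviation from the geometric mean implied by a few log-scale MAE values, including one close to the 0.566 reported in the question.

    import numpy as np

    # Percentage deviation from the geometric mean implied by a log-scale MAE:
    # exp(mae) - 1 above it, 1 - exp(-mae) below it.
    for mae_log in [0.01, 0.10, 0.56]:
        pct_above = (np.exp(mae_log) - 1) * 100
        pct_below = (1 - np.exp(-mae_log)) * 100
        print("MAE (log scale) {:.2f} -> about +{:.1f}% / -{:.1f}%".format(
            mae_log, pct_above, pct_below))

For 0.01 both directions round to 1%, which is the convenient approximation; for a value as large as 0.56 they diverge to roughly +75% and -43%, so the simple "percentage error" reading no longer holds.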

There's nothing wrong with calculating an MAE on the log scale as long as you don't misinterpret what it is. If you want an MAE on the original scale, you'd need to compute it on that scale (though the fact that you're modelling the logs suggests it may not actually be especially useful on the original scale).
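If you do want errors in seconds, a minimal sketch along these lines would back-transform predictions and actuals before computing the metrics (this assumes a natural-log transform was used; the variable names and toy values below are illustrative, not taken from the question's code):

    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error

    # Illustrative data: true and predicted trip times, both on the log scale.
    y_test_log = np.log([300.0, 620.0, 1500.0])
    y_pred_log = np.log([280.0, 700.0, 1400.0])

    # Back-transform to seconds, then compute the errors on that scale.
    y_test_sec = np.exp(y_test_log)
    y_pred_sec = np.exp(y_pred_log)

    rmse_sec = np.sqrt(mean_squared_error(y_test_sec, y_pred_sec))
    mae_sec = mean_absolute_error(y_test_sec, y_pred_sec)
    print("RMSE in seconds:", rmse_sec)
    print("MAE in seconds:", mae_sec)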
