Solved – Regression RMSE when dependent variable is log transformed

back-transformation, data transformation, machine learning, regression

I want to predict the duration a trip would take. To do this, I log-transformed my dependent variable (trip time in seconds).

When I run a regression on this variable along with some other features, I get this:

The score on held out data is: 0.08395386395024673
 Hyper-Parameters for Best Score : {'l1_ratio': 0.15, 'alpha': 0.01}

The R2 Score of sgd_regressor on test data is: 0.0864573982691922

The mse of sgd_regressor on test data is: 0.5503753581
The mean absolute error of sgd_regressor on test data is: 0.566328128068

This is the code which does the above calculation:

from sklearn.metrics import mean_squared_error, mean_absolute_error

    # Evaluate the best estimator from the grid search on the held-out test set.
    print("The R2 Score of " + name + " on test data is: {}\n".format(
        self.g_cv.best_estimator_.score(self.test_X, self.test_Y)))

    print("The mse of " + name + " on test data is:",
          mean_squared_error(self.test_Y, self.g_cv.best_estimator_.predict(self.test_X)))

    print("The mean absolute error of " + name + " on test data is:",
          mean_absolute_error(self.test_Y, self.g_cv.best_estimator_.predict(self.test_X)))

The problem is that the R², as you can see, is very bad: 0.08. Yet the RMSE and mean absolute error seem very low. If I look at the mean absolute error, it's just 0.56 sec, which would mean that on average my predicted time is only about half a second off from the true time.

Something doesn't look right. Do I need to convert the predicted and original time variables back to the linear scale from the log scale before I calculate the above metrics (RMSE and MAE)?

Best Answer

Once you take logs, your response is not in seconds. In effect it's unit free.

When you calculate mean absolute error on the log scale, it, too, is not a measurement in seconds.

It's (roughly-speaking) telling you something about the typical size of percentage error on the original scale.

An MAE(-of-the-logs) of 0.01 would tell you that typically your original values deviate by about 1% from the geometric mean.

Let $z_i=\log(y_i)$. Then an MAE of 0.01 in the logs means that $\frac{1}{n}\sum_i |z_i-\bar{z}|=0.01$. Now on the original scale $\exp(\bar{z})$ is the geometric mean of the $y$-values, $\text{GM}(y)$.

Now consider observations sitting as far away from the mean as the MAE: $z_i=\bar{z}+ 0.01$ and $z_j = \bar{z}- 0.01$. Then

$y_i=\exp(z_i) = \exp(\bar{z}) \times \exp(0.01) = 1.01005\, \text{GM}(y)\approx 1.01\, \text{GM}(y)$

or about 1% above the geometric mean. Similarly

$y_j=\exp(z_j) = \exp(\bar{z}) \times \exp(-0.01) = 0.99005\, \text{GM}(y) \approx 0.99\, \text{GM}(y)$

or about 1% below the geometric mean.

Similarly an MAE (log scale) of 0.10 would tell you that typically your original values deviate by about 10.5% from the geometric mean. As you move further away (as MAE gets bigger) this convenient approximate-percentage relationship changes.
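To make that last point concrete, here is a quick numerical check (a sketch added here, not part of the original answer): it prints the percentage deviation from the geometric mean implied by a few log-scale MAE values, including one close to the 0.566 reported in the question.

    import numpy as np

    # Percentage deviation from the geometric mean implied by a log-scale MAE:
    # exp(mae) - 1 above it, 1 - exp(-mae) below it.
    for mae_log in [0.01, 0.10, 0.56]:
        pct_above = (np.exp(mae_log) - 1) * 100
        pct_below = (1 - np.exp(-mae_log)) * 100
        print("MAE (log scale) {:.2f} -> about +{:.1f}% / -{:.1f}%".format(
            mae_log, pct_above, pct_below))

For 0.01 both directions round to 1%, which is the convenient approximation; for a value as large as 0.56 they diverge to roughly +75% and -43%, so the simple "percentage error" reading no longer holds.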

There's nothing wrong with calculating an MAE on the log scale as long as you don't misinterpret what it is. If you want an MAE on the original scale, you'd need to compute it on that scale (though the fact that you're modelling the logs suggests it may not actually be especially useful on the original scale).
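If you do want errors in seconds, a minimal sketch along these lines would back-transform predictions and actuals before computing the metrics (this assumes a natural-log transform was used; the variable names and toy values below are illustrative, not taken from the question's code):

    import numpy as np
    from sklearn.metrics import mean_squared_error, mean_absolute_error

    # Illustrative data: true and predicted trip times, both on the log scale.
    y_test_log = np.log([300.0, 620.0, 1500.0])
    y_pred_log = np.log([280.0, 700.0, 1400.0])

    # Back-transform to seconds, then compute the errors on that scale.
    y_test_sec = np.exp(y_test_log)
    y_pred_sec = np.exp(y_pred_log)

    rmse_sec = np.sqrt(mean_squared_error(y_test_sec, y_pred_sec))
    mae_sec = mean_absolute_error(y_test_sec, y_pred_sec)
    print("RMSE in seconds:", rmse_sec)
    print("MAE in seconds:", mae_sec)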
