Solved – non-linear regression: Residual Plots and RMSE on raw and log target variable

Tags: boosting, nonlinear regression, regression, residuals

I have rental prices of houses, and the distribution is somewhat right-skewed (skewness of 1.84 by scipy.stats.skew).

Fig. 1: Distribution of the rental prices.
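
For reference, a minimal sketch of that skewness check, using simulated log-normal data as a stand-in for the actual prices (which are not shown here):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Simulated stand-in for the rental prices: log-normal, hence right-skewed.
prices = rng.lognormal(mean=7.0, sigma=0.5, size=1000)

print(skew(prices))          # clearly positive, as for the real prices (1.84)
print(skew(np.log(prices)))  # close to 0 after the log transform
```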

I run XGBRegressor on the non-transformed data (1.) and on the log-transformed target variable (2.). In 2. I also exponentiate the target variable and predictions (bringing them back to the raw scale) to compare the residual plots; a code sketch of this comparison follows Fig. 4.

  1. I run XGBRegressor on the non-transformed data and get a test RMSE of 177, with the following residual plot and predicted vs. real prices. The plot shows that the data suffer from heteroscedasticity.

Fig. 2: Residuals and predicted vs. real prices (raw target).

  2. If I run XGBRegressor on the log-transformed data, I get a test RMSE of 180.5 (after exponentiating back), with the following residual plot and predicted vs. real prices (on the log scale).

Fig. 3: Residuals and predicted vs. real prices (log target).
If I exponentiate the log target variable and the predictions from 2., I get the following plots. They look almost the same as the plots from 1. and show the same problem with heteroscedasticity. So I did not get any improvement.

Fig. 4: Residuals and predicted vs. real prices (log target, exponentiated back to the raw scale).
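
A minimal sketch of the comparison in 1. and 2., with simulated data standing in for the real features and prices (the actual data and XGBoost settings are not given above):

```python
import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))                            # stand-in features
y = np.exp(7 + 0.4 * X[:, 0] + rng.normal(0, 0.3, 2000))  # skewed "prices"
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. Fit on the raw target.
raw_model = XGBRegressor().fit(X_train, y_train)
rmse_raw = np.sqrt(mean_squared_error(y_test, raw_model.predict(X_test)))

# 2. Fit on the log target, then exponentiate the predictions back.
log_model = XGBRegressor().fit(X_train, np.log(y_train))
pred_back = np.exp(log_model.predict(X_test))
rmse_back = np.sqrt(mean_squared_error(y_test, pred_back))

# Residuals for the diagnostic plots, on each scale.
resid_raw = y_test - raw_model.predict(X_test)
resid_log = np.log(y_test) - log_model.predict(X_test)
print(rmse_raw, rmse_back)
```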

My questions are:

  1. From what I see, the log transformation resolved the skewness but did not improve the RMSE (I guess because I use a non-linear algorithm, so the algorithm is not affected by the skewness). That would mean I should not log-transform the target and should not pay attention to the residual plots. Is that correct?

  2. Should I be bothered by the skewness when I use non-linear regressors, and should I do the log transformation? If not, but I still see an improvement in metrics (RMSE) on the log data, am I allowed to say 'I logged it because it improved my metrics' (which sounds like very weak argumentation)? What is the explanation for this behaviour? Is it because of the convergence of the optimisation algorithm?

  3. Also, is a residual plot still a good diagnostic when I use a non-linear regressor?

  4. When I use a linear regressor and my data are skewed, but the RMSE on the non-transformed data is lower than on the log-transformed data, what should I do and what does it mean (or why does it happen)?

Best Answer

Although your question is based on a boosted regression tree algorithm (XGBRegressor, with which I have no experience), your issues are ones also faced in standard linear regression, on which I base much of my answer. How to proceed depends on what you are trying to accomplish. To start, a few issues need to be clarified.

First, a regression model is typically considered linear if it is linear in the parameters. In this terminology, non-linear transformations of predictor variables or outcome/target variables do not by themselves make a model non-linear. As you are evidently using boosted regression trees instead of a classical linear regression, it's not clear that the linear/non-linear distinction really applies, anyway.

Second, there is no need for the target variable to have a symmetric distribution in standard linear regression, although that can help with regression trees. As your appropriate focus on the residual plots indicates, the distribution of residuals is important.

Third, RMSE of your entire data set (which is what you seem to be showing) might not be the best measure of the quality of your model. Particularly if you intend to use the model for predictions on new cases, cross-validation or bootstrapping could provide much better estimates of such future performance.
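
For example, a cross-validated RMSE estimate might look like the following sketch (assuming scikit-learn, with the same simulated stand-in data as above):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))                            # stand-in features
y = np.exp(7 + 0.4 * X[:, 0] + rng.normal(0, 0.3, 2000))  # skewed "prices"

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(XGBRegressor(), X, y,
                         scoring="neg_root_mean_squared_error", cv=cv)
print(-scores.mean())  # cross-validated RMSE, a better guide to future error
```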

Now to your questions:

  1. The regressions for the linear and log-transformed prices attempt different things: the first tries to minimize the error in absolute terms (e.g., dollars); the second tries to minimize the error in relative/fractional terms (e.g., percentage error in predicted price). Which type of error do you care about for your application? If you care more about fractional errors, you should use the log transform, and you shouldn't worry that the error in absolute terms appears bigger when you back-transform from the log scale. You should always, however, pay attention to residuals in whichever scale you choose.

  2. Skewness in the target variable is not by itself an issue in standard linear regression, provided that some predictor variables are appropriately skewed as well so that the residuals are not skewed. Transformation of both predictor and target variables is often needed to meet the assumptions of a linear regression and to produce well-behaved residuals. With tree-based regression approaches that use mean values of target variables to choose cutoffs for the trees, removing skewness in the target variable can be recommended; the authors of ISLR do a log transform for this purpose in their example of a regression tree (pp. 304 and following). Log transformation means that residuals will be in fractional rather than absolute terms, which seems to make sense for these types of data.

  3. You should always pay attention to residual plots.

  4. Comparing the RMSE of the regression for non-transformed prices against the exponentiated RMSE of the regression for log-transformed prices isn't always very useful. In your log-transformed analysis, the case with the largest absolute error on the log (fractional) scale is also the lowest in absolute predicted price. That case will make a large contribution to the RMSE on the log-transformed scale (as will several other low-price cases), but perhaps a very small contribution to RMSE in absolute terms in the regression for non-transformed prices. That might account for your observation; see the toy illustration after this list.
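
To make point 4 concrete, here is a toy illustration with made-up numbers: a cheap unit mispredicted by 50% and an expensive unit mispredicted by about 5%:

```python
import numpy as np

actual    = np.array([200.0, 4000.0])   # cheap unit, expensive unit
predicted = np.array([100.0, 3800.0])

# Raw (absolute) scale: the expensive unit dominates the squared error.
print((actual - predicted) ** 2)                  # [10000., 40000.]

# Log (fractional) scale: the cheap unit dominates by two orders of magnitude.
print((np.log(actual) - np.log(predicted)) ** 2)  # [~0.48, ~0.0026]
```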

In terms of how to proceed, it looks like the log transformation of prices is useful, but your model doesn't deal well with some of the most extreme low prices. You might need to incorporate your knowledge of the underlying subject matter: for example, whether there is something special about such cases (other than that they don't fit well) that makes them inappropriate to include in the model (you might need to exclude rents charged among family members, which can be lower than market rents, or rents in otherwise subsidized units, if any), or whether transformations of some predictor variables might improve performance. Different choices of the tuning parameters for the boosting might also help; a minimal tuning sketch follows. And again, you should consider a different measure of model quality than RMSE on the entire data set.
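
A minimal tuning sketch along those lines, assuming scikit-learn's GridSearchCV and an illustrative (not recommended) parameter grid, again with simulated stand-in data:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))                            # stand-in features
y = np.exp(7 + 0.4 * X[:, 0] + rng.normal(0, 0.3, 2000))  # skewed "prices"

param_grid = {                      # illustrative ranges only
    "n_estimators": [100, 300],
    "max_depth": [3, 5],
    "learning_rate": [0.03, 0.1],
}
search = GridSearchCV(XGBRegressor(), param_grid,
                      scoring="neg_root_mean_squared_error", cv=5)
search.fit(X, np.log(y))            # tuning on the log target, as discussed
print(search.best_params_, -search.best_score_)
```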

Finally, the classic linear regression approach can outperform tree-based approaches in many situations. You could try a standard linear regression with appropriate knowledge-based selection and transformations of predictors and target variable values, with cross-validation or bootstrapping to validate your modeling approach.
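
A sketch of that approach, using scikit-learn's TransformedTargetRegressor so that predictions are automatically back-transformed to the price scale (simulated data again stand in for the real features and prices):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))                            # stand-in features
y = np.exp(7 + 0.4 * X[:, 0] + rng.normal(0, 0.3, 2000))  # skewed "prices"

# Linear regression on log(price); predictions come back on the raw scale.
model = TransformedTargetRegressor(regressor=LinearRegression(),
                                   func=np.log, inverse_func=np.exp)
scores = cross_val_score(model, X, y,
                         scoring="neg_root_mean_squared_error", cv=5)
print(-scores.mean())  # cross-validated RMSE on the price scale
```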