Solved – non-linear regression: Residual Plots and RMSE on raw and log target variable

Tags: boosting, nonlinear regression, regression, residuals

I have rental prices of houses, and the distribution is somewhat right-skewed (skewness of 1.84 by scipy.stats.skew).

Fig. 1: Distribution of the rental prices.
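
For reference, a minimal sketch of that skewness check, using simulated log-normal data as a stand-in for the actual prices (which are not shown here):

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(0)
# Simulated stand-in for the rental prices: log-normal, hence right-skewed.
prices = rng.lognormal(mean=7.0, sigma=0.5, size=1000)

print(skew(prices))          # clearly positive, as for the real prices (1.84)
print(skew(np.log(prices)))  # close to 0 after the log transform
```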

I run XGBRegressor on the non-transformed data (1.) and on the log-transformed target variable (2.). In 2. I also exponentiate the target variable and predictions (bringing them back to the raw scale) to compare the residual plots; a code sketch of this comparison follows Fig. 4.

  1. I run XGBRegressor on the non-transformed data and get a test RMSE of 177, with the following residual plot and predicted vs. real prices. The plot shows that the data suffer from heteroscedasticity.

Fig. 2: Residuals and predicted vs. real prices (raw target).

  2. If I run XGBRegressor on the log-transformed data, I get a test RMSE of 180.5 (after exponentiating back), with the following residual plot and predicted vs. real prices (on the log scale).

Fig. 3: Residuals and predicted vs. real prices (log target).
If I exponentiate the log target variable and the predictions from 2., I get the following plots. They look almost the same as the plots from 1. and show the same problem with heteroscedasticity. So I did not get any improvement.

Fig. 4: Residuals and predicted vs. real prices (log target, exponentiated back to the raw scale).
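
A minimal sketch of the comparison in 1. and 2., with simulated data standing in for the real features and prices (the actual data and XGBoost settings are not given above):

```python
import numpy as np
from xgboost import XGBRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))                            # stand-in features
y = np.exp(7 + 0.4 * X[:, 0] + rng.normal(0, 0.3, 2000))  # skewed "prices"
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 1. Fit on the raw target.
raw_model = XGBRegressor().fit(X_train, y_train)
rmse_raw = np.sqrt(mean_squared_error(y_test, raw_model.predict(X_test)))

# 2. Fit on the log target, then exponentiate the predictions back.
log_model = XGBRegressor().fit(X_train, np.log(y_train))
pred_back = np.exp(log_model.predict(X_test))
rmse_back = np.sqrt(mean_squared_error(y_test, pred_back))

# Residuals for the diagnostic plots, on each scale.
resid_raw = y_test - raw_model.predict(X_test)
resid_log = np.log(y_test) - log_model.predict(X_test)
print(rmse_raw, rmse_back)
```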

My questions are:

  1. From what I see, the log transformation resolved the skewness but did not improve the RMSE (I guess because I use a non-linear algorithm, so the algorithm is not affected by the skewness). That would mean I should not log-transform the target and should not pay attention to the residual plots. Is that correct?

  2. Should I be bothered by the skewness when I use non-linear regressors, and should I do the log transformation? If not, but I still see an improvement in metrics (RMSE) on the log data, am I allowed to say 'I logged it because it improved my metrics' (which sounds like very weak argumentation)? What is the explanation for this behaviour? Is it because of the convergence of the optimisation algorithm?

  3. Also, is a residual plot still a good diagnostic when I use a non-linear regressor?

  4. When I use a linear regressor and my data are skewed, but the RMSE on the non-transformed data is lower than on the log-transformed data, what should I do and what does it mean (or why does it happen)?

Best Answer

Although your question is based on a boosted regression tree algorithm (XGBRegressor, with which I have no experience), your issues are ones also faced in standard linear regression, on which I base much of my answer. How to proceed depends on what you are trying to accomplish. To start, a few issues need to be clarified.

First, a regression model is typically considered linear if it is linear in the parameters. In this terminology, non-linear transformations of predictor variables or outcome/target variables do not by themselves make a model non-linear. As you are evidently using boosted regression trees instead of a classical linear regression, it's not clear that the linear/non-linear distinction really applies, anyway.

Second, there is no need for the target variable to have a symmetric distribution in standard linear regression, although that can help with regression trees. As your appropriate focus on the residual plots indicates, the distribution of residuals is important.

Third, RMSE of your entire data set (which is what you seem to be showing) might not be the best measure of the quality of your model. Particularly if you intend to use the model for predictions on new cases, cross-validation or bootstrapping could provide much better estimates of such future performance.
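
For example, a cross-validated RMSE estimate might look like the following sketch (assuming scikit-learn, with the same simulated stand-in data as above):

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))                            # stand-in features
y = np.exp(7 + 0.4 * X[:, 0] + rng.normal(0, 0.3, 2000))  # skewed "prices"

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(XGBRegressor(), X, y,
                         scoring="neg_root_mean_squared_error", cv=cv)
print(-scores.mean())  # cross-validated RMSE, a better guide to future error
```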

Now to your questions:

  1. The regressions for the linear and log-transformed prices attempt different things: the first tries to minimize the error in absolute terms (e.g., dollars); the second tries to minimize the error in relative/fractional terms (e.g., percentage error in predicted price). Which type of error do you care about for your application? If you care more about fractional errors, you should use the log transform, and you shouldn't worry that the error in absolute terms appears bigger when you back-transform from the log scale. You should always, however, pay attention to residuals in whichever scale you choose.

  2. Skewness in the target variable is not by itself an issue in standard linear regression, provided that some predictor variables are appropriately skewed as well so that the residuals are not skewed. Transformation of both predictor and target variables is often needed to meet the assumptions of a linear regression and to produce well-behaved residuals. With tree-based regression approaches that use mean values of target variables to choose cutoffs for the trees, removing skewness in the target variable can be recommended; the authors of ISLR do a log transform for this purpose in their example of a regression tree (pp. 304 and following). Log transformation means that residuals will be in fractional rather than absolute terms, which seems to make sense for these types of data.

  3. You should always pay attention to residual plots.

  4. Comparing the RMSE of the regression for non-transformed prices against the exponentiated RMSE of the regression for log-transformed prices isn't always very useful. In your log-transformed analysis, the case with the largest absolute error on the log (fractional) scale is also the lowest in absolute predicted price. That case will make a large contribution to the RMSE on the log-transformed scale (as will several other low-price cases), but perhaps a very small contribution to RMSE in absolute terms in the regression for non-transformed prices. That might account for your observation; see the toy illustration after this list.
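
To make point 4 concrete, here is a toy illustration with made-up numbers: a cheap unit mispredicted by 50% and an expensive unit mispredicted by about 5%:

```python
import numpy as np

actual    = np.array([200.0, 4000.0])   # cheap unit, expensive unit
predicted = np.array([100.0, 3800.0])

# Raw (absolute) scale: the expensive unit dominates the squared error.
print((actual - predicted) ** 2)                  # [10000., 40000.]

# Log (fractional) scale: the cheap unit dominates by two orders of magnitude.
print((np.log(actual) - np.log(predicted)) ** 2)  # [~0.48, ~0.0026]
```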

In terms of how to proceed, it looks like the log transformation of prices is useful, but your model doesn't deal well with some of the most extreme low prices. You might need to incorporate your knowledge of the underlying subject matter: for example, whether there is something special about such cases (other than that they don't fit well) that makes them inappropriate to include in the model (you might need to exclude rents charged among family members, which can be lower than market rents, or rents in otherwise subsidized units, if any), or whether transformations of some predictor variables might improve performance. Different choices of the tuning parameters for the boosting might also help; a minimal tuning sketch follows. And again, you should consider a different measure of model quality than RMSE on the entire data set.
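
A minimal tuning sketch along those lines, assuming scikit-learn's GridSearchCV and an illustrative (not recommended) parameter grid, again with simulated stand-in data:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))                            # stand-in features
y = np.exp(7 + 0.4 * X[:, 0] + rng.normal(0, 0.3, 2000))  # skewed "prices"

param_grid = {                      # illustrative ranges only
    "n_estimators": [100, 300],
    "max_depth": [3, 5],
    "learning_rate": [0.03, 0.1],
}
search = GridSearchCV(XGBRegressor(), param_grid,
                      scoring="neg_root_mean_squared_error", cv=5)
search.fit(X, np.log(y))            # tuning on the log target, as discussed
print(search.best_params_, -search.best_score_)
```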

Finally, the classic linear regression approach can outperform tree-based approaches in many situations. You could try a standard linear regression with appropriate knowledge-based selection and transformations of predictors and target variable values, with cross-validation or bootstrapping to validate your modeling approach.
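
A sketch of that approach, using scikit-learn's TransformedTargetRegressor so that predictions are automatically back-transformed to the price scale (simulated data again stand in for the real features and prices):

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))                            # stand-in features
y = np.exp(7 + 0.4 * X[:, 0] + rng.normal(0, 0.3, 2000))  # skewed "prices"

# Linear regression on log(price); predictions come back on the raw scale.
model = TransformedTargetRegressor(regressor=LinearRegression(),
                                   func=np.log, inverse_func=np.exp)
scores = cross_val_score(model, X, y,
                         scoring="neg_root_mean_squared_error", cv=5)
print(-scores.mean())  # cross-validated RMSE on the price scale
```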