Random Forest – How to Log-Transform Target Variable for Training a Random Forest Regressor

machine learning, random forest, regression

I have a variable that I want to model, which has a skewed distribution. Log-transforming the variable gives a normal-like distribution. When I train a Random Forest regressor on the non-transformed variable, I get worse performance than when I log-transform it. I am a bit puzzled about whether I should do this, knowing that the random forest regressor predicts the mean of the leaves. If it is trained on a log-transformed variable, the prediction is the mean of the logs of the values in the leaves, which (when transformed back) is not equal to the mean of the real values.
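To make the concern concrete, here is a minimal sketch for a single tree leaf. The leaf values are made up for illustration and do not come from a real model; a forest would additionally average such leaf predictions over its trees.

```python
import math

# Hypothetical, right-skewed target values that ended up in one leaf.
leaf_values = [1.0, 2.0, 4.0, 8.0, 100.0]

# Leaf prediction of a tree trained on the raw target: the arithmetic mean.
mean_raw = sum(leaf_values) / len(leaf_values)

# Leaf prediction of a tree trained on log(y), mapped back with exp().
# This is the geometric mean of the leaf values, not their arithmetic mean.
mean_log = sum(math.log(v) for v in leaf_values) / len(leaf_values)
back_transformed = math.exp(mean_log)

print(f"mean of raw values:      {mean_raw:.2f}")
print(f"exp(mean of log values): {back_transformed:.2f}")
```

By the AM-GM inequality the back-transformed value can never exceed the arithmetic mean, so on skewed leaves it sits below the mean of the real values.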

Any opinion?

Best Answer

I will assume that by "better performance" you mean better CV/validation performance, not training performance.

I invite you to think about the effect that log-transforming the target variable has on a single regression tree.

Regression trees choose splits that minimize the MSE, which (considering that each child predicts its mean) means that they minimize the size-weighted sum of the variances of the target in the child nodes.
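Spelled out, under the assumption of a single split into a left child $L$ and a right child $R$ (the symbols $n_L$, $n_R$, $\bar{y}_L$, $\bar{y}_R$ are notation introduced here for the child sizes and child means), the quantity being minimized is:

$$
\sum_{i \in L} (y_i - \bar{y}_L)^2 + \sum_{i \in R} (y_i - \bar{y}_R)^2
= n_L \,\mathrm{Var}(y_L) + n_R \,\mathrm{Var}(y_R)
$$

Each squared deviation grows quadratically with the distance from the child mean, so a few very large $y_i$ can dominate this sum.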

What happens if your target is skewed?
If your variable is skewed, high values will affect the variances and push your split points towards higher values - forcing your decision tree to make less balanced splits and trying to "isolate" the tail from the rest of the points.

Example of a single split on non-transformed and transformed data:
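A toy version of this comparison can be sketched in plain Python. This is only an illustration under simplifying assumptions: one feature, targets already ordered by that feature, and made-up target values (powers of two) chosen to be strongly right-skewed. The helper names `sse` and `best_split` are hypothetical, not part of any library.

```python
import math

def sse(values):
    """Sum of squared deviations from the mean (n times the variance)."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(targets):
    """Return the left-child size of the single split that minimizes total SSE.

    Assumes `targets` is already ordered by the (single) feature, so a split
    simply cuts the list into targets[:k] and targets[k:].
    """
    candidates = range(1, len(targets))
    return min(candidates, key=lambda k: sse(targets[:k]) + sse(targets[k:]))

# Made-up, strongly right-skewed target: 1, 2, 4, ..., 512.
y_raw = [2.0 ** i for i in range(10)]
y_log = [math.log(v) for v in y_raw]

raw_left = best_split(y_raw)
log_left = best_split(y_log)

print(f"raw target: {raw_left} points left, {len(y_raw) - raw_left} right")
print(f"log target: {log_left} points left, {len(y_log) - log_left} right")
```

On this toy data the raw target is split 8 versus 2, isolating the two largest values, while the log-transformed target is split evenly, 5 versus 5.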

As a result, your trees (and therefore the RF) will be more affected by the high-end values if your data is not transformed, which means that they should be more accurate in predicting high values and a bit less accurate on the lower ones.

If you log-transform, you reduce the relative importance of these high values and accept more error on them in exchange for being more accurate on the bulk of your data. This might generalize better and, in general, also makes sense. Indeed, within the same regression, predicting $\hat{y}=105$ when $y=100$ is usually better than predicting $\hat{y}=15$ when $y=11$, because the relative error often matters more than the absolute one.
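The two predictions above can be compared on both scales with a short sketch. The helper names `squared_error` and `squared_log_error` are introduced here for illustration only; the numbers are the ones from the example.

```python
import math

def squared_error(y_true, y_pred):
    """Squared error on the original scale."""
    return (y_true - y_pred) ** 2

def squared_log_error(y_true, y_pred):
    """Squared error after log-transforming both values."""
    return (math.log(y_true) - math.log(y_pred)) ** 2

# (true value, prediction) pairs from the example above.
cases = [(100.0, 105.0), (11.0, 15.0)]

for y, y_hat in cases:
    relative = abs(y_hat - y) / y
    print(
        f"y={y:g}, y_hat={y_hat:g}: "
        f"relative error={relative:.1%}, "
        f"squared error={squared_error(y, y_hat):g}, "
        f"squared log error={squared_log_error(y, y_hat):.4f}"
    )
```

On the original scale the first prediction has the larger squared error (25 versus 16), whereas on the log scale the second one is penalized far more, which matches the ranking by relative error.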

Hope this was useful!