I like to think of this in analogy with the case of linear models, and their extension to GLMs (generalized linear models).
In a linear model, we fit a linear function to predict our response
$$ \hat y = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n $$
To generalize to other situations, we introduce a link function, which transforms the linear part of the model onto the scale of the response (technically this is the inverse link, but I find it easier to think of it this way: transforming the linear predictor into a response, rather than transforming the response into a linear predictor).
For example, the logistic model uses the sigmoid (inverse logit) function
$$ \hat y = \frac{1}{1 + \exp(-(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n))} $$
and Poisson regression uses an exponential function
$$ \hat y = \exp(\beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n) $$
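As a minimal illustration (my own sketch, with made-up coefficients, not tied to any particular fitted model), the three inverse links above amount to:

```python
import numpy as np

# Hypothetical coefficients and features, purely for illustration.
beta0, beta = 0.5, np.array([1.2, -0.7])
X = np.random.default_rng(0).normal(size=(5, 2))

L = beta0 + X @ beta                 # the linear predictor

y_gaussian = L                       # identity inverse link (linear regression)
y_binomial = 1 / (1 + np.exp(-L))    # sigmoid, the inverse logit
y_poisson = np.exp(L)                # exponential inverse link
```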
To construct an analogy with gradient boosting, we replace the linear part of these models with the sum of the boosted trees. So, for example, the Gaussian case (analogous to linear regression) becomes the well-known
$$ \hat y = \sum_i h_i $$
where $h_i$ is our sequence of weak learners. The binomial case is analogous to logistic regression (as you noted in your answer)
$$ \hat y = \frac{1}{1 + \exp\left(-\sum_i h_i\right)} $$
and Poisson boosting is analogous to Poisson regression
$$ \hat y = \exp\left(\sum_i h_i\right) $$
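In code, the only change from the GLM sketch above is that the linear predictor is replaced by a sum over the weak learners. A hedged sketch, where `trees` is assumed to be a list of already-fitted regression trees:

```python
import numpy as np

def boosted_prediction(trees, X, inverse_link=lambda L: L):
    """Sum the weak learners, then map through the inverse link."""
    L = np.sum([tree.predict(X) for tree in trees], axis=0)
    return inverse_link(L)

# Binomial case:  boosted_prediction(trees, X, lambda L: 1 / (1 + np.exp(-L)))
# Poisson case:   boosted_prediction(trees, X, np.exp)
```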
The question remains: how does one fit these boosted models when a link function is involved? For the Gaussian case, where the link is the identity, the often-heard mantra of fitting weak learners to the residuals of the current working model works out, but this does not really generalize to the more complicated models. The trick is to write the loss function being minimized as a function of the linear part of the model (i.e. the $\sum_i \beta_i x_i$ part of the GLM formulation).
For example, the binomial loss (the negative log-likelihood) is usually encountered as
$$ -\sum_i \left[ y_i \log(p_i) + (1 - y_i)\log(1 - p_i) \right] $$
Here, the loss is a function of $p_i$, the predicted values on the same scale as the response, and $p_i$ is a non-linear transformation of the linear predictor $L_i$. Instead, we can re-express the loss as a function of $L_i$ itself (in this case also known as the log-odds):
$$ -\sum_i \left[ y_i L_i - \log(1 + \exp(L_i)) \right] $$
Then we can take the gradient of this with respect to $L$, and boost to directly minimize this quantity.
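Carrying out that differentiation for the binomial case makes the recipe concrete. With $p_i = 1 / (1 + \exp(-L_i))$, the gradient of the loss above with respect to $L_i$ is
$$ \frac{\partial}{\partial L_i} \left[ \log(1 + \exp(L_i)) - y_i L_i \right] = p_i - y_i $$
so each boosting stage fits a weak learner to the negative gradient $y_i - p_i$, which recovers the familiar "fit to the residuals" picture, just on the linear scale.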
Only at the very end, when we want to produce predictions for the user, do we apply the link function to the final sequence of weak learners to put the predictions on the same scale as the response. While fitting the model, we internally work on the linear scale the entire time.
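To make this concrete, here is a hedged end-to-end sketch of binomial boosting on the linear (log-odds) scale, using scikit-learn regression trees as the weak learners. This is my own minimal illustration, not any library's actual implementation: a real implementation would also fit per-leaf step sizes, whereas this sketch uses a single shrinkage factor.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def sigmoid(L):
    return 1 / (1 + np.exp(-L))

def fit_binomial_boosting(X, y, n_stages=100, learning_rate=0.1, max_depth=2):
    """Boost 0/1 targets by working on the log-odds scale throughout."""
    L = np.zeros(len(y))              # start at log-odds 0, i.e. p = 0.5
    trees = []
    for _ in range(n_stages):
        p = sigmoid(L)
        residual = y - p              # negative gradient of the loss w.r.t. L
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        trees.append(tree)
        L += learning_rate * tree.predict(X)
    return trees

def predict_proba(trees, X, learning_rate=0.1):
    # Only here, at prediction time, is the inverse link applied.
    L = learning_rate * np.sum([t.predict(X) for t in trees], axis=0)
    return sigmoid(L)
```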
Best Answer
I think we must first consider whether the outliers are "true data" or simply noise/corrupted input.
If they are corrupted data (e.g. an adult human weighing 775 kg) then it is perfectly reasonable to exclude these instances from further analysis. If these instances are legitimate data, though, we might want to work with them rather than around them. A first obvious fix that does not involve data transformations would be to employ a custom objective function approximating an MAE, a Huberised loss, or a quantile loss. That would minimise the influence of instances that seem highly unnatural. In general, and without confining yourself to gradient boosting, I would suggest looking into robust statistics to get a better idea of how one would classically deal with potentially noisy and/or skewed data (for example, using a GAM with a scaled-t distribution for the family of the response).
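To make the custom-objective route concrete, here is a sketch of a pseudo-Huber objective (a smooth approximation to the Huber loss) written against xgboost's custom-objective interface; the `delta` threshold is an arbitrary value you would tune:

```python
import numpy as np
import xgboost as xgb

def pseudo_huber_obj(delta=1.0):
    """Pseudo-Huber loss: quadratic near zero, linear in the tails."""
    def obj(preds, dtrain):
        r = preds - dtrain.get_label()       # residuals
        scale = np.sqrt(1 + (r / delta) ** 2)
        grad = r / scale                     # first derivative of the loss
        hess = 1 / scale ** 3                # second derivative of the loss
        return grad, hess
    return obj

# Assuming dtrain is an xgb.DMatrix of your data:
# booster = xgb.train({"max_depth": 3}, dtrain, num_boost_round=200,
#                     obj=pseudo_huber_obj(delta=1.0))
```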
As you say, transforming and then back-transforming your data ($\log(x+1)$ being a common choice for strictly non-negative data) is also a reasonable approach. Go for it, just do not get too crazy: while model interpretability is not a prime concern when predicting, if the transformation is too convoluted (e.g. some arbitrary power transformation), debugging and/or improving an existing model becomes even more complicated than it should be. Finally, I would suggest you look into some data competitions that are concerned with skewed variables themselves (e.g. the Allstate Insurance claims severity predictions); these guys have some nifty ideas too!
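For the transform/back-transform route, a minimal sketch on simulated skewed data (the model and simulation here are arbitrary illustration choices):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = rng.lognormal(mean=X[:, 0], sigma=0.5)   # right-skewed, strictly positive

# Fit on the log(1 + y) scale to tame the right tail...
model = GradientBoostingRegressor().fit(X, np.log1p(y))

# ...and back-transform predictions with the inverse, expm1.
y_pred = np.expm1(model.predict(X))
```

One caveat worth keeping in mind: exponentiating predictions fitted on the log scale estimates something closer to a conditional median than a conditional mean, so a bias correction may be needed if mean predictions matter.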