Solved – Model that optimizes mean absolute error always gives same prediction

boosting, least-absolute-deviations, machine learning, regression

My gradient boosting regression model (GBM) is trained to minimize mean absolute error (MAE) but gives the same prediction for every record on my highly skewed dataset. I believe there is a quick fix to the immediate problem (use RMSE) but my situation is complicated, and I worry that using RMSE will lead to a new set of problems that are much worse.

Background:
My model must predict a continuous percentage, not a discrete class. Unfortunately, about 85% of my records have a target (response) value of 100%. Of the remaining 15%, about half have a target of 0%, while the others have values somewhere between 0% and 100%. I suspect that my GBM produces the same score for every record because the median target value at each tree's terminal nodes is 100%, and the median is the constant that minimizes absolute error within a node.
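To illustrate that suspicion, here is a minimal sketch using only the Python standard library. The `targets` list is a made-up sample that mirrors the proportions described above, not my real data, and `mean_absolute_error` is a helper written just for this example. It searches for the single constant prediction with the lowest MAE and compares it with the median.

```python
from statistics import median

# Hypothetical sample mirroring the proportions described above:
# 85 records at 100%, 8 at 0%, and 7 spread between 0% and 100%.
targets = [100.0] * 85 + [0.0] * 8 + [10.0, 25.0, 40.0, 55.0, 70.0, 80.0, 90.0]


def mean_absolute_error(values, constant):
    """MAE of predicting the same constant for every value."""
    return sum(abs(v - constant) for v in values) / len(values)


# Brute-force search over whole-percent constants from 0 to 100.
candidates = [float(c) for c in range(101)]
best_constant = min(candidates, key=lambda c: mean_absolute_error(targets, c))

print(median(targets))  # median of the sample
print(best_constant)    # constant with the lowest MAE
```

On this sample both values come out as 100, so any terminal node whose records look like the overall dataset would predict 100% under an absolute-error loss.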

MAE vs RMSE: I want to minimize MAE because, from my company's business perspective, a prediction that is off by 10% is exactly twice as bad as one that is off by 5%. Using RMSE instead of MAE fixes the identical-predictions problem but skews my predictions heavily toward the outliers and produces many predictions in the range of 15%-20%, well below the market rate. My company would be forced out of business if it had to offer rates that low, because no company would buy the product. On the other hand, my company would also be forced out of business if it always offered rates of 100%, since any outcome below 100% would be a financial loss.

Regression vs Classification: I can't frame this as a classification problem. If I pick some arbitrary threshold (say, 60%) for binary classification, my company basically concedes the entire lower end of the market: it could not differentiate a 40% customer from a 55% customer, because both fall below the example 60% threshold.

What should I do?

Best Answer

You could probably log-transform the target variable (or apply some other rescaling transform) and then use RMSE. That might reduce the impact of the outliers.
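A minimal sketch of what that could look like, using only the Python standard library. Everything named here is an assumption for illustration: `to_log_scale` and `from_log_scale` are hypothetical helpers, the `targets` list is made up, and `log1p` is used instead of a plain log because the question mentions targets of exactly 0%, where a plain log is undefined. The mean of the transformed targets stands in for a fitted model, since that is the constant a squared-error loss would pick.

```python
import math


def to_log_scale(percent):
    """Map a target in [0, 100] to a log scale; log1p keeps 0% finite."""
    return math.log1p(percent)


def from_log_scale(value):
    """Inverse of to_log_scale, returning a percentage."""
    return math.expm1(value)


# Hypothetical targets, in percent.
targets = [100.0, 100.0, 100.0, 0.0, 45.0]

# Train the regressor (with a squared-error loss) on these values
# instead of the raw targets.
log_targets = [to_log_scale(t) for t in targets]

# Stand-in for a fitted model's output: the squared-error-optimal
# constant on the log scale is the mean of the transformed targets.
log_prediction = sum(log_targets) / len(log_targets)

# Map the prediction back to a percentage before scoring or using it.
prediction = from_log_scale(log_prediction)
print(round(prediction, 1))
```

Whether this actually helps is worth checking by computing MAE on the back-transformed predictions for held-out data, since the transform changes which records dominate the squared error.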
