Solved – Post-processing random forests using regularised regression: what about bias?

Tags: elastic net, machine learning, random forest, regularization

I have been playing around with post-processing the results of the random forest regression algorithm, trying to do better than the default prediction, the mean over all trees. Chapter 16 of The Elements of Statistical Learning discusses how and why this can be done using regularised regression (in that case the elastic net).
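For concreteness, here is a minimal sketch of that idea, assuming scikit-learn and a synthetic dataset (all hyperparameters below are illustrative, not taken from the book): collect each tree's prediction as a column and let an elastic net find a regularised, possibly sparse, weighting of the trees in place of the plain mean.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNetCV
from sklearn.model_selection import train_test_split

# Illustrative data; a separate split is held out for the post-processing fit.
X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)
X_train, X_post, y_train, y_post = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_train, y_train)

# Matrix of per-tree predictions: one column per tree.
P = np.column_stack([tree.predict(X_post) for tree in rf.estimators_])

# The elastic net replaces the default equal-weight mean with learned weights;
# many coefficients can be shrunk to exactly zero (tree subset selection).
enet = ElasticNetCV(l1_ratio=0.5, cv=5).fit(P, y_post)
tree_weights = enet.coef_
```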

On the other hand, I routinely re-calibrate the results of random forests using a logistic model, since I often see the kind of bias discussed in this question, and I find that a simple linear logistic regression model tends to account for it. I have found that the calibration needs to be done on the ensemble prediction; attempting to calibrate the individual trees doesn't work nearly as well.
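A minimal sketch of that re-calibration step, assuming a classification forest and Platt-style scaling (every name and setting below is an illustrative assumption): fit a one-feature logistic regression that maps the ensemble's predicted probability onto the observed outcome.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Illustrative data; a held-out split is used to fit the calibrator.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_cal, y_train, y_cal = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)

# Calibrate the *ensemble* probability, not the individual trees.
p = np.clip(rf.predict_proba(X_cal)[:, 1], 1e-6, 1 - 1e-6)
logit = np.log(p / (1 - p)).reshape(-1, 1)
calibrator = LogisticRegression().fit(logit, y_cal)

def predict_calibrated(X_new):
    """Forest probability passed through the fitted logistic calibrator."""
    p_new = np.clip(rf.predict_proba(X_new)[:, 1], 1e-6, 1 - 1e-6)
    return calibrator.predict_proba(np.log(p_new / (1 - p_new)).reshape(-1, 1))[:, 1]
```

scikit-learn's `CalibratedClassifierCV` with `method="sigmoid"` packages essentially the same idea, but the hand-rolled version makes it explicit that only the ensemble output is being calibrated.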

I am unsure, then, how to proceed when you want to do both of these things. If you do the elastic net regression first, you are finding the optimal weighted mean/subset of the forest ensemble for the biased prediction. I see no reason to expect that this will also be the optimal combination for a re-calibrated prediction. Nor do I see how the elastic net analysis could be run simultaneously with the re-calibration, as the calibration is non-linear.

The non-linear problem of mutually optimising the calibration parameters together with the weights and subset selection is doable in principle, but seems almost intractable computationally for large data sets.
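To make the joint problem concrete, here is a sketch of what "doable in principle" could look like: fold the tree weights and the calibration parameters into a single penalised log-loss and hand it to a generic optimiser. Everything below (the parameterisation, the penalties, the optimiser) is an illustrative assumption, and it is precisely this direct approach that becomes expensive for large data sets and forests.

```python
import numpy as np
from scipy.optimize import minimize

def joint_objective(theta, P, y, l1=0.0, l2=1e-2):
    # theta = [a, w_1, ..., w_T]; P is the (n_samples, n_trees) matrix of
    # per-tree scores; the calibrated prediction is sigmoid(a + P @ w).
    a, w = theta[0], theta[1:]
    p = 1.0 / (1.0 + np.exp(-(a + P @ w)))
    p = np.clip(p, 1e-9, 1 - 1e-9)
    log_loss = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
    # Elastic-net penalty on the tree weights; the L1 term is non-smooth, so a
    # proximal or coordinate method would be more appropriate at scale.
    return log_loss + l1 * np.abs(w).sum() + l2 * (w ** 2).sum()

# Usage sketch, with P (n_samples, n_trees) and binary y from a held-out set:
# theta0 = np.concatenate([[0.0], np.full(P.shape[1], 1.0 / P.shape[1])])
# res = minimize(joint_objective, theta0, args=(P, y), method="L-BFGS-B")
```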

Can anyone suggest an efficient path for achieving both of these aims in an optimised way?

Best Answer

I want to add some thoughts on the problem at hand so that the discussion may move forward. I am not offering a direct solution; rather, I propose something else to think about, so that others may comment on it.

Reading this post and the post in the highlighted link, the aim is to overcome the bias of the RF (possibly in the tails) and to correct its biased output by applying another method, e.g. the elastic net. However, running the elastic net first and then the RF, or vice versa, has not led to a satisfying solution. We stack methods one after the other to overcome their downsides, but without success.

The problem may be buried in the sequential use of two methods, where each method has its say on all features in the equation at once. I am not speaking about parallelism here. I mean: what if we gave every feature its own say, through a function of its own?

When we fit an RF to predict the outcome of some function $f(x)$, we try to model all features $x_1, x_2, \dots, x_n$, the whole equation, all at once with one method. In other words, the summed trees, the averaged opinion of all trees, come to one conclusion on how to deal with all the features in the equation. The elastic net does the same when choosing $\lambda$ for regularisation. But because it is done on all features at once, one or two features may lose out. Someone has to draw the short straw.

Maybe the fix, as the previous poster framed it, is not the real fix, because it looks for yet another method to correct the misbehaviour. Maybe the real fix lies not in stacking methods one after another. What if we gave every feature its say by fitting one function per feature, as in $y = a_0 + a_1 f_1(x_1) + \dots + a_n f_n(x_n)$? I had not heard of this until I read about it myself: I am talking about the InterpretML package by Microsoft. Although at the moment

It only supports trees as base learners (Dr. Robert Kübler)

it may shed some light on the problem.
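For anyone who wants to try it, a minimal sketch using the InterpretML package (`pip install interpret`) on an illustrative synthetic dataset: the Explainable Boosting Machine fits one shape function per feature, matching the additive form $y = a_0 + a_1 f_1(x_1) + \dots + a_n f_n(x_n)$ above.

```python
from interpret.glassbox import ExplainableBoostingRegressor
from sklearn.datasets import make_regression

# Illustrative data; any regression dataset would do.
X, y = make_regression(n_samples=2000, n_features=10, noise=5.0, random_state=0)

# One boosted shape function f_i per feature, combined additively.
ebm = ExplainableBoostingRegressor(random_state=0).fit(X, y)
pred = ebm.predict(X)

# Each feature's learned function can be inspected on its own:
# from interpret import show; show(ebm.explain_global())
```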

I just wanted to offer the thought that we may have to look for a different route altogether, instead of 'correcting' the one we are on.

Microsoft paper: https://arxiv.org/pdf/1909.09223.pdf

Medium: https://towardsdatascience.com/the-explainable-boosting-machine-f24152509ebb