Extrapolation using machine learning models under specific assumptions

extrapolation, neural networks, partial-dependency-plots, random forest

I have a problem that inherently requires extrapolation. I am aware that this is a critical issue for most (if not all) machine learning models.

Yet, given the physical phenomenon underlying the experiment, there is some expert knowledge that could be used to validate the extrapolation in this case. For example, with linear regression, one could combine the model with prior knowledge to argue that it is safe to use the fitted parameters for extrapolation in that particular case.

My case: after training a Random Forest model on a large data set, I applied it to a new data set in which 20% of the data points lie outside the calibration zone. Because RF is non-parametric, I compared the behaviour of the model on both data sets using partial dependence plots. The plots clearly show the threshold-based behaviour of decision trees: any value outside the calibration zone is grouped together with the extremes of the calibration zone.
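The grouping behaviour described above can be reproduced on toy data. This is only a minimal sketch, assuming scikit-learn is available; the synthetic linear data set, the range [0, 10], and all variable names are illustrative assumptions and not the data from my experiment.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Synthetic calibration data (assumption): a linear trend plus noise on [0, 10].
X_train = rng.uniform(0.0, 10.0, size=(500, 1))
y_train = 2.0 * X_train[:, 0] + rng.normal(0.0, 0.5, size=500)

rf = RandomForestRegressor(n_estimators=100, random_state=0)
rf.fit(X_train, y_train)

# Points beyond the calibration zone end up in the same leaves as the
# largest training value, so they all receive the same prediction.
X_out = np.array([[12.0], [15.0], [20.0]])
pred_out = rf.predict(X_out)
pred_edge = rf.predict([[X_train.max()]])

print("outside the range:", pred_out)
print("at the upper edge:", pred_edge)
```

In this sketch the underlying trend keeps rising, but the forest's prediction stays flat at the value it learned near the upper edge.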

Yet, the plots do suggest a general trend that could be "safely" extrapolated by a multiple linear regression model, as long as we regard the parameters of that model as plausible/realistic enough.

My initial idea then is to use some sort of stacking to combine RF with some linear or polynomial model and see if this could be achieved. I wonder if there are any fundamental flaws in this rationale or if this could actually be achieved?
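To make the idea concrete, here is one possible way to combine the two models. Note that this is residual fitting rather than stacking in the strict sense: a linear model captures the global trend and a Random Forest is fitted to what the linear model leaves unexplained. It is a sketch under stated assumptions: scikit-learn, synthetic data, and a hypothetical helper named `predict_hybrid`.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Synthetic data (assumption): linear trend plus a local nonlinearity.
X = rng.uniform(0.0, 10.0, size=(600, 1))
y = 2.0 * X[:, 0] + np.sin(3.0 * X[:, 0]) + rng.normal(0.0, 0.2, size=600)

# Step 1: the linear model carries the trend that is assumed to be
# physically plausible outside the calibration zone.
linear = LinearRegression().fit(X, y)

# Step 2: the forest models only the residual structure inside the zone.
residual_rf = RandomForestRegressor(n_estimators=100, random_state=1)
residual_rf.fit(X, y - linear.predict(X))


def predict_hybrid(X_new):
    """Linear trend plus the forest's (locally constant) residual correction."""
    return linear.predict(X_new) + residual_rf.predict(X_new)


X_far = np.array([[15.0], [20.0]])
hybrid_far = predict_hybrid(X_far)
print(hybrid_far)
```

Outside the calibration zone the forest contributes a constant offset, so the extrapolated slope is entirely that of the linear model. Whether that slope is realistic is exactly what the expert knowledge would have to confirm.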

A second point is: Are there alternatives to this issue, maybe neural networks or different models that could extrapolate to a certain point, so then I could validate the behaviour and decide if it is realistic or not?

[Figure: partial dependence plots]

Best Answer

My initial idea then is to use some sort of stacking to combine RF with some linear or polynomial model

Random Forest is not a suitable tool for this. I would try a Gaussian Process with Neural Network and Squared Exponential kernels, modelling the mean as a linear function. See the lower three charts in an example from Golding & Purse 2016:

[Figure from Golding & Purse 2016; image not available]

Of course, there is much more you can achieve with Gaussian Processes: you can write any formula and design any kernel function for your purpose, and then test them.
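As a rough illustration of the Gaussian Process approach, here is a minimal sketch using scikit-learn. Two assumptions should be stated loudly: scikit-learn's `GaussianProcessRegressor` has no explicit mean-function argument, so the linear trend is approximated here by adding a `DotProduct` (linear) kernel; and scikit-learn ships no Neural Network kernel, so only the Squared Exponential (`RBF`) kernel is shown. The data are synthetic.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (
    RBF,
    ConstantKernel,
    DotProduct,
    WhiteKernel,
)

rng = np.random.default_rng(2)

# Synthetic data (assumption): linear trend plus a smooth wiggle on [0, 10].
X = rng.uniform(0.0, 10.0, size=(80, 1))
y = 2.0 * X[:, 0] + np.sin(X[:, 0]) + rng.normal(0.0, 0.2, size=80)

# Linear kernel stands in for a linear mean; RBF captures smooth local
# deviations; WhiteKernel absorbs observation noise.
kernel = (
    DotProduct(sigma_0=1.0)
    + ConstantKernel(1.0) * RBF(length_scale=1.0)
    + WhiteKernel(noise_level=0.1)
)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)
gp.fit(X, y)

# One point inside the data range and one well outside it.
X_new = np.array([[5.0], [15.0]])
mean, std = gp.predict(X_new, return_std=True)
print("mean:", mean)
print("std: ", std)
```

The useful part for extrapolation is the predictive standard deviation: it should grow as you move away from the data, which gives you a way to judge how far the extrapolation can be trusted.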

A second point is: Are there alternatives to this issue, maybe neural networks or different models that could extrapolate to a certain point, so then I could validate the behaviour and decide if it is realistic or not?

In extrapolation, you are entering territory that is totally unknown to you. You need to make some assumptions about it. You need a model of the data that will tell you what happens beyond the range of your data. The question then is how valid that model is. You could do a cross-validation by cutting off some of your data at the edges and using it as validation data, and then assume that the data will behave the same way outside of your data range. Of course, this assumption might be wrong, but you need to make some assumption when you are making predictions about unknown territory.
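The edge hold-out check described above can be sketched as follows. This is an illustration under assumptions: scikit-learn, synthetic linear data, and an arbitrary choice to hold out the top 20% of the predictor range; with real data you would pick the cut according to where you expect to extrapolate.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(3)

# Synthetic data (assumption): linear trend plus noise on [0, 10].
X = rng.uniform(0.0, 10.0, size=(1000, 1))
y = 2.0 * X[:, 0] + rng.normal(0.0, 0.5, size=1000)

# Hold out the upper edge of the data as a pseudo-extrapolation set.
cutoff = np.quantile(X[:, 0], 0.8)
inner = X[:, 0] <= cutoff
X_in, y_in = X[inner], y[inner]
X_edge, y_edge = X[~inner], y[~inner]

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=100, random_state=3),
}

# Train on the inner region only, then score on the held-out edge.
edge_rmse = {}
for name, model in models.items():
    model.fit(X_in, y_in)
    pred = model.predict(X_edge)
    edge_rmse[name] = float(np.sqrt(mean_squared_error(y_edge, pred)))

print(edge_rmse)
```

A model that scores well on the held-out edge has at least shown that it can extrapolate over that distance on this data; it does not prove that the same behaviour holds further out.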
