Random Forest Regression and trended time-series

random forest, time series, trend

I am comparing a random forest model to a GLS model using a univariate time series that has a deterministic linear trend. I am going to add a linear time trend covariate (among other predictors) to the GLS model to account for the changing trend. To be consistent in my comparison, I was hoping to add this predictor to the random forest regression model as well. I have been looking for literature on this subject and can't find much.

Does anyone know if adding this type of predictor is inappropriate in a random forest regression for any reason? The random forest regression already includes time-lagged variables to account for autocorrelation.

Best Answer

RFs can, of course, identify and model a long-term trend in the data. However, the issue becomes more complicated when you are trying to forecast out to never-before-seen values, as you often are with time-series data. For example, if you see that activity increases linearly over a period between 1915 and 2015, you would expect it to continue to do so in the future. An RF, however, would not make that forecast. It would forecast all future values to have the same activity as 2015, because each tree can only predict an average of target values it saw during training.

from sklearn import ensemble
import numpy as np

# Training years 1916..2015, one feature column; the final training year is 2015
years = np.arange(1916, 2016).reshape(-1, 1)
print('Final year is %s' % years[-1][0])
# Say your ts goes up by 1 each year: a perfect linear trend
ts = np.arange(1, 101)
# Use the regressor rather than the classifier, since the target is continuous
est = ensemble.RandomForestRegressor().fit(years, ts)
# Three years inside the training range, then three years beyond it
print(est.predict([[2013], [2014], [2015], [2016], [2017], [2018]]))

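The predict call in the above script will print values at or near 98, 99 and 100 for 2013 to 2015 (the values of ts in those years), and then repeat the 2015 prediction for 2016, 2017 and 2018. Adding lag variables into the RF does not help in this regard. So be careful. I'm not sure that adding trend data to your RF is going to do what you think it will.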
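To illustrate the point about lag variables, here is a minimal sketch on the same toy series. It assumes three lags and a simple recursive forecast loop; `n_lags`, `history` and `forecasts` are names made up for this example, and the lag order is an arbitrary choice, not something taken from the question.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Same toy series as above: 1, 2, ..., 100, a perfect linear trend
ts = np.arange(1, 101, dtype=float)
n_lags = 3  # hypothetical lag order, chosen only for illustration

# Row j holds (ts[j], ts[j+1], ts[j+2]); the target is the next value ts[j+3]
X = np.column_stack([ts[i:len(ts) - n_lags + i] for i in range(n_lags)])
y = ts[n_lags:]

est = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Recursive forecast: feed each prediction back in as the newest lag
history = list(ts)
forecasts = []
for _ in range(5):
    x_next = np.array(history[-n_lags:]).reshape(1, -1)
    y_hat = est.predict(x_next)[0]
    forecasts.append(y_hat)
    history.append(y_hat)

print(forecasts)
# The largest target seen in training, for comparison
print(y.max())
```

The true continuation of the series would be 101 through 105, but the forecasts should not exceed the largest training target (100), since every tree predicts an average of targets it was trained on. The lags change the inputs, not the range of outputs the forest can produce.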
