Solved – Using a Random Forest for Time Series Data

Tags: random forest, time series

This is a simple question: is it okay to use a random forest model on time series data? I ask because a random forest bootstraps the observations, i.e. it randomly samples from the training set with replacement. Doesn't this destroy the ordering of the observations, given that the data are a time series?
I'm asking in the context of financial data: say I'm working on a classification problem of whether or not to buy an asset, and I collect daily data on some features to predict this target.

Best Answer

It works well, but only if the features are prepared in such a way that the ordering of the rows no longer matters.

E.g. for a univariate time series $y_i$, you would use $y_i$ as the response and, for example, the following features:

  1. Lagged versions $y_{i-1}$, $y_{i-2}$, $y_{i-3}$ etc.

  2. Differences of appropriate order, e.g. $y_{i-1} - y_{i-2}$, $y_{i-1} - y_{i-8}$ (if weekly seasonality is expected and the observations are daily), etc.

  3. Integer- or dummy-coded periodic time information such as month of year, day of week, hour of day, minute of hour, etc.

The same approach works for different modelling techniques, including linear regression, neural nets, boosted trees etc.
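
To make this concrete before the worked example below, here is a minimal sketch (using a made-up random-walk series; all object names are illustrative and not part of the example that follows) of turning a univariate daily series into an ordinary feature table that any row-wise learner can consume:

# Toy daily series (random walk), purely illustrative
ts_dates <- seq(as.Date("2020-01-01"), by = "day", length.out = 100)
ts_y     <- cumsum(rnorm(100))

# Lag features obtained by shifting the series
lag1 <- c(NA, head(ts_y, -1))
lag2 <- c(NA, NA, head(ts_y, -2))

# Response plus lag, difference and calendar features in one flat table
supervised <- data.frame(
  y    = ts_y,
  lag1 = lag1,
  lag2 = lag2,
  dif1 = lag1 - lag2,
  wday = as.integer(format(ts_dates, "%u")),  # day of week (1-7)
  mon  = as.integer(format(ts_dates, "%m"))   # month of year
)
supervised <- supervised[complete.cases(supervised), ]

# Any row-wise learner can now be fit, e.g. ranger::ranger(y ~ ., data = supervised)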

An example follows, using the binary target "temperature increase" (yes/no):

library(tidyverse)
library(lubridate)
library(ranger)
library(MetricsWeighted) # AUC

# Import
raw <- read.csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/daily-min-temperatures.csv")

# Explore
str(raw)
head(raw)
summary(raw)
hist(raw$Temp, breaks = "FD")

# Prepare and add binary response
prep <- raw %>% 
  mutate(Date = ymd(Date),
         y = year(Date),
         m = month(Date),
         d = day(Date),
         increase = 0 + (Temp > lag(Temp)))

with(prep, table(y))
summary(prep)

# Plot full data -> seasonality at the yearly scale
ggplot(data = prep, aes(x = Date, y = Temp))+
  geom_line(color = "#00AFBB", size = 2) +
  scale_x_date()

# No additional within-year seasonality visible
prep %>% 
  filter(y == 1987) %>% 
ggplot(aes(x = Date, y = Temp))+
  geom_line(color = "#00AFBB", size = 2) +
  scale_x_date()

# Add some lags and diffs & remove incomplete rows
prep <- prep %>% 
  mutate(lag1 = lag(Temp),
         lag2 = lag(Temp, 2L),
         lag3 = lag(Temp, 3L),
         dif1 = lag1 - lag2,
         dif2 = lag2 - lag3) %>% 
  filter(complete.cases(.))

# Train/valid split in blocks
valid <- prep %>% 
  filter(y == 1990)
train <- prep %>% 
  filter(y < 1990)

# Models
y <- "increase" # response
x <- c("lag1", "lag2", "lag3", "dif1", "dif2", "y", "m", "d") # covariables
form <- reformulate(x, y)

# Logistic model: Linear dependence between difs and lags
fit_glm <- glm(form, 
               data = train, 
               family = binomial()) 
summary(fit_glm)

# Random forest
fit_rf <- ranger(form, 
                 data = train,
                 seed = 345345, 
                 importance = "impurity", 
                 probability = TRUE)
fit_rf
barplot(-sort(-importance(fit_rf))) # Variable importance

# Evaluate the glm on 1990 via ROC AUC
pred_glm <- predict(fit_glm, valid, type = "response")
AUC(valid[[y]], pred_glm) # 0.684 ROC AUC

# Then for rf
pred_rf <- predict(fit_rf, valid)$predictions[, 2]
AUC(valid[[y]], pred_rf)    # 0.702 ROC AUC

# View the OOB residuals of the rf within one month to check for leftover structure
random_month <- train %>% 
  mutate(residuals = increase - fit_rf$predictions[, 2]) %>% 
  filter(y == 1987, m == 3) 

ggplot(random_month, aes(x = Date, y = residuals))+
  geom_line(color = "#00AFBB", size = 2) +
  scale_x_date()

Replacing the variables "y" and "m" with factors would probably improve the logistic regression, but since the question was about random forests, I leave this to the reader.
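
That said, here is a minimal sketch of the month part (reusing train, valid and AUC from above; note that turning the year "y" into a factor as well would fail at prediction time, since the validation year 1990 never occurs in the training data):

# Same covariates as before, but month entered as a factor
x_fact <- c("lag1", "lag2", "lag3", "dif1", "dif2", "y", "factor(m)", "d")
fit_glm_fact <- glm(reformulate(x_fact, "increase"),
                    data = train,
                    family = binomial())
pred_glm_fact <- predict(fit_glm_fact, valid, type = "response")
AUC(valid[["increase"]], pred_glm_fact) # compare with the 0.684 of the integer-coded glm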