Solved – XGBoost vs Random Forest for Time Series Regression Forecasting

boosting, machine learning, random forest, regression, time series

I am using R's implementations of XGBoost and random forest to generate 1-day-ahead forecasts for revenue. I have about 200 rows and 50 predictors, and the number of rows grows as I move forward in time and accumulate more data.

The XGBoost model with the parameters below is 6% worse than an off-the-shelf random forest model in terms of mean squared error. The random forest model, in turn, is slightly more accurate than an autoregressive time series forecast. (I haven't tried ARIMAX yet, to be honest.)

For XGBoost, I tried lowering eta to 0.02 and raising num_rounds to 8,000, but the model now takes a long time to train. Is there some kind of guide I can use to improve the forecast accuracy of the XGBoost model? Am I using the multi-core feature properly?

I feel as though I am fumbling around in the dark with marginal payoff. If it helps, I am running on a Core i7 with 12 GB of RAM under Windows 7 Professional. I appreciate your assistance!

library(randomForest)
library(xgboost)
library(Matrix)

# Random forest benchmark: default settings, with 'act' as the response
rf.mod  <- randomForest(act ~ ., data = train)
rf.pred <- predict(rf.mod, newdata = test)
#####################################
# Sparse model matrices for xgboost ('act' is the first column of train)
train_x <- sparse.model.matrix(~., data = train[, 2:ncol(train)])
train_y <- train$act
test_x  <- sparse.model.matrix(~., data = test)

xgtrain <- xgb.DMatrix(data = train_x, label = train_y)
xgtest  <- xgb.DMatrix(data = test_x)

num_rounds <- 1000

# Custom evaluation metric; NormalizedGini() must be defined or loaded
# separately (e.g. from the MLmetrics package)
evalgini <- function(preds, dtrain) {
  labels <- getinfo(dtrain, "label")
  err <- NormalizedGini(as.numeric(labels), as.numeric(preds))
  return(list(metric = "Gini", value = err))
}
param <- list("objective" = "reg:linear",
              "eta" = 0.2,
              "min_child_weight" = 5,
              "subsample" = .8,
              "colsample_bytree" = .8,
              "scale_pos_weight" = 1.0,
              "max_depth" = 8)
xg.mod  <- xgb.train(params = param, data = xgtrain, feval = evalgini,
                     nround = num_rounds, print.every.n = num_rounds,
                     maximize = TRUE)
xg.pred <- predict(xg.mod, xgtest)

Best Answer

The easiest way to handle 'tuning' of the num_rounds parameter is to let XGBoost do it for you. Set the early stopping parameter (early_stopping_rounds in recent releases, early.stop.round in the older R package used below) to n in xgb.train, and training will stop once the validation error hasn't improved for n rounds.
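
If you would rather not carve out a validation set by hand, xgb.cv supports the same early stopping and picks the number of rounds by cross-validation. Here is a minimal sketch reusing the param and xgtrain objects from your code, assuming a recent xgboost release (where the arguments are spelled nrounds / early_stopping_rounds); keep in mind that random CV folds ignore the time ordering of your data, so the time-ordered holdout in the example below is arguably safer for a forecasting problem.

cv <- xgb.cv(params = param,
             data = xgtrain,
             nrounds = 5000,               # generous upper bound
             nfold = 5,
             early_stopping_rounds = 50,   # stop after 50 rounds without improvement
             verbose = FALSE)

best_n <- cv$best_iteration                # field name varies across versions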

See this example from the Liberty Mutual Kaggle Competition:

As noted in the code below, you'll also need to supply the watchlist parameter, with the validation set listed first, for early stopping to work.

# The readr library is the best way to read and write CSV files in R
library(readr)
library(xgboost)
library(data.table)
library(Matrix)
library(caret)

# The competition datafiles are in the directory ../input
# Read competition data files:
train <- read_csv("../input/train.csv")
test <- read_csv("../input/test.csv")

# keep copy of ID variables for test and train data
train_Id <- train$Id
test_Id  <- test$Id

# response variable from training data
train_y <- train$Hazard

# predictor variables from training
train_x <- subset(train, select = -c(Id, Hazard))
train_x <- sparse.model.matrix(~., data = train_x)

# predictor variables from test
test_x <- subset(test, select = -c(Id))
test_x <- sparse.model.matrix(~., data = test_x)

# Set xgboost parameters
param <- list("objective" = "reg:linear",
              "eta" = 0.05,
              "min_child_weight" = 10,
              "subsample" = .8,
              "colsample_bytree" = .8,
              "scale_pos_weight" = 1.0,
              "max_depth" = 5)

# Use the first 5000 rows as a validation set for early stopping.
offset <- 5000
num_rounds <- 1000

# Set xgboost test and training and validation datasets
xgtest <- xgb.DMatrix(data = test_x)
xgtrain <- xgb.DMatrix(data = train_x[(offset + 1):nrow(train_x), ], label = train_y[(offset + 1):nrow(train_x)])
xgval   <- xgb.DMatrix(data = train_x[1:offset, ], label = train_y[1:offset])

# setup watchlist to enable train and validation, validation must be first for early stopping
watchlist <- list(val=xgval, train=xgtrain)
# to train with watchlist, use xgb.train, which contains more advanced features

# this will use default evaluation metric = rmse which we want to minimise
bst1 <- xgb.train(params = param, data = xgtrain, nround = num_rounds,
                  print.every.n = 20, watchlist = watchlist,
                  early.stop.round = 50, maximize = FALSE)
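
Once training stops, recent xgboost releases store the chosen round on the model (e.g. bst1$best_iteration). The same idea carries over directly to your revenue data: because your observations are time-ordered, hold out the most recent rows (rather than a random sample) as the validation set. A minimal sketch reusing the train_x / train_y / xgtest objects from your code; the 20% split, the 50-round patience, and the nthread value are arbitrary choices for illustration, not tuned recommendations.

n_val   <- floor(0.2 * nrow(train_x))                  # most recent ~20% of rows for validation
val_idx <- (nrow(train_x) - n_val + 1):nrow(train_x)

xgtrain <- xgb.DMatrix(data = train_x[-val_idx, ], label = train_y[-val_idx])
xgval   <- xgb.DMatrix(data = train_x[val_idx, ],  label = train_y[val_idx])

watchlist <- list(val = xgval, train = xgtrain)        # validation set listed first

param$nthread <- 4   # xgboost is multi-threaded; set the thread count explicitly if you want

xg.mod <- xgb.train(params = param, data = xgtrain,
                    nround = 5000,                     # generous upper bound
                    watchlist = watchlist,
                    early.stop.round = 50,             # same argument name as above
                    print.every.n = 20,
                    maximize = FALSE)                  # default metric is rmse, to be minimised

xg.pred <- predict(xg.mod, xgtest)

With only ~200 rows, training should stop long before the 5,000-round cap, so a smaller eta (such as the 0.02 you tried) becomes affordable without guessing num_rounds up front.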