Solved – Reduce Random Forest model memory size

model-evaluation, r, random-forest, regression

I've created a regression model on my data using random forests in R. The resulting object is quite large, and I'm wondering if there's any way to reduce it to only the pieces necessary to make a prediction.

The training data set is also fairly large: 20 variables and ~45,000 rows. My code is listed below.

data <- readRDS("data.Rds")

require("data.table")
require("doParallel")
require("randomForest")

# split into training and test sets, then drop the full table to free memory
train <- data[which(set == "train")]
test <- data[which(set == "test")]
rm(data)

# predictors are in columns 2-21, the response in column 23
x <- data.table(train[, 2:21, with=FALSE])
y <- as.vector(as.matrix(train[, 23, with=FALSE]))

# grow 6 forests of 500 trees each in parallel, then combine them into one model
cl <- makeCluster(detectCores())
registerDoParallel(cl, cores=4)
time <- system.time({rf.fit <- foreach(ntree=rep(500, 6),
                               .combine=combine,
                               .multicombine=TRUE,
                               .packages="randomForest") %dopar%
                   {randomForest(x, y, ntree=ntree)}})
stopCluster(cl)

saveRDS(rf.fit, "rf.fit.Rds")

The output of this is ~230 MB. Once I have the model, is it possible to reduce its size to make it easier to work with? My goals are to identify the important variables and to make predictions on new data.

Best Answer

I used the function below to reduce my default caret output from 137 MB to 3 MB. You can still use the stripped model for prediction via its $finalModel component; a usage sketch follows the function.

## Clean Model to Save Memory

## http://stats.stackexchange.com/questions/102667/reduce-random-forest-model-memory-size
stripRF <- function(cm) {
  # drop the per-observation output stored in the embedded randomForest object
  cm$finalModel$predicted <- NULL
  cm$finalModel$oob.times <- NULL
  cm$finalModel$y <- NULL
  cm$finalModel$votes <- NULL

  # drop caret's resampling indices and its cached copy of the training data
  cm$control$indexOut <- NULL
  cm$control$index    <- NULL
  cm$trainingData <- NULL

  # clear the environments captured by the terms and formula objects, which can
  # otherwise drag large amounts of data into the saved file
  attr(cm$terms, ".Environment") <- c()
  attr(cm$formula, ".Environment") <- c()

  cm
}
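For example, a minimal usage sketch, assuming the model was fitted through caret's train() with method = "rf" on the x and y built in the question (the object and file names here are hypothetical):

## Minimal sketch, assuming a caret-trained random forest (hypothetical names)
library(caret)

fit <- train(x, y, method = "rf", ntree = 500)   # caret wraps randomForest here
fit.small <- stripRF(fit)                        # strip the heavy components
saveRDS(fit.small, "rf.fit.small.Rds")           # much smaller file on disk

## prediction still works through the embedded randomForest object
preds <- predict(fit.small$finalModel, newdata = test[, 2:21, with=FALSE])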
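The model in the question comes straight from randomForest() rather than caret, so the caret-specific pieces (finalModel, control, trainingData) don't exist on it. A minimal sketch of the same idea for a plain randomForest object, assuming only $forest (needed by predict()) and $importance (for variable importance) have to be kept; stripPlainRF and the file name are just illustrative:

## Sketch for a plain randomForest object (hypothetical helper name)
stripPlainRF <- function(rf) {
  rf$predicted <- NULL   # out-of-bag predictions for every training row
  rf$oob.times <- NULL   # how often each row was out-of-bag
  rf$y         <- NULL   # stored copy of the response vector
  rf$votes     <- NULL   # class votes, only present for classification forests
  rf
}

rf.fit.small <- stripPlainRF(rf.fit)
saveRDS(rf.fit.small, "rf.fit.small.Rds")

imp   <- importance(rf.fit.small)                          # importance is kept
preds <- predict(rf.fit.small, test[, 2:21, with=FALSE])   # only needs $forest

With 3,000 trees the $forest component itself may still dominate the object size, so the reduction here can be more modest than for the caret wrapper.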