Advice on running random forests on a large dataset

computational-statistics, large-data, r, random-forest

I am planning to run random forests to predict a binary outcome. I have a dataset that is relatively large from my point of view: 500,000 rows and around 100 features (a mix of continuous, binary and categorical variables). I am planning to use the "rf" method from the caret package in R (which wraps randomForest).

I used to run random forests on smaller datasets on my personal laptop or on a small AWS EC2 instance.
Any advice on how to run this efficiently in terms of computational power? For instance:

  • If I opt for an AWS EC2 server, which instance type should I use?
  • Should I consider using SparkR (the R frontend for Spark)?
  • Should I consider parallel computing?
  • How much time should I expect the algorithm to take to reach a solution?

Thank you so much! 🙂

Best Answer

Some hints:

  • 500k rows with 100 columns pose no problems to load and prepare, even on a normal laptop. There is no need for big-data tools like Spark; Spark pays off in situations with hundreds of millions of rows.
  • Good random forest implementations like ranger (also available through caret) are fully parallelized: the more cores, the better (see the num.threads sketch after the example below).
  • Random forests do not scale particularly well to large data. Why? Their basic idea is to pool many very deep trees, and growing deep trees eats a lot of resources. Playing with parameters like max.depth and num.trees helps to reduce computation time, but random forests are still not ideal for data of this size. In your situation, maybe 20 minutes with ranger on a normal laptop would be sufficient (a rough guess):
    library(ranger)

    # Simulate data of the same size: 500k rows, 100 numeric features, binary target
    n <- 500000
    p <- 100
    df <- data.frame(matrix(rnorm(n * p), ncol = p))
    df$y <- factor(sample(0:1, n, TRUE))
    object.size(df)  # roughly 400 MB in memory

    head(df)

    # Probability forest with capped tree depth to keep run time manageable
    fit <- ranger(y ~ ., 
                  data = df, 
                  num.trees = 500,
                  max.depth = 8,
                  probability = TRUE)
    fit

[Screenshot: printed summary of the fitted ranger model]
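Regarding the parallelization point above: ranger exposes a num.threads argument, which defaults to the number of available cores, so thread usage can be pinned explicitly if the machine is shared. A minimal sketch, reusing the simulated df from the example above (the specific values are illustrative):

    # Sketch: fit the same forest on a fixed number of threads.
    # num.threads defaults to all available cores; setting it explicitly
    # trades speed for leaving CPU free for other work.
    fit_4threads <- ranger(y ~ .,
                           data = df,
                           num.trees = 500,
                           max.depth = 8,
                           probability = TRUE,
                           num.threads = 4)  # e.g. 4 cores instead of all

Because the trees are grown independently, more cores (or more vCPUs on an EC2 instance) translate fairly directly into shorter fitting time.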

With a higher max.depth, considerably more time will be required.
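To get a feel for how strongly the depth cap matters, one could time a few fits at increasing max.depth on the simulated data. This is an illustrative sketch, not a benchmark from the answer; num.trees is reduced to 100 to keep the loop short:

    # Sketch: compare elapsed fitting time for different tree depths.
    for (d in c(4, 8, 16)) {
      elapsed <- system.time(
        ranger(y ~ ., data = df, num.trees = 100,
               max.depth = d, probability = TRUE)
      )["elapsed"]
      cat("max.depth =", d, "-> elapsed seconds:", round(elapsed), "\n")
    }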