Advice on running random forests on a large dataset

computational-statistics, large-data, r, random-forest

I am planning to run random forests to predict a binary outcome. I have a dataset that is relatively large from my point of view: 500,000 rows and around 100 features (a mix of continuous, binary and categorical variables). I am planning to use the "rf" method from the caret package in R (which wraps randomForest).

I used to run random forests on smaller datasets on my personal laptop or on a small AWS EC2 instance.
Any advice on how to run this efficiently in terms of computational power? For instance:

  • If I opt for an AWS EC2 server, which instance type should I use?
  • Should I consider using SparkR (the R frontend for Spark)?
  • Should I consider parallel computing?
  • How much time should I expect the algorithm to take to reach a solution?

Thank you so much! 🙂

Best Answer

Some hints:

  • 500k rows with 100 columns pose no problems to load and prepare, even on a normal laptop. There is no need for big-data tools like Spark; Spark pays off in situations with hundreds of millions of rows.
  • Good random forest implementations like ranger (also available through caret) are fully parallelized: the more cores, the better (see the num.threads sketch after the example below).
  • Random forests do not scale particularly well to large data. Why? Their basic idea is to pool many very deep trees, and growing deep trees eats a lot of resources. Playing with parameters like max.depth and num.trees helps to reduce computation time, but random forests are still not ideal for data of this size. In your situation, maybe 20 minutes with ranger on a normal laptop would be sufficient (a rough guess):
    library(ranger)

    # Simulate data of the same size: 500k rows, 100 numeric features, binary target
    n <- 500000
    p <- 100
    df <- data.frame(matrix(rnorm(n * p), ncol = p))
    df$y <- factor(sample(0:1, n, TRUE))
    object.size(df)  # roughly 400 MB in memory

    head(df)

    # Probability forest with capped tree depth to keep run time manageable
    fit <- ranger(y ~ ., 
                  data = df, 
                  num.trees = 500,
                  max.depth = 8,
                  probability = TRUE)
    fit

[Screenshot: printed summary of the fitted ranger model]
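Regarding the parallelization point above: ranger exposes a num.threads argument, which defaults to the number of available cores, so thread usage can be pinned explicitly if the machine is shared. A minimal sketch, reusing the simulated df from the example above (the specific values are illustrative):

    # Sketch: fit the same forest on a fixed number of threads.
    # num.threads defaults to all available cores; setting it explicitly
    # trades speed for leaving CPU free for other work.
    fit_4threads <- ranger(y ~ .,
                           data = df,
                           num.trees = 500,
                           max.depth = 8,
                           probability = TRUE,
                           num.threads = 4)  # e.g. 4 cores instead of all

Because the trees are grown independently, more cores (or more vCPUs on an EC2 instance) translate fairly directly into shorter fitting time.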

With a higher max.depth, considerably more time will be required.
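To get a feel for how strongly the depth cap matters, one could time a few fits at increasing max.depth on the simulated data. This is an illustrative sketch, not a benchmark from the answer; num.trees is reduced to 100 to keep the loop short:

    # Sketch: compare elapsed fitting time for different tree depths.
    for (d in c(4, 8, 16)) {
      elapsed <- system.time(
        ranger(y ~ ., data = df, num.trees = 100,
               max.depth = d, probability = TRUE)
      )["elapsed"]
      cat("max.depth =", d, "-> elapsed seconds:", round(elapsed), "\n")
    }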