I am planning to run random forests to predict a binary outcome. I have a relatively large dataset (from my point of view), composed of 500,000 units and around 100 features (a mix of continuous, binary, and categorical variables). I am planning to use the `rf` method from the `caret` package in R.
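For concreteness, this is roughly the call I have in mind (a minimal sketch; `train_df` and `outcome` are placeholder names for my data frame and binary target):

```r
library(caret)

# Sketch of the planned call; 'train_df' has ~500,000 rows and ~100
# features, and 'outcome' is the binary target stored as a factor.
fit <- train(
  outcome ~ .,
  data      = train_df,
  method    = "rf",    # random forest via the randomForest package
  trControl = trainControl(method = "cv", number = 5)
)
```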
I have previously run random forests on smaller datasets on my personal laptop or on a small AWS EC2 instance.
Any advice on how to run this efficiently in terms of computational power? For instance:
- If I opt for an AWS EC2 server, which instance type should I use?
- Should I consider using SparkR (an R frontend for Spark)?
- Should I consider parallel computing? (A sketch of what I would try is shown after this list.)
- How long should I expect the algorithm to take to reach a solution?
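For the parallel-computing point, this is the kind of setup I would try (a sketch using `doParallel`, which `caret` picks up automatically through `foreach`):

```r
library(doParallel)

# Register a parallel backend; caret::train() will then distribute
# resampling iterations across the workers automatically.
cl <- parallel::makePSOCKcluster(parallel::detectCores() - 1)
registerDoParallel(cl)

# ... run caret::train() here ...

parallel::stopCluster(cl)
```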
Thank you so much! 🙂
Best Answer
Some hints:

- `ranger` (available in `caret`) is fully parallelized. The more cores, the better.
- Lowering `max.depth` and `num.trees` helps to reduce computational time. Still, capping them is not ideal, since it can cost predictive accuracy.
- In your situation, maybe 20 minutes with `ranger` on a normal laptop would be sufficient (a rough guess). With a higher `max.depth`, quite a lot of additional time will be required.
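For concreteness, a minimal sketch of a direct `ranger` call with the parameters mentioned above (the values are illustrative, not recommendations):

```r
library(ranger)

# Illustrative settings only; tune num.trees and max.depth for your data.
fit <- ranger(
  outcome ~ .,                 # 'outcome' is the binary target (placeholder name)
  data        = train_df,      # ~500,000 rows, ~100 features
  num.trees   = 500,
  max.depth   = 0,             # 0 = unlimited depth (ranger's default)
  num.threads = parallel::detectCores(),  # use all available cores
  probability = TRUE           # return class probabilities
)
```

Within `caret`, the same engine is available via `method = "ranger"` in `train()`.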