I'm trying to understand the difference between these a bit better. I understand pretty well how random forests work but I guess I'm more hazy on rulefit and how exactly it's different. I know rulefit will incorporate linear components and so can fit linear trends better. What other ways do they differ?
Solved – Difference between rulefit and random forest
machine learning, random forest
Related Solutions
You are asking 2 questions:
1/ How to assess whether 2 sets of features give better results?
Your answer is correct: perform cross-validation in each case and see which feature set performs best and what the error variance is. A rule of thumb is not to include features that are too correlated (there is no precise way to define "too correlated"; trial and error works best, though a correlation of 0.9 is usually considered very high). Also, if you have features of the type A + B = C, you should only include a pair, i.e. (A, C), (B, C), or (A, B). But you can include a feature like A/C, which describes something different and might not be correlated at all with the other variables. Whether you should use (A/C, B/C) or (A, B) depends entirely on your problem; a good place to start is your knowledge of the problem and logic (whether your variable is more related to percentages or to actual counts...).
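The comparison above can be sketched like this; the dataset is synthetic and the feature names (A, B, C) are stand-ins for whatever columns you actually have:

```python
# Sketch: comparing two candidate feature sets via cross-validation.
# Columns 0, 1, 2 stand in for the hypothetical features A, B, C.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=500, n_features=3, noise=10.0, random_state=0)

eps = 1e-9  # guard against division by zero in the ratio features
X1 = X[:, :2]  # feature set (A, B)
X2 = np.column_stack([X[:, 0] / (X[:, 2] + eps),   # feature set (A/C, B/C)
                      X[:, 1] / (X[:, 2] + eps)])

model = RandomForestRegressor(n_estimators=100, random_state=0)
for name, Xs in [("(A, B)", X1), ("(A/C, B/C)", X2)]:
    scores = cross_val_score(model, Xs, y, cv=5, scoring="r2")
    print(f"{name}: mean R^2 = {scores.mean():.3f}, std = {scores.std():.3f}")
```

The mean score tells you which set predicts better; the standard deviation across folds is the error variance mentioned above.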
2/ Now about adding correlated features.
If you were trying to build explanatory models, it could be bad and you would have to be very careful. But that is not what you are trying to do; your model seems to be purely predictive. So the main issue when you add correlated features is that you have more or less the same information twice. There are two problems with that:
Related to random forest. Since a random forest samples a subset of features to build each tree, the information contained in correlated features is twice as likely to be picked as the information contained in any other feature. This can be a problem if a large percentage of your features are correlated.
Not related to random forest. In general, when you add correlated features, they linearly contain the same information, which reduces the robustness of your model. Each time you train, the model might pick one feature or the other to "do the same job", i.e. explain some variance, reduce entropy, etc. So each time you train, depending on your data split, you are actually building a different model.
That being said, random forests usually handle correlated features well because of feature sampling and bagging. Also, two correlated features may still contain very different information, and thus both might be crucial to your model, especially in the case of non-linear models like random forest.
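A quick way to see the "picks one feature or the other" effect is to duplicate a feature and look at the forest's importances; this is a toy sketch on a synthetic dataset:

```python
# Sketch: a duplicated (perfectly correlated) feature splits random-forest
# importance between the two copies. Dataset and seeds are illustrative.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_redundant=0, random_state=0)
X_dup = np.column_stack([X, X[:, 0]])  # append an exact copy of feature 0

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_dup, y)
imp = rf.feature_importances_
# The importance formerly credited to feature 0 is now shared with its copy:
print("original copy:", imp[0], " duplicate:", imp[-1])
```

Neither copy's importance is meaningful on its own; their sum roughly reflects the underlying information, which is why correlated features complicate interpretation more than prediction.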
There was a PhD thesis on random forests that covered RF on large datasets; it is available at https://github.com/glouppe/phd-thesis. I no longer remember the proposed solution, but I remember there were comparative experiments on different alternatives.
Best Answer
In fact, RuleFit performs aggressive pruning on a random forest. It tries to find a subset of the rules generated by the random forest that achieves accuracy as close as possible to that of the forest while reducing the number of rules tremendously. The result is a model consisting of simple, short rules extracted from the random forest: a comprehensible, understandable model built from what is otherwise a black box. How? It builds a linear model over the random forest's rules and, using an optimization method (the Lasso), finds a sparse weight vector that determines which rules are the most important. In the end, only a few rules have non-zero weights, and the rest are removed from the ensemble. There are similar methods with the same aim, such as NodeHarvest, but RuleFit performs better.