Solved – In random forest, what happens if I add features that are correlated?

random forest

I'm training a random forest to predict the market shares of future stores in geographical areas. I have many features for these areas, some of which convey similar but slightly different information about the same underlying quantity.

For example, I know the total number of accommodations in the area (housing units? I'm not sure of the exact terminology), and I also have five other columns which are all linked in the following way:

number of main accommodations + number of secondary accommodations + number of holiday accommodations

= number of houses + number of flats

= accommodations

I have the feeling that including them all in my model would be wrong… but each of them might carry important information… Any hint on how I should handle this? Would it be a good idea to include accommodations as an absolute value and include the other five as percentages (of accommodations) rather than as absolute values?

In a similar fashion, I also have the total number of households in the area, the total income of the area, and the average income per household in the area (so that households * average income = total income). I have the feeling that using the average rather than the total income would be a better idea, but how can I be sure I'm right?

(I guess I could train three random forests, using the average income only, the total income only, and both, and see how they perform under cross-validation, but is there a rule of thumb I should know of that would make this go faster?)
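For what it's worth, the comparison described above can be sketched quickly in R with the randomForest package, using the forest's built-in out-of-bag (OOB) error as a cheap stand-in for full cross-validation. The data frame and all column names below (`households`, `avg_income`, `total_income`, `market_share`) are invented for illustration.

```r
library(randomForest)

# Synthetic stand-in for the asker's data.
set.seed(42)
n <- 200
households   <- rpois(n, 500)
avg_income   <- rnorm(n, 30000, 5000)
total_income <- households * avg_income
market_share <- 0.2 + 1e-6 * avg_income + rnorm(n, 0, 0.05)
areas <- data.frame(households, avg_income, total_income, market_share)

# OOB mean squared error after all trees have been grown.
oob_mse <- function(formula, data) {
  rf <- randomForest(formula, data = data, ntree = 500)
  tail(rf$mse, 1)
}

# Compare the three candidate feature sets on the same data.
oob_mse(market_share ~ households + avg_income, areas)
oob_mse(market_share ~ households + total_income, areas)
oob_mse(market_share ~ households + avg_income + total_income, areas)
```

OOB error is not identical to k-fold cross-validation, but for random forest it is usually a good (and free) approximation, so it is a reasonable first filter before running a full CV on the surviving candidates.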

Thanks

(In case it's relevant, I'm using R and the randomForest package)

Best Answer

You are asking 2 questions:

1/ How to assess which of two sets of features gives better results?

Your answer is correct: perform cross-validation in each case, see which one performs best, and look at the error variance as well. The rule of thumb is not to include features that are too correlated (there is no good way to define "too correlated"; trial and error works best, though a correlation of 0.9 is usually considered very high). Also, if you have features of the type A + B = C, you should only include a pair, i.e. (A, C), (B, C) or (A, B). But you can include features like A/C, which describe something different and might not be correlated at all with the other variables. Whether you should use (A/C, B/C) or (A, B) depends entirely on your problem, and a good place to start is your knowledge of the problem and plain logic (whether your target is more related to percentages or to actual counts…).
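Concretely, the asker's accommodation columns could be re-expressed this way: keep the total as an absolute count and turn the component counts into shares of it. A minimal sketch in base R, with all values and column names invented:

```r
# Toy data: total accommodations and three of the component counts.
areas <- data.frame(
  accommodations = c(1000, 250, 4000),
  main_acc       = c( 700, 150, 3000),
  secondary_acc  = c( 200,  60,  600),
  holiday_acc    = c( 100,  40,  400)
)

# Re-express each component as a share of the total.
areas$main_pct      <- areas$main_acc      / areas$accommodations
areas$secondary_pct <- areas$secondary_acc / areas$accommodations
areas$holiday_pct   <- areas$holiday_acc   / areas$accommodations

# The shares sum to 1 by construction, so the A + B = C dependency is
# still there: include at most two of the three shares in the model,
# alongside the absolute total.
rowSums(areas[, c("main_pct", "secondary_pct", "holiday_pct")])
```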

2/ Now about adding correlated features.

If you are trying to build an explanatory model, then it could be bad and you have to be very careful. But that is not what you are doing: your model seems to be purely predictive. The main issue when you add correlated features is therefore that you have more or less the same information twice. There are 2 problems with that:

  • Related to random forest. Since random forest samples a subset of features to build each tree, the information contained in a pair of correlated features is twice as likely to be picked as the information contained in any other single feature. This can be a problem if a large percentage of your features are correlated.

  • Not related to random forest. In general, when you add correlated features, they contain (linearly) the same information, which reduces the robustness of your model. Each time you train, your model might pick one feature or the other to "do the same job", i.e. explain some variance, reduce entropy, etc. So each time you train, depending on your split of the data, you are actually building a different model.
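This "do the same job" effect is easy to see on synthetic data: duplicate a predictor, and the importance the original would have received typically gets shared between the two copies. A minimal sketch with the randomForest package (all data invented):

```r
library(randomForest)

# Synthetic regression problem with one predictor duplicated exactly.
set.seed(1)
x1 <- rnorm(300)
x2 <- rnorm(300)
y  <- x1 + 0.5 * x2 + rnorm(300, 0, 0.1)
d  <- data.frame(y, x1, x1_copy = x1, x2)  # x1_copy is perfectly correlated with x1

rf <- randomForest(y ~ ., data = d, ntree = 300)

# The importance x1 would have on its own is now typically split
# between x1 and x1_copy, in proportions that vary from run to run.
importance(rf)
```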

Now that being said, random forest usually handles correlated features well, thanks to feature sampling and bagging. Also, 2 correlated features might contain very different information, and thus both might be crucial to your model, especially in the case of non-linear models like random forest.
