Solved – Effect of features that are highly correlated with each other on a decision tree

boosting, correlation, feature selection, regularization

I have a dataset of roughly 500 features and am training a binary classifier using GBM (gradient boosted machines), an ensemble of decision trees. Of these 500 variables, I am sure some are highly correlated with each other, though probably not to the point of being linearly dependent. For example, one variable might be the average age of people in city X as collected by survey 1, and another the average age of people in city X as collected by survey 2. How does such a set of highly correlated features affect the decision trees? In the regression setting, correlated predictors inflate the variance of the estimates, which can be mitigated by regularization.
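For concreteness, here is a minimal sketch of how one might flag such highly correlated pairs before training. Everything in it is illustrative: a small random DataFrame stands in for the 500-feature data, and the 0.9 threshold is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Illustrative stand-in for the real dataset.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(1000, 5)),
                 columns=[f"f{i}" for i in range(5)])
# Make f1 a noisy copy of f0, like two surveys measuring the same quantity.
X["f1"] = X["f0"] + rng.normal(scale=0.1, size=len(X))

corr = X.corr().abs()
# Keep only the upper triangle so each pair appears once, then list
# pairs whose absolute correlation exceeds the chosen threshold.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs > 0.9])
```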

Best Answer

I would expect that if a decision tree used one of the highly correlated variables, it would be less likely to use the other: once the first has been split on, the second carries almost the same information, so it offers little additional impurity reduction.

So across your ensemble, some trees may split on one of the variables and other trees on the other, as the sketch below illustrates.
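Here is a small sketch of that effect using scikit-learn's GradientBoostingClassifier. The data is a toy construction, not the asker's: x0 and x1 are two noisy measurements of the same latent quantity (like the two surveys), and x2 is unrelated noise. With subsampling, different trees see slightly different data, so whichever of the two correlated columns looks marginally better gets picked, and the impurity-based importance is shared between them.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
n = 2000
z = rng.normal(size=n)                    # latent true signal
x0 = z + rng.normal(scale=0.1, size=n)    # "survey 1" measurement
x1 = z + rng.normal(scale=0.1, size=n)    # "survey 2" measurement
x2 = rng.normal(size=n)                   # unrelated noise feature
X = np.column_stack([x0, x1, x2])
y = (z > 0).astype(int)

model = GradientBoostingClassifier(
    n_estimators=200, max_depth=2, subsample=0.8, random_state=0
).fit(X, y)
# The importance of the signal tends to be split between the two
# correlated columns rather than concentrated in only one of them.
print(model.feature_importances_)
```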
