I had a discussion with a colleague some time ago, and he stated that correlation between a predictor variable and the outcome variable is not a good measure for feature selection, especially for tree-based classification/regression methods. His justification was that a certain predictor $A$ might not be correlated with the outcome $Y$ at all, yet after a split on some other predictor $B$ (remember, we are talking about tree-based methods), predictor $A$ may reveal great predictive power. As a note: it does not necessarily have to be correlation; what he actually meant is any pre-training (filter-approach) measure that scores one predictor variable at a time.
At the time, I agreed with him because it seemed quite intuitive, but now that I think about it, I can't actually construct an example that proves the statement. Whenever I try, I get the impression that if $B$ can split $A$ in a way that makes it highly correlated with $Y$ (within the subsets defined by $B$), then $B$ itself must already be highly correlated with $Y$, so there is probably no need for $A$ anyway. On the other hand, the statement must be somehow true, because it is essentially how tree-based methods work. So maybe correlation can improve after a split, but only by a small amount, and both variables have to be meaningfully correlated with $Y$ anyway (meaning that feature elimination based on low correlation would still make sense)?
So my main question is: is the statement true, and if it is, could you provide an example?
I have one side question as well: is it possible to combine two predictor variables that are very weakly correlated with the outcome variable, in such a way that an engineered third predictor (e.g. $C = A \cdot B$, as a silly example) is very strongly correlated with the outcome variable?
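For what it's worth, a quick sketch (my own illustration, not from the original post; the variables and the XOR-style construction are made up for demonstration) suggests the answer to this side question is yes:

```r
# Hypothetical XOR-style example: A and B are each nearly uncorrelated
# with Y, but their product C = A*B matches Y exactly.
set.seed(42)
n <- 1000
A <- sample(c(-1, 1), n, replace = TRUE)
B <- sample(c(-1, 1), n, replace = TRUE)
Y <- A * B

cor(Y, A)      # close to 0
cor(Y, B)      # close to 0
cor(Y, A * B)  # exactly 1
```

Each marginal correlation vanishes because $Y$ flips sign depending on the other variable, yet the engineered product recovers $Y$ perfectly.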
Best Answer
Here's an example that should be in everyone's toolbox regarding regression/decision trees, as it succinctly makes two very important points.
The vertical axis shows the response $y$. There are two predictor variables; $x_1$ ranges continuously from $-1$ to $1$. The relationship between $y$ and $x_1$ is constructed so that, no matter how you partition the $x_1$ axis, the response averages out to zero. In particular, the overall correlation between $x_1$ and $y$ is zero, or, for all intents and purposes, zero.
The other variable, $x_2$, lets you distinguish which "arm" of the X-shaped data a point belongs to.
This data has two very interesting features:
1. The true or correct decision tree for this data splits on $x_2$ first. This distinguishes the arms of the X shape, and immediately breaks the zero-correlation structure between $x_1$ and $y$ in all sub-splits.
2. A greedy algorithm will generally not find this optimal first split (you could get very lucky, but it will probably choose some noisy split on $x_1$ instead). The first split on $x_2$ does not by itself reduce the squared error; it is important only because it lets later splits exploit the association between $x_1$ and $y$.
Here is the R code I used to make this plot, so you can experiment yourself
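The code itself did not survive in this copy of the answer, so below is a minimal R reconstruction (my own sketch, not the author's original; the variable names, sample size, and noise level are assumptions) that generates X-shaped data with the properties described above:

```r
# Reconstruction sketch, not the original code from the answer.
# x2 labels the two arms of the "X"; within each arm, y tracks +/- x1,
# so the marginal correlation between x1 and y washes out to ~0.
set.seed(1)
n  <- 200
x1 <- runif(n, -1, 1)
x2 <- rbinom(n, 1, 0.5)
y  <- ifelse(x2 == 1, x1, -x1) + rnorm(n, sd = 0.1)

cor(y, x1)                    # ~ 0: a filter method would discard x1
cor(y, x2)                    # ~ 0: the x2 split alone reduces no error
cor(y[x2 == 1], x1[x2 == 1])  # strongly positive within one arm
cor(y[x2 == 0], x1[x2 == 0])  # strongly negative within the other
plot(x1, y, col = x2 + 1, pch = 19)
```

Both marginal correlations are near zero, yet conditioning on $x_2$ makes $x_1$ an almost perfect predictor within each arm, which is exactly the scenario the colleague described.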