Solved – Feature selection – correlation ranking for tree-based methods

Tags: cart, classification, correlation, feature selection, non-independent

I had a discussion with a colleague some time ago, and he stated that correlation between a predictor variable and the outcome variable is not a good measure for feature selection, especially for tree-based classification/regression methods. His justification was that a certain predictor $A$ might not be correlated with the outcome $Y$ at all, but after a split on some other predictor $B$ (remember, we are talking about tree-based methods), predictor $A$ may reveal great predictive power. As a note: it does not necessarily have to be correlation; what he actually meant is any pre-training (filter approach) measure that scores one predictor variable at a time.

At the time, I agreed with him because it seemed quite intuitive, but when I think about it now, I can't construct an example that proves the statement. When I try to create one, I get the impression that if $B$ can split $A$ in a way that makes it highly correlated with $Y$ (within the subsets defined by $B$), then $B$ itself must already be highly correlated with $Y$, so there's probably no need to use $A$ anyway. On the other hand, the statement must be true in some sense, because it is basically how tree-based methods work. So maybe the correlation can improve after a split, but only by a small amount, and both variables would have to be meaningfully correlated with $Y$ anyway (so that feature elimination due to low correlation would still make sense)?

So my main question is: is the statement true, and if it is, could you provide an example?

I have one side question as well: is it possible to combine two predictor variables that are each very weakly correlated with the outcome variable in such a way that an engineered third predictor (e.g. $C = A \cdot B$, as a silly example) is very strongly correlated with the outcome variable?
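To make that side question concrete, here is a small simulated sketch (the names $A$, $B$, $C$ and the product form are just placeholders I made up for illustration):

set.seed(1)
A <- runif(1000, -1, 1)
B <- runif(1000, -1, 1)
Y <- A * B + rnorm(1000, 0, 0.05)

cor(A, Y)      # each predictor alone is nearly uncorrelated with Y
cor(B, Y)
cor(A * B, Y)  # the engineered product C = A*B is very strongly correlated with Y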

Best Answer

Here's an example that should be in everyone's toolbox regarding regression/decision trees, as it succinctly makes two very important points.

[Figure: X-shaped data, with the arm of the X controlled by a factor]

The vertical axis shows a response $y$. There are two predictor variables; $x_1$ ranges continuously from $-1$ to $1$. The relationship between $y$ and $x_1$ is constructed so that, no matter how you partition the $x_1$ axis, the response always averages out to zero. In particular:

> cor(df$x_1, df$y)
[1] -0.001792121

Or, for all intents and purposes, zero.
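As a quick check of the "averages out to zero" claim (this sketch just reuses the df built in the code at the end of the answer), the mean of $y$ within bins of $x_1$ is essentially zero everywhere:

# mean of y within ten equal-width bins of x_1: all close to zero,
# so no partition of the x_1 axis alone explains much of y
tapply(df$y, cut(df$x_1, breaks = 10), mean)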

The other variable $x_2$ allows you to distinguish the "arm" of the $X$ shaped data.

This data has two very interesting features:

  • The true or correct decision tree for this data splits on $x_2$ first. This allows it to distinguish the arms of the $X$ shape, and immediately breaks the zero correlation structure between $x_1$ and $y$ for all sub-splits.

  • A greedy algorithm will generally not find the optimal first split (you could get very lucky, but it will probably find some noisy split on $x_1$)! The first split on $x_2$ does not itself lead to a reduction in squared error; it is only important because it lets later splits focus on the association between $x_1$ and $y$ (the rpart sketch after the plotting code below illustrates this).

Here is the R code I used to make this plot, so you can experiment yourself:

df <- data.frame(
  # x_1 sweeps from -1 to 1 and back; x_2 labels which "arm" each point is on
  x_1 = c(seq(-1, 1, length.out=250), seq(1, -1, length.out=250)),
  x_2 = rep(c(1, -1), each=250)
)
# y is x_1 on one arm and -x_1 on the other, plus a little noise
df$y <- df$x_1*df$x_2 + rnorm(500, 0, .05)

library(ggplot2)
ggplot(data=df) + geom_point(aes(x=x_1, y=y, color=factor(x_2)))
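
As a rough check of the two bullet points above (a sketch assuming the rpart package; this is not part of the original plot code):

# within each arm (defined by x_2) the zero-correlation structure disappears
cor(df$x_1[df$x_2 == 1], df$y[df$x_2 == 1])    # close to +1
cor(df$x_1[df$x_2 == -1], df$y[df$x_2 == -1])  # close to -1

# a greedy CART fit sees almost no error reduction from any single split,
# so force it to grow a little and inspect which variable it uses at the root
library(rpart)
fit <- rpart(y ~ x_1 + x_2, data = df,
             control = rpart.control(cp = 0, maxdepth = 3))
fit  # the root split is usually a noisy split on x_1, not on x_2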