Solved – Should redundant explanatory variables be discarded?

classification, machine-learning, predictive-models, regression

Suppose that we want to fit a model to predict a given response variable $Y$, and suppose that some explanatory variables are redundant. Call an explanatory variable redundant if it carries essentially the same information as another available explanatory variable. For example, if $x_1$ is a count variable, a redundant variable would be the indicator $x_2 = \mathbf{1}(x_1 \geq 1)$. If $x_3$ is a nonnegative quantitative variable, a redundant variable would be the indicator $x_4 = \mathbf{1}(x_3 > 0)$.
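To make the setup concrete, here is a minimal R sketch that simulates variables with exactly this structure; the distributions, sample size, and response model are illustrative assumptions, not part of the question:

```r
set.seed(1)
n  <- 200
x1 <- rpois(n, lambda = 1)                # count variable
x2 <- as.numeric(x1 >= 1)                 # redundant indicator of x1
x3 <- ifelse(runif(n) < 0.3, 0, rexp(n))  # nonnegative variable with mass at 0
x4 <- as.numeric(x3 > 0)                  # redundant indicator of x3
y  <- 2 * x1 + x3 + rnorm(n)              # hypothetical response
d  <- data.frame(y, x1, x2, x3, x4)
```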
My questions are:

  1. should redundant explanatory variables be discarded from the training
    set? If so, why can't we add all the explanatory variables and let
    the variable selection algorithms (for instance, forward stepwise for
    linear models, the MARS building procedure, the lasso, and CART's
    variable selection ability) choose which variables are worth keeping
    in the model and which are not? (A small lasso sketch follows this list.)
  2. if redundant explanatory variables should be removed, what is the correct way to proceed? Should I fit the model with only $x_1$ and $x_3$ (without $x_2$ and $x_4$), examine the results, then fit it with only $x_2$ and $x_4$ (without $x_1$ and $x_3$), examine those results, and finally decide which predictors to use?
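As a hedged illustration of question 1, one can hand all four predictors to the lasso and inspect which coefficients survive. This reuses the simulated data frame `d` from the sketch above, and `lambda.1se` is simply glmnet's conventional default penalty choice, not a recommendation:

```r
library(glmnet)

# Design matrix containing both the original variables and their redundant indicators
X   <- as.matrix(d[, c("x1", "x2", "x3", "x4")])
fit <- cv.glmnet(X, d$y, alpha = 1)   # alpha = 1 gives the lasso penalty
coef(fit, s = "lambda.1se")           # redundant columns are often shrunk to zero
```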

Intuitively, I don't see how the prediction error can increase by adding more explanatory variables; in other words, I don't understand why we shouldn't consider the redundant explanatory variables too.
As far as inference is concerned, if redundant predictors aren't discarded from the training set, can I get contradictory results (for example, in a linear model, a positive regression coefficient for $x_1$ but a negative coefficient for $x_2$)? If so, is this the only reason why redundant predictors should be removed?

Best Answer

Unless predictors are almost completely redundant, better predictive performance results when the competing predictors are combined than when predictors are deleted up front. When redundancy is very high, deletion up front can be a good idea. The redun function in the R Hmisc package is one approach: it measures redundancy by using a flexible additive nonlinear model to predict each variable from all of the remaining variables.
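A minimal sketch of that approach, reusing the hypothetical variables $x_1,\dots,x_4$ from the question (here assumed to sit in a data frame `d`); the $R^2$ cutoff shown is redun's default of 0.9:

```r
library(Hmisc)

# Declare a variable redundant if a flexible additive model built from the
# remaining variables predicts it with R^2 >= 0.9
r <- redun(~ x1 + x2 + x3 + x4, data = d, r2 = 0.9)
r   # prints, in order of removal, the variables deemed redundant
```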

Keep in mind that assessment of redundancy must be done using only unsupervised learning techniques, i.e., methods that make no use of the response $Y$; otherwise the screening itself becomes a form of supervised variable selection and can bias the resulting model.
