Suppose that we want to fit a model to predict a given response variable $Y$, and that some of the available explanatory variables are redundant. Call an explanatory variable redundant if it conveys essentially the same information as another available explanatory variable. For example, if $x_1$ is a count variable, a redundant variable would be the indicator $x_2 = 1(x_1 \geq 1)$. If $x_3$ is a non-negative quantitative variable, a redundant variable would be the indicator $x_4 = 1(x_3 > 0)$.
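For concreteness, the two redundancy patterns above can be sketched with simulated data (the distributions below are hypothetical choices, just to make the variables the right type):

```python
import numpy as np

rng = np.random.default_rng(0)

x1 = rng.poisson(lam=1.5, size=10)        # count variable
x2 = (x1 >= 1).astype(int)                # redundant indicator of x1
x3 = rng.exponential(scale=2.0, size=10)  # non-negative quantitative variable
x4 = (x3 > 0).astype(int)                 # redundant indicator of x3

# x2 carries strictly less information than x1: it collapses all
# positive counts to 1, so x1 determines x2 but not vice versa.
print(x1)
print(x2)
```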
My questions are:
- Should redundant explanatory variables be discarded from the training set? If so, why can't we add all the explanatory variables and let the variable selection algorithms (for instance, forward stepwise for linear models, the MARS building procedure, the lasso, and CART's variable selection ability) choose which variables are worth keeping in the model and which are not?
- If redundant explanatory variables should be removed, what is the correct way to proceed? Should I add only $x_1$ and $x_3$ (without $x_2$ and $x_4$), see the results, then add only $x_2$ and $x_4$ (without $x_1$ and $x_3$), see the results, and finally decide which predictors to use?
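To illustrate the "let the algorithm choose" option, here is a minimal sketch on hypothetical data: it includes both $x_1$ and its redundant indicator $x_2$ and lets an $\ell_1$ penalty shrink coefficients. The coordinate-descent lasso below is a hand-rolled simplification, not a packaged implementation:

```python
import numpy as np

def lasso_cd(X, y, alpha, n_iter=200):
    """Coordinate descent for (1/2n)||y - Xb||^2 + alpha*||b||_1."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # partial residual: remove all effects except column j
            r = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r / n
            denom = X[:, j] @ X[:, j] / n
            # soft-thresholding update
            beta[j] = np.sign(z) * max(abs(z) - alpha, 0.0) / denom
    return beta

rng = np.random.default_rng(1)
n = 500
x1 = rng.poisson(lam=1.5, size=n).astype(float)
x2 = (x1 >= 1).astype(float)           # redundant indicator of x1
X = np.column_stack([x1, x2])
y = 2.0 * x1 + rng.normal(size=n)      # by construction, y depends on x1 only

beta = lasso_cd(X, y, alpha=0.1)
print(beta)  # inspect which member of the redundant pair survives
```

With data generated this way, the coefficient on $x_1$ stays near its true value while the indicator's coefficient tends to be shrunk toward (or exactly to) zero, which is the behavior the question appeals to.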
Intuitively, I don't see how the prediction error can increase by adding more explanatory variables. In other words, I don't understand why we shouldn't consider redundant explanatory variables too.
As far as inference is concerned, if redundant predictors aren't discarded from the training set, can I get contradictory results (for example, in a linear model, a positive regression coefficient for $x_1$ but a negative coefficient for $x_2$)? If so, is this the only reason why redundant predictors should be removed?
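One hypothetical scenario in which opposite signs arise, and are in fact meaningful rather than contradictory: suppose the true effect is a jump at $x_1 \geq 1$ plus a mild negative trend in $x_1$. A sensible fit then assigns a positive coefficient to the indicator $x_2$ and a negative one to $x_1$ (the data-generating process below is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
x1 = rng.poisson(lam=1.5, size=n).astype(float)
x2 = (x1 >= 1).astype(float)
# True model: a jump of +3 at x1 >= 1, plus a mild negative trend in x1.
y = 3.0 * x2 - 0.5 * x1 + rng.normal(size=n)

# OLS with intercept, x1, and x2 together.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # coefficients on x1 and x2 have opposite signs by design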
Best Answer
Unless predictors are almost completely redundant, better predictive performance results when the competing predictors are combined, as compared to deleting predictors up front. When redundancy is very high, deletion up front can be a good idea. the R
Hmisc
packageredun
function is one approach, measuring redundancy using a flexible additive nonlinear model to predict each variable from all the (remaining) variables.Keep in mind that assessment of redundancy must be done using only unsupervised learning techniques.