Solved – What happens if I train a model on a data set that includes a duplicated feature

feature selection, machine learning, regression, supervised learning

The Question

Suppose I train a predictive model on a set of features $x_1, \dots, x_n$, but for some $i \neq j$ we have $x_i = x_j$ for every data point in the training set; i.e. one of these features is a totally redundant copy of the other.

What are the consequences for learning? Does it depend on whether my model is linear or nonlinear? Does it depend on my training algorithm?

More generally, what should I expect if one of the $x_i$'s is a linear combination of the other features for every point in the training set?

My thoughts so far

Suppose the true target function is a noise-free line, $y = w_1 x_1$. Then a basic linear model will learn the parameter $w_1$ exactly. Now I duplicate the feature $x_1$ by creating a copy $x_2 = x_1$. Any combination of weights in the set $\{(\hat{w}_1, \hat{w}_2): \hat{w}_1 + \hat{w}_2 = w_1\}$ will perfectly fit the data. I'm guessing that the training algorithm will influence which particular pair is chosen.
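To make this concrete, here is a minimal numpy sketch of that intuition (synthetic data and variable names are my own): with an exact duplicate column the least-squares problem has infinitely many solutions, and numpy's `lstsq` happens to return the minimum-norm one, which splits the weight evenly.

```python
import numpy as np

# Minimal sketch: a noise-free line y = 2*x1, with x2 an exact copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=(100, 1))
X = np.hstack([x1, x1])           # duplicated feature -> rank-deficient design matrix
y = 2.0 * x1.ravel()

# lstsq handles the rank deficiency and returns the minimum-norm solution,
# which splits the true weight evenly between the two copies.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)                          # ~[1.0, 1.0]: any pair summing to 2.0 fits perfectly
print(np.allclose(X @ w, y))      # True: predictions are unaffected by the split
```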

Best Answer

This is an instance of (multi)collinearity.

As you've noted in "my thoughts so far," ordinary linear regression can "distribute" its weight between the two columns however it likes without changing the objective. Fits that actually invert a matrix (e.g. solving the normal equations) may fail outright, because $X^\top X$ is singular; if you fit by a general-purpose optimization algorithm, you'll get some arbitrary split between the two. This can cause serious problems if you're trying to infer regression weights, but if you just want a predictive model it doesn't really matter (unless your matrix inverse fails).
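As a small illustration of the "matrix inverse may fail" point (same synthetic setup as above): solving the normal equations $X^\top X w = X^\top y$ directly breaks down because $X^\top X$ is singular, while a pseudo-inverse (or an iterative optimizer) still returns *a* solution, just one particular split of the weight.

```python
import numpy as np

# Sketch: X^T X is singular when a column is duplicated, so the textbook
# normal-equations solve breaks down.
rng = np.random.default_rng(0)
x1 = rng.normal(size=(100, 1))
X = np.hstack([x1, x1])
y = 2.0 * x1.ravel()

try:
    w = np.linalg.solve(X.T @ X, X.T @ y)    # explicit normal equations
    print("solve returned:", w)              # if it doesn't raise, the result is unreliable
except np.linalg.LinAlgError as err:
    print("normal equations failed:", err)   # singular matrix

# A pseudo-inverse (like an iterative solver) still produces a solution:
# here the minimum-norm split of the weight.
w = np.linalg.pinv(X) @ y
print(w)                                     # ~[1.0, 1.0]
```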

If you use a regularized linear model, say ridge regression or the LASSO, then how the weight is divided between the two copies is determined by the penalty: the L2 penalty of ridge prefers an even split, while the L1 penalty of the LASSO is indifferent to the split (for same-signed weights), so the solver picks one arbitrarily.
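A quick scikit-learn sketch of that behaviour (the alpha values are arbitrary, chosen only for illustration): ridge splits the weight roughly evenly, while LASSO's coordinate-descent solver typically puts almost all of it on one copy.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Same synthetic setup: y = 2*x1, with x2 an exact copy of x1.
rng = np.random.default_rng(0)
x1 = rng.normal(size=(200, 1))
X = np.hstack([x1, x1])
y = 2.0 * x1.ravel()

ridge = Ridge(alpha=0.01).fit(X, y)
lasso = Lasso(alpha=0.01).fit(X, y)
print(ridge.coef_)   # roughly [1.0, 1.0]: the L2 penalty prefers an even split
print(lasso.coef_)   # typically ~[2.0, 0.0]: L1 is indifferent to the split,
                     # so the solver's path ends up concentrating it on one copy
```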

If you're fitting a decision tree, it'll split on one copy or the other arbitrarily, and it won't really matter unless you're looking at feature importances, where the duplication will mislead you in the same way it misleads weight inference in linear models.
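For example (a rough sketch with scikit-learn's `DecisionTreeRegressor`; the exact attribution depends on tie-breaking and the random seed), the reported importances get assigned to one copy or divided arbitrarily between the two, rather than reflecting a single underlying signal.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Sketch: both columns give identical split improvements, so the tree's
# tie-breaking decides which copy "gets credit" at each node.
rng = np.random.default_rng(0)
x1 = rng.normal(size=(500, 1))
X = np.hstack([x1, x1])
y = 2.0 * x1.ravel()

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
print(tree.feature_importances_)   # importance lands on one copy or is split
                                   # arbitrarily between them, not "half each" by design
```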

If you're doing a method based on distances between points, e.g. nearest neighbors, kernel smoothing, or many types of Gaussian processes, then it'll effectively make that component "count more" in the distance.
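A tiny numpy illustration of this: duplicating a column doubles its contribution to the squared Euclidean distance, which is what "counting more" means for a nearest-neighbour or kernel method.

```python
import numpy as np

# Two points that differ only in feature x1.
a = np.array([1.0, 5.0])             # features (x1, x2)
b = np.array([2.0, 5.0])
print(np.sum((a - b) ** 2))          # 1.0: only x1 contributes

# Duplicate x1 as a third column: its difference is now counted twice.
a_dup = np.array([1.0, 5.0, 1.0])
b_dup = np.array([2.0, 5.0, 2.0])
print(np.sum((a_dup - b_dup) ** 2))  # 2.0: x1 effectively carries double weight
```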
