Solved – Linear regression for feature selection

feature selection, regression, residuals

Imagine we regress y on x1 through x4. Now we want to find out whether x5 is a stronger predictor than x6 (given the other variables). Note that all variables are scaled.

Would it be okay to use the residuals to see which one would be a stronger predictor?

# Simulated data: a response and six candidate predictors, all scaled
y <- scale(rnorm(1000))
x <- scale(replicate(6, rnorm(1000)))

# Method 1: regress y on x1-x4, then regress the residuals on each
# candidate separately (no intercept, since everything is centred)
res <- lm(y ~ x[, 1:4])$residuals
lm(res ~ x[, 5] - 1)
lm(res ~ x[, 6] - 1)

The goal here is to identify which variable is the stronger predictor, taking the other variables into account. As far as I can see, this indeed gives different results from simply correlating x5 and x6 with y in turn (method 2).
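For completeness, method 2 is just the marginal correlation of each candidate with y, ignoring x1 to x4; a minimal sketch using the simulated y and x from above:

# Method 2: marginal correlation of each candidate with y
cor(x[, 5], y)
cor(x[, 6], y)
# or both at once, as reported in the results table below
cor(x[, 5:6], y)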

The benefit of doing it this way is that, with a large number of predictors, it is computationally cheaper than refitting the whole model.

Also, the results still differ a bit from computing everything at once, i.e. fitting lm(y ~ x[,1:5]) and lm(y ~ x[,c(1:4,6)]) separately (method 3).
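Spelled out, method 3 refits the model once per candidate and reads off that candidate's coefficient (the candidate is the last column, hence the last coefficient); a short sketch of what is being compared:

# Method 3: add each candidate to x1-x4, refit, and take its coefficient
tail(coef(lm(y ~ x[, 1:5])), 1)         # coefficient of x5
tail(coef(lm(y ~ x[, c(1:4, 6)])), 1)   # coefficient of x6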

method                        x5            x6
1: explain residuals     -0.003126777  -0.008349196
2: cor(x[,5:6], y)       -0.003499607  -0.006773532
3: explain at once       -0.003137124  -0.008407007

So: is there any kind of shortcut that could produce the method-3 coefficients without having to fit the large model?

What would be the advice for feature selection? Is explaining the residuals a good approximation of how good the model would be if x5 or x6 were included from the start?

Added some benchmark results (10000 × 1002 predictor matrix; x1001 and x1002 are the candidates):

              x1001      x1002     time taken
method1     -0.01515   -0.00967       16s  
method2     -0.01690   -0.01170    0.001s
method3     -0.01689   -0.01068       32s
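The benchmark code itself is not shown; a hypothetical reconstruction of how such timings could be produced (the dimensions come from the table above, system.time() is one way to measure, and all names here are illustrative):

# Hypothetical benchmark setup, not the original code
n <- 10000; p <- 1002
X <- scale(matrix(rnorm(n * p), n, p))
y <- scale(rnorm(n))

system.time({                                  # method 1: residual regressions
  res <- lm(y ~ X[, 1:1000])$residuals
  lm(res ~ X[, 1001] - 1)
  lm(res ~ X[, 1002] - 1)
})
system.time(cor(X[, 1001:1002], y))            # method 2: plain correlations
system.time({                                  # method 3: two full refits
  lm(y ~ X[, 1:1001])
  lm(y ~ X[, c(1:1000, 1002)])
})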

This might actually suggest that cor() is good enough. Or does this have to do with the fact that all the x's here are independent of each other, while in reality that is most likely not the case?

Best Answer

To answer this question of yours: "The goal here is to identify which variable is a stronger predictor"

I think you can play around with the MSE and the variables to find the most important variable. Here are the steps I would follow (a short R sketch is given after the list):

a. Fit the model with all variables. You will obtain the weights and an MSE.

b. Remove the variables one by one and refit the model each time. You will obtain new weights and a new MSE.

c. Find the model which resulted in the lowest MSE.

d. Since all variables are scaled, the variable with the largest weight has the highest importance. This final step answers this question of yours: "What would be the advice for feature selection?"

  1. Such an analysis can be performed easily with the MATLAB Curve Fitting Toolbox.
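A minimal R sketch of steps (a) to (c), using the question's x and y; the helper names (mse, mse_drop) are illustrative, not from the answer:

# (a) fit the full model and record its MSE
mse      <- function(fit) mean(residuals(fit)^2)
mse_full <- mse(lm(y ~ x))

# (b) drop each variable in turn and refit
mse_drop <- sapply(seq_len(ncol(x)), function(j) mse(lm(y ~ x[, -j])))

# (c) the refit with the lowest MSE; a large increase in MSE when a
#     variable is dropped marks that variable as important
which.min(mse_drop)
mse_drop - mse_full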

Since approach (1) is expensive when the number of variables is high, there are other techniques which give you the importance of the variables (i.e. a ranking), such as the following (a small R sketch of the correlation-based filters comes after the list):

  1. Use neural networks and then read off the "weights" of the neurons.
  2. (Almost the same as neural networks) Self-Organizing Maps (SOM), then read off the "weights" of the neurons.
  3. RReliefF is another option.
  4. Pearson correlation
  5. Spearman correlation
  6. Kendall correlation
  7. Mutual information, etc.
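For the correlation-based filters (items 4 to 6), ranking candidates is essentially a one-liner in R; a quick sketch on the question's x and y, with rank_by as a made-up helper name:

# Filter-style ranking: order predictors by |correlation| with y
rank_by <- function(method) order(abs(cor(x, y, method = method)), decreasing = TRUE)
rank_by("pearson")
rank_by("spearman")
rank_by("kendall")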