Reducing the number of variables in a multiple regression

feature selection, model selection, multiple regression, multivariate analysis, regression

I have a large data set consisting of the values of several hundred financial variables that could be used in a multiple regression to predict the behavior of an index fund over time. I would like to reduce the number of variables to ten or so while still retaining as much predictive power as possible. Added: The reduced set of variables needs to be a subset of the original variable set in order to preserve the economic meaning of the original variables. Thus, for example, I shouldn't end up with linear combinations or aggregates of the original variables.

Some (probably naive) thoughts on how to do this:

  1. Perform a simple linear regression with each variable and choose the ten with the largest $R^2$ values. Of course, there's no guarantee that the ten best individual variables combined would be the best group of ten.
  2. Perform a principal components analysis and try to find the ten original variables with the largest associations with the first few principal axes.
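
To illustrate approach 1 and its caveat, here is a minimal sketch on synthetic data. The variable names (`a`, `a_copy`, `b`, `noise`) and the helpers `r_squared` and `top_k_by_univariate_r2` are hypothetical, made up for this example. It uses the fact that the $R^2$ of a simple linear regression equals the squared Pearson correlation.

```python
import random

def r_squared(x, y):
    """R^2 of a simple linear regression of y on x (squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column explains nothing
    return sxy * sxy / (sxx * syy)

def top_k_by_univariate_r2(columns, y, k):
    """Rank variables by their individual R^2 and return the names of the k best."""
    scores = {name: r_squared(x, y) for name, x in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Synthetic data (an assumption for illustration only):
# y depends on "a" and "b"; "a_copy" is a near-duplicate of "a".
rng = random.Random(0)
n = 500
a = [rng.gauss(0, 1) for _ in range(n)]
b = [rng.gauss(0, 1) for _ in range(n)]
a_copy = [v + rng.gauss(0, 0.05) for v in a]
noise = [rng.gauss(0, 1) for _ in range(n)]
y = [2 * ai + bi + rng.gauss(0, 0.5) for ai, bi in zip(a, b)]

columns = {"a": a, "a_copy": a_copy, "b": b, "noise": noise}
top2 = top_k_by_univariate_r2(columns, y, 2)
print(top2)
```

In this setup the two top-ranked variables are `a` and its near-duplicate, so the pair carries almost no more information than `a` alone, while `b` is left out. That is the "best individuals are not the best group" problem in miniature.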

I don't think I can perform a hierarchical regression because the variables aren't really nested. Trying all possible combinations of ten variables is computationally infeasible: with several hundred candidates, the number of ten-variable subsets is far too large to enumerate.
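
To put a rough number on that, the count of ten-variable subsets is a binomial coefficient. The sketch below assumes 300 candidate variables as a stand-in for "several hundred" and a hypothetical budget of one million model fits per second; both figures are illustrative assumptions, not properties of the actual data set.

```python
import math

n_candidates = 300   # assumed stand-in for "several hundred"
subset_size = 10

# Number of distinct ten-variable subsets: C(300, 10)
n_subsets = math.comb(n_candidates, subset_size)
print(f"{n_subsets:.3e} subsets of size {subset_size}")

# Hypothetical budget: one million regressions fitted per second
seconds = n_subsets / 1e6
years = seconds / (365.25 * 24 * 3600)
print(f"about {years:.1e} years at one million fits per second")
```

Under these assumptions the count is on the order of $10^{18}$, which is why exhaustive search is off the table.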

This seems like a sufficiently common problem that there would be a standard approach. Is there one for reducing the number of variables in a multiple regression?

A very helpful answer would be one that not only mentions a standard method but also gives an overview of how and why it works. Alternatively, if there is no one standard approach but rather multiple ones with different strengths and weaknesses, a very helpful answer would be one that discusses their pros and cons.

whuber's comment below indicates that the request in the last paragraph is too broad. Instead, I would accept as a good answer a list of the major approaches, perhaps with a very brief description of each. Once I have the terms I can dig up the details on each myself.

Best Answer

This problem is usually called subset selection, and there are quite a few different approaches. See Google Scholar for an overview of related articles.
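
As one concrete example of a subset selection approach, here is a minimal sketch of greedy forward stepwise selection. The answer above does not single out any method, so this is only an illustration. The helper `forward_select` and the synthetic variables are hypothetical. At each step it adds the variable that most reduces the residual sum of squares of a least-squares fit with intercept, then orthogonalises the remaining variables against the chosen one. The result is always a subset of the original variables, but being greedy it is not guaranteed to find the best subset of a given size.

```python
import random

def _center(v):
    m = sum(v) / len(v)
    return [a - m for a in v]

def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def forward_select(columns, y, k, tol=1e-12):
    """Greedy forward selection of up to k variable names.

    Centering y and every column accounts for the intercept. `tol` is an
    absolute threshold for skipping columns that are (numerically) already
    explained by the selected ones; it depends on the scale of the data.
    """
    resid = _center(y)
    remaining = {name: _center(x) for name, x in columns.items()}
    selected = []
    while remaining and len(selected) < k:
        best, best_gain = None, 0.0
        for name, x in remaining.items():
            xx = _dot(x, x)
            if xx <= tol:
                continue
            # RSS reduction from adding x, which is already orthogonal
            # to every previously selected variable.
            gain = _dot(resid, x) ** 2 / xx
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:
            break
        q = remaining.pop(best)
        qq = _dot(q, q)
        selected.append(best)
        # Remove the chosen direction from the residual and from the
        # remaining candidates.
        c = _dot(resid, q) / qq
        resid = [r - c * qi for r, qi in zip(resid, q)]
        for name in list(remaining):
            x = remaining[name]
            c = _dot(x, q) / qq
            remaining[name] = [xi - c * qi for xi, qi in zip(x, q)]
    return selected

# Synthetic data (an assumption for illustration only):
# y depends on "a" and "b"; "a_copy" is a near-duplicate of "a".
rng = random.Random(0)
n = 500
a = [rng.gauss(0, 1) for _ in range(n)]
b = [rng.gauss(0, 1) for _ in range(n)]
a_copy = [v + rng.gauss(0, 0.05) for v in a]
noise = [rng.gauss(0, 1) for _ in range(n)]
y = [2 * ai + bi + rng.gauss(0, 0.5) for ai, bi in zip(a, b)]

columns = {"a": a, "a_copy": a_copy, "b": b, "noise": noise}
chosen = forward_select(columns, y, 2)
print(chosen)
```

On this synthetic data the second pick is `b` rather than the redundant near-duplicate, because once one copy of `a` is in the model the other copy explains almost none of the remaining residual. For real data with hundreds of variables, an established numerical library would be a better choice than this pure-Python loop.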
