Solved – Is variable selection for predictive modeling really needed in 2016?

feature selection, machine learning, model selection, modeling, prediction

This question was asked on CV some years ago; it seems worth a repost in light of (1) order-of-magnitude improvements in computing technology (e.g., parallel computing, HPC, etc.) and (2) newer techniques, e.g., [3].

First, some context. Let's assume the goal is not hypothesis testing or effect estimation, but prediction on an unseen test set, so no weight is given to interpretability. Second, let's say you cannot rule out the relevance of any predictor on subject-matter grounds, i.e., they all seem plausible individually or in combination with other predictors. Third, you are confronted with (hundreds of) millions of predictors. Fourth, let's say you have access to AWS with an unlimited budget, so computing power is not a constraint.

The usual reasons for variable selection are (1) efficiency: it is faster to fit a smaller model and cheaper to collect fewer predictors; and (2) interpretation: knowing the "important" variables gives insight into the underlying process [1].

It's now widely known that many variable selection methods are ineffective and often outright dangerous (e.g. forward stepwise regression) [2].

Second, if the chosen model is any good, one shouldn't need to cut down the list of predictors at all; the model should do it for you. A good example is the lasso, which assigns a zero coefficient to every irrelevant variable.
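As a concrete illustration of that point, here is a minimal sketch (my own, not from the original question) of the lasso zeroing out irrelevant predictors; the synthetic data, the scikit-learn estimator, and all parameter values are illustrative assumptions only:

    # Minimal sketch: lasso as built-in variable selection on synthetic data.
    # Assumes numpy and scikit-learn are available; all settings are illustrative.
    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import LassoCV

    # 1,000 observations, 200 candidate predictors, only 10 truly informative.
    X, y = make_regression(n_samples=1000, n_features=200, n_informative=10,
                           noise=5.0, random_state=0)

    # Cross-validation chooses the penalty strength; irrelevant predictors
    # end up with coefficients that are exactly zero.
    model = LassoCV(cv=5, random_state=0).fit(X, y)
    print("predictors retained:", int(np.sum(model.coef_ != 0)), "of", X.shape[1])

In a run like this, the lasso typically keeps only a small fraction of the 200 candidates, so the "selection" falls out of the fit itself rather than from a separate screening step.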

I'm aware that some people advocate using an "elephant" model, i.e., tossing every conceivable predictor into the fit and running with it [2].

Is there any fundamental reason to do variable selection if the goal is predictive accuracy?

[1] Reunanen, J. (2003). Overfitting in making comparisons between variable selection methods. The Journal of Machine Learning Research, 3, 1371-1382.

[2] Harrell, F. (2015). Regression modeling strategies: with applications to linear models, logistic and ordinal regression, and survival analysis. Springer.

[3] Taylor, J., & Tibshirani, R. J. (2015). Statistical learning and selective inference. Proceedings of the National Academy of Sciences, 112(25), 7629-7634.

[4] Zhou, J., Foster, D., Stine, R., & Ungar, L. (2005, August). Streaming feature selection using alpha-investing. In Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining (pp. 384-393). ACM.

Best Answer

There have been rumors for years that Google uses all available features in building its predictive algorithms. To date, however, no disclaimers, explanations, or white papers have emerged that clarify or dispute this rumor. Not even their published patents help. As a result, to the best of my knowledge, no one external to Google knows what they are doing.

/* Update, September 2019: a Google TensorFlow evangelist went on record in a presentation, stating that Google engineers regularly evaluate over 5 billion parameters for the current version of PageRank. */

As the OP notes, one of the biggest problems in predictive modeling is the conflation of classic hypothesis testing and careful model specification with pure data mining. The classically trained can get quite dogmatic about the need for "rigor" in model design and development. The fact is that, when confronted with massive numbers of candidate predictors and multiple possible targets or dependent variables, the classic framework neither works, holds, nor provides useful guidance. Numerous recent papers delineate this dilemma, from Chattopadhyay and Lipson's brilliant paper Data Smashing: Uncovering Lurking Order in Data (http://rsif.royalsocietypublishing.org/content/royinterface/11/101/20140826.full.pdf):

"The key bottleneck is that most data comparison algorithms today rely on a human expert to specify what 'features' of the data are relevant for comparison. Here, we propose a new principle for estimating the similarity between the sources of arbitrary data streams, using neither domain knowledge nor learning."

to last year's AER paper on Prediction Policy Problems by Kleinberg et al. (https://www.aeaweb.org/articles?id=10.1257/aer.p20151023), which makes the case for data mining and prediction as useful tools in economic policy making, citing instances where "causal inference is not central, or even necessary."

The bigger, $64,000 question, though, is the broad shift in thinking and the challenges to the classic hypothesis-testing framework implicit in, e.g., this Edge.org symposium on "obsolete" scientific thinking (https://www.edge.org/responses/what-scientific-idea-is-ready-for-retirement), as well as in a recent article by Eric Beinhocker on the "new economics," which presents some radical proposals for integrating widely different disciplines such as behavioral economics, complexity theory, predictive model development, and network and portfolio theory as a platform for policy implementation and adoption (https://evonomics.com/the-deep-and-profound-changes-in-economics-thinking/). Needless to say, these issues go far beyond merely economic concerns and suggest that we are undergoing a fundamental shift in scientific paradigms. The shifting views are as fundamental as the distinction between reductionistic, Occam's-Razor-style model building and Epicurus' expansive Principle of Plenitude, or multiple explanations, which roughly states that if several findings explain something, retain them all (https://en.wikipedia.org/wiki/Principle_of_plenitude).

Of course, people like Beinhocker are entirely unencumbered by the practical, in-the-trenches concerns of applied statistical solutions to this evolving paradigm. With respect to the nitty-gritty questions of ultra-high-dimensional variable selection, the OP is relatively nonspecific about viable approaches to model building that might leverage, e.g., the lasso, LAR, stepwise algorithms, or "elephant models" that use all of the available information. The reality is that, even with AWS or a supercomputer, you cannot use all of the available information at the same time; there simply isn't enough RAM to load it all in. What does this mean? Workarounds have been proposed, from the NSF workshop Discovery in Complex or Massive Datasets: Common Statistical Themes to "divide and conquer" algorithms for massive data mining, e.g., Wang et al.'s paper Statistical Methods and Computing for Big Data (http://arxiv.org/pdf/1502.07989.pdf) as well as Leskovec et al.'s book Mining of Massive Datasets (http://www.amazon.com/Mining-Massive-Datasets-Jure-Leskovec/dp/1107077230/ref=sr_1_1?ie=UTF8&qid=1464528800&sr=8-1&keywords=Mining+of+Massive+Datasets).

There are now literally hundreds, if not thousands, of papers that deal with various aspects of these challenges, proposing widely differing analytic engines at their core: "divide and conquer" algorithms; unsupervised "deep learning" models; random matrix theory applied to massive covariance estimation; Bayesian tensor models; classic supervised logistic regression; and more. Fifteen years or so ago, the debate largely focused on the relative merits of hierarchical Bayesian solutions versus frequentist finite mixture models. In a paper addressing these issues, Ainslie et al. (http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.197.788&rep=rep1&type=pdf) concluded that the differing theoretical approaches produced largely equivalent results in practice, with the exception of problems involving sparse and/or high-dimensional data, where HB models had the advantage. Today, with the advent of D&C workarounds, whatever arbitrage HB models may have historically enjoyed is being eliminated.

The basic logic of these D&C workarounds is, by and large, an extension of Breiman's famous random forest technique, which relies on bootstrapped resampling of observations and features. Breiman did his work in the late 1990s on a single CPU, when massive data meant a few dozen gigabytes and a couple of thousand features. On today's massively parallel, multi-core platforms, it is possible to run algorithms that analyze terabytes of data containing tens of millions of features, building millions of "RF"-style mini-models in a few hours.
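To make that divide-and-conquer logic concrete, here is a toy sketch of the idea: each mini-model is fit on a bootstrap sample of rows and a random subset of features, and predictions are averaged. The ridge base learner, the chunking scheme, and all parameter values are my own illustrative assumptions, not any specific published algorithm:

    # Toy divide-and-conquer sketch: many small models on random row/feature
    # subsets, with predictions averaged. Illustrative only.
    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)

    def fit_mini_models(X, y, n_models=100, row_frac=0.1, n_feats=50):
        models = []
        n, p = X.shape
        for _ in range(n_models):
            rows = rng.choice(n, size=int(row_frac * n), replace=True)   # bootstrap rows
            feats = rng.choice(p, size=min(n_feats, p), replace=False)   # feature subset
            m = Ridge(alpha=1.0).fit(X[np.ix_(rows, feats)], y[rows])
            models.append((feats, m))
        return models

    def predict(models, X_new):
        # Each mini-model scores with its own feature subset; average the results.
        preds = [m.predict(X_new[:, feats]) for feats, m in models]
        return np.mean(preds, axis=0)

On a real cluster the loop over mini-models would be distributed across workers, which is exactly where the massively parallel platforms mentioned above earn their keep.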

There are any number of important questions coming out of all of this. One has to do with a possible loss of precision due to the approximating nature of these workarounds. This issue has been addressed by Chen and Xie in their paper A Split-and-Conquer Approach for Analysis of Extraordinarily Large Data (http://dimacs.rutgers.edu/TechnicalReports/TechReports/2012/2012-01.pdf), where they conclude that the approximations are practically indistinguishable from the "full information" models.

A second concern, which to the best of my knowledge hasn't been adequately addressed in the literature, has to do with what is done with the results (i.e., the "parameters") from potentially millions of predictive mini-models once the workarounds have been rolled up and summarized. In other words, how does one execute something as simple as "scoring" new data with these results? Are the mini-model coefficients to be saved and stored, or does one simply rerun the D&C algorithm on new data?
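One simple possibility, sketched below under the (strong) assumption that every mini-model is a linear model on the same feature space, is to roll the chunk-level fits up into a single stored parameter vector, loosely in the spirit of Chen and Xie's split-and-conquer combination, and score new data from those stored parameters alone. The unweighted average used here is a deliberate simplification of their combination rule, and the function names are hypothetical:

    # Illustrative sketch: store pooled coefficients from per-chunk fits and
    # score new data without rerunning the whole D&C pipeline.
    import numpy as np
    from sklearn.linear_model import LinearRegression

    def pool_coefficients(chunks):
        # Fit one linear model per (X, y) chunk and average the parameters.
        coefs, intercepts = [], []
        for X_chunk, y_chunk in chunks:
            m = LinearRegression().fit(X_chunk, y_chunk)
            coefs.append(m.coef_)
            intercepts.append(m.intercept_)
        return np.mean(coefs, axis=0), float(np.mean(intercepts))

    def score(X_new, pooled_coef, pooled_intercept):
        # Scoring needs only the stored parameters, not the original chunks.
        return X_new @ pooled_coef + pooled_intercept

Under this simplified scheme, the answer to the question above is "save the pooled coefficients"; rerunning the D&C algorithm on new data would only be needed when the mini-models cannot be summarized this compactly.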

In his book Numbers Rule Your World, Kaiser Fung describes the dilemma Netflix faced when presented with an ensemble of only 104 models handed over by the winners of its competition. The winners had indeed minimized the MSE relative to all other competitors, but this translated into only a several-decimal-place improvement in accuracy on the 5-point, Likert-type rating scale used by the movie recommender system. In addition, the IT maintenance required for this ensemble of models would have cost much more than any savings from the "improvement" in model accuracy.

Then there is the whole question of whether "optimization" is even possible with information of this magnitude. For instance, Emmanuel Derman, the physicist and financial engineer, suggests in his book My Life as a Quant that optimization is an unsustainable myth, at least in financial engineering.

Finally, important questions concerning relative feature importance with massive numbers of features have yet to be addressed.

There are no easy answers to questions about the need for variable selection, and the new challenges opened up by the current, Epicurean workarounds remain to be resolved. The bottom line is that we are all data scientists now.

*** EDIT: References ***

  1. Chattopadhyay I, Lipson H. 2014 Data smashing: uncovering lurking order in data. J. R. Soc. Interface 11: 20140826. http://dx.doi.org/10.1098/rsif.2014.0826

  2. Kleinberg, Jon, Jens Ludwig, Sendhil Mullainathan and Ziad Obermeyer. 2015. "Prediction Policy Problems." American Economic Review, 105(5): 491-95. DOI: 10.1257/aer.p20151023

  3. Edge.org, 2014 Annual Question : WHAT SCIENTIFIC IDEA IS READY FOR RETIREMENT? https://www.edge.org/responses/what-scientific-idea-is-ready-for-retirement

  4. Eric Beinhocker, How the Profound Changes in Economics Make Left Versus Right Debates Irrelevant, 2016, Evonomics.org. https://evonomics.com/the-deep-and-profound-changes-in-economics-thinking/

  5. Epicurus' Principle of Multiple Explanations: keep all models that are consistent with the data. https://www.coursehero.com/file/p6tt7ej/Epicurus-Principle-of-Multiple-Explanations-Keep-all-models-that-are-consistent/

  6. NSF, Discovery in Complex or Massive Datasets: Common Statistical Themes, A Workshop funded by the National Science Foundation, October 16-17, 2007 https://www.nsf.gov/mps/dms/documents/DiscoveryInComplexOrMassiveDatasets.pdf

  7. Statistical Methods and Computing for Big Data, Working Paper by Chun Wang, Ming-Hui Chen, Elizabeth Schifano, Jing Wu, and Jun Yan, October 29, 2015 http://arxiv.org/pdf/1502.07989.pdf

  8. Jure Leskovec, Anand Rajaraman, Jeffrey David Ullman, Mining of Massive Datasets, Cambridge University Press; 2 edition (December 29, 2014) ISBN: 978-1107077232

  9. Large Sample Covariance Matrices and High-Dimensional Data Analysis (Cambridge Series in Statistical and Probabilistic Mathematics), by Jianfeng Yao, Shurong Zheng, Zhidong Bai, Cambridge University Press; 1 edition (March 30, 2015) ISBN: 978-1107065178

  10. Rick L. Andrews, Andrew Ainslie, and Imran S. Currim, An Empirical Comparison of Logit Choice Models with Discrete Versus Continuous Representations of Heterogeneity, Journal of Marketing Research, Vol. XXXIX (November 2002), 479-487. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.197.788&rep=rep1&type=pdf

  11. A Split-and-Conquer Approach for Analysis of Extraordinarily Large Data, Xueying Chen and Minge Xie, DIMACS Technical Report 2012-01, January 2012 http://dimacs.rutgers.edu/TechnicalReports/TechReports/2012/2012-01.pdf

  12. Kaiser Fung, Numbers Rule Your World: The Hidden Influence of Probabilities and Statistics on Everything You Do, McGraw-Hill Education; 1 edition (February 15, 2010) ISBN: 978-0071626538

  13. Emmanuel Derman, My Life as a Quant: Reflections on Physics and Finance, Wiley; 1 edition (January 11, 2016) ISBN: 978-0470192733

* Update in November 2017 *

Nathan Kutz's 2013 book, Data-Driven Modeling & Scientific Computation: Methods for Complex Systems & Big Data, is a mathematical and PDE-focused excursion into variable selection as well as dimension-reduction methods and tools. An excellent one-hour introduction to his thinking can be found in his June 2017 YouTube talk, Data Driven Discovery of Dynamical Systems and PDEs, in which he references the latest developments in the field. https://www.youtube.com/watch?feature=youtu.be&v=Oifg9avnsH4&app=desktop
