What are the practical and interpretation differences between alternatives to logistic regression?

hypothesis-testing, logistic, r, random-forest

A recent question about alternatives to logistic regression in R yielded a variety of answers, including randomForest, gbm, rpart, bayesglm, and generalized additive models. What are the practical and interpretation differences between these methods and logistic regression? What assumptions do they make (or not make) relative to logistic regression? Are they suitable for hypothesis testing? Etc.

Best Answer

Disclaimer: It is certainly far from being a full answer to the question!

I think there are at least two levels to consider before drawing distinctions among all these methods:

  • whether a single model is fitted or not: This helps contrast methods like logistic regression with RF or gradient boosting (or, more generally, ensemble methods), and also puts the emphasis on parameter estimation (with associated asymptotic or bootstrap confidence intervals) vs. classification or prediction accuracy (see the sketch after this list);
  • whether all variables are considered or not: This is the basis of feature selection, in the sense that penalization or regularization allows us to cope with "irregular" data sets (e.g., large $p$ and/or small $n$) and improves the generalizability of the findings.
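To make both levels concrete, here is a minimal R sketch on simulated data (the data-generating setup and all tuning values are illustrative assumptions, not part of the original discussion; randomForest is named in the question, while glmnet is my own addition for the penalized case): a single logistic model yields interpretable coefficients with confidence intervals, a random forest is judged by its out-of-bag accuracy, and an L1-penalized logistic regression performs implicit feature selection.

    library(randomForest)
    library(glmnet)

    set.seed(101)
    n <- 200; p <- 10
    X <- matrix(rnorm(n * p), n, p)
    colnames(X) <- paste0("x", 1:p)
    y <- rbinom(n, 1, plogis(0.8 * X[, 1] - 0.5 * X[, 2]))  # only x1 and x2 matter
    dat <- data.frame(y = factor(y), X)

    ## Single parametric model: interpretable coefficients with Wald CIs
    fit.glm <- glm(y ~ ., data = dat, family = binomial)
    confint.default(fit.glm)

    ## Ensemble: no comparable parameter table; judged by out-of-bag accuracy
    fit.rf <- randomForest(y ~ ., data = dat)
    fit.rf  # printing the fit shows the OOB error estimate

    ## Penalized logistic regression: implicit feature selection via the lasso
    fit.lasso <- cv.glmnet(X, y, family = "binomial")
    coef(fit.lasso, s = "lambda.min")  # many coefficients shrunk exactly to zero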

Here are a few other points that I think are relevant to the question.

When we consider several models--the same model fitted on different subsets (of individuals and/or variables) of the available data, or different competing models fitted on the same data set--cross-validation can be used to avoid overfitting and to perform model or feature selection, although CV is not limited to these particular cases (it can be used with GAMs or penalized GLMs, for instance). Also, there is the traditional interpretation issue: more complex models often imply more complex interpretation (more parameters, more stringent assumptions, etc.).
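As a hedged illustration of CV for comparing competing models, here is a minimal 5-fold cross-validation loop contrasting logistic regression with a single rpart tree (rpart is named in the question), reusing the simulated `dat` from the sketch above; the fold count and the 0.5 classification threshold are illustrative assumptions.

    library(rpart)
    set.seed(102)
    K <- 5
    fold <- sample(rep(1:K, length.out = nrow(dat)))
    err <- matrix(NA, K, 2, dimnames = list(NULL, c("glm", "rpart")))
    for (k in 1:K) {
      train <- dat[fold != k, ]
      test  <- dat[fold == k, ]
      p.glm <- predict(glm(y ~ ., data = train, family = binomial),
                       test, type = "response")
      p.rp  <- predict(rpart(y ~ ., data = train), test)[, "1"]
      err[k, ] <- c(mean((p.glm > 0.5) != (test$y == "1")),
                    mean((p.rp  > 0.5) != (test$y == "1")))
    }
    colMeans(err)  # cross-validated misclassification rates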

Gradient boosting and RFs overcome the limitations of a single decision tree thanks to boosting, whose main idea is to combine the output of several weak learners to build a more accurate and stable decision rule, and bagging, where results are "averaged" over resampled data sets. Altogether, they are often viewed as black boxes in comparison with more "classical" models for which a clear specification is provided (I can think of three classes of models: parametric, semi-parametric, non-parametric), but I think the discussion held in this other thread, The Two Cultures: statistics vs. machine learning?, provides interesting viewpoints.
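For a concrete feel for the boosting side, here is a minimal gbm sketch (gbm is one of the packages named in the question; the tuning values below are illustrative assumptions), again reusing the simulated data. Even for such a "black box", the relative-influence summary gives a rough, non-parametric analogue of variable importance.

    library(gbm)
    dat.num <- transform(dat, y = as.numeric(y) - 1)  # gbm's "bernoulli" wants a 0/1 response
    fit.gbm <- gbm(y ~ ., data = dat.num, distribution = "bernoulli",
                   n.trees = 500, interaction.depth = 2, shrinkage = 0.01,
                   cv.folds = 5)
    best.iter <- gbm.perf(fit.gbm, method = "cv")  # CV-chosen number of trees
    summary(fit.gbm, n.trees = best.iter)          # relative influence of each variable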

Here are a couple of papers about feature selection and some ML techniques:

  1. Saeys, Y, Inza, I, and Larrañaga, P. A review of feature selection techniques in bioinformatics, Bioinformatics (2007) 23(19):2507-2517.
  2. Dougherty, ER, Hua, J, and Sima, C. Performance of Feature Selection Methods, Current Genomics (2009) 10(6):365-374.
  3. Boulesteix, A-L and Strobl, C. Optimal classifier selection and negative bias in error rate estimation: an empirical study on high-dimensional prediction, BMC Medical Research Methodology (2009) 9:85.
  4. Caruana, R and Niculescu-Mizil, A. An Empirical Comparison of Supervised Learning Algorithms. Proceedings of the 23rd International Conference on Machine Learning (2006).
  5. Friedman, J, Hastie, T, and Tibshirani, R. Additive logistic regression: A statistical view of boosting, Ann. Statist. (2000) 28(2):337-407. (With discussion)
  6. Olden, JD, Lawler, JJ, and Poff, NL. Machine learning methods without tears: a primer for ecologists, Q Rev Biol. (2008) 83(2):171-193.

And of course, The Elements of Statistical Learning, by Hastie and colleagues, is full of illustrations and references. Also be sure to check the Statistical Data Mining Tutorials by Andrew Moore.