Solved – Stepwise model selection, Hosmer-Lemeshow statistics and prediction success of model in nested logistic regression in R

logistic, multilevel-analysis, r

Is it possible to do stepwise (direction = "both") model selection for a nested binary logistic regression in R? I would also appreciate it if you could teach me how to get:

  • the Hosmer-Lemeshow statistic,
  • the odds ratios of the predictors,
  • the prediction success of the model.

I used the lme4 package in R. This is the script I used to fit the general model with all the independent variables:

nest.reg <- glmer(decision ~ age + education + children + (1|town), family = binomial, data = fish)

where:

  • fish — dataframe
  • decision — 1 or 0: whether the respondent exits or stays, respectively.
  • age, education and children — independent variables.
  • town — random effect (where our respondents are nested)
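
For reference, the quantities asked about can be pulled from a glmer fit like the one above. A sketch, assuming the model converges without warnings (the 0.5 cutoff for the classification table is an arbitrary choice, not part of the original script):

```r
library(lme4)

nest.reg <- glmer(decision ~ age + education + children + (1 | town),
                  family = binomial, data = fish)

## Odds ratios: exponentiate the fixed-effect coefficients (log-odds scale).
exp(fixef(nest.reg))
## Wald confidence intervals for the fixed effects, on the odds-ratio scale.
exp(confint(nest.reg, parm = "beta_", method = "Wald"))

## Crude "prediction success": classify at 0.5 and cross-tabulate with the outcome.
pred <- predict(nest.reg, type = "response")   # fitted probabilities
table(observed = fish$decision, predicted = as.numeric(pred > 0.5))
mean((pred > 0.5) == fish$decision)            # overall classification accuracy
```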

Now my problem is how to get the best model. I know how to do stepwise model selection, but only for linear regression: step(lm(decision ~ age + education + children, data = fish), direction = "both"). But this cannot be used for binary logistic regression, right? Also, when I add (1|town) to the formula to account for the effect of town, I get an error.

By the way… I'm very grateful to Manoel Galdino, who provided me with the script for running the nested logistic regression.

Thank you very much for your help.

Best Answer

I really appreciate the pointers to my book, papers, and R package. Briefly, stepwise regression is invalid: it destroys all statistical properties of the result and also fares poorly in predictive accuracy. There is no reason to use ROC curves to guide model selection (if model selection is even a good idea), because we have the optimal measure, the log-likelihood, and its variants such as AIC. Thresholds on the dependent variable should be handled with ordinal regression rather than a series of binary models. The Hosmer-Lemeshow test is now considered obsolete by many statisticians, as well as by its original authors. See the reference below (which proposes a better method, implemented in the rms package).
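
If candidate models must be compared at all, the log-likelihood-based measures mentioned above are available directly for glmer fits. A minimal sketch, assuming both models converge on the same data (the particular term dropped here is just for illustration):

```r
library(lme4)

full <- glmer(decision ~ age + education + children + (1 | town),
              family = binomial, data = fish)
reduced <- update(full, . ~ . - children)   # drop one fixed effect

logLik(full)            # the log-likelihood itself
AIC(full, reduced)      # smaller AIC is better by this criterion
anova(full, reduced)    # likelihood-ratio test of the nested pair
```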

@ARTICLE{hos97com,
  author  = {Hosmer, D. W. and Hosmer, T. and {le Cessie}, S. and Lemeshow, S.},
  year    = 1997,
  title   = {A comparison of goodness-of-fit tests for the logistic regression model},
  journal = {Statistics in Medicine},
  volume  = 16,
  pages   = {965-980},
  annote  = {goodness-of-fit for binary logistic model; difficulty with Hosmer-Lemeshow statistic being dependent on how groups are defined; sum of squares test (see cop89unw); cumulative sum test; invalidity of naive test based on deviance; goodness-of-link function; simulation setup; see sta09sim}
}
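
As a sketch of that better method: the rms package implements the le Cessie–van Houwelingen unweighted sum-of-squares goodness-of-fit test via residuals.lrm. Note that lrm() does not fit random effects, so the (1 | town) term is dropped here purely for illustration:

```r
library(rms)

## Ordinary logistic fit (no random effect; lrm() cannot fit (1 | town)).
## x = TRUE, y = TRUE keep the design matrix and response, which resid() needs.
fit <- lrm(decision ~ age + education + children, data = fish,
           x = TRUE, y = TRUE)

resid(fit, type = "gof")   # le Cessie–van Houwelingen global goodness-of-fit test
fit$stats["C"]             # c-index (concordance), a measure of discrimination
```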

See also

@ARTICLE{sta09sim,
  author  = {Stallard, Nigel},
  year    = 2009,
  title   = {Simple tests for the external validation of mortality prediction scores},
  journal = {Statistics in Medicine},
  volume  = 28,
  pages   = {377-388},
  annote  = {low power of older Hosmer-Lemeshow test; avoiding grouping of predicted risks; logarithmic and quadratic test; scaled $\chi^2$ approximation; simulation setup; best power seems to be for the logarithmic (deviance) statistic and for the chi-square statistic that is like the sum of squared errors statistic except that each observation is weighted by $p(1-p)$}
}