Solved – Stepwise model selection, Hosmer-Lemeshow statistics and prediction success of model in nested logistic regression in R

logistic, multilevel-analysis, r

Is it possible to do stepwise (direction = "both") model selection for a nested binary logistic regression in R? I would also appreciate it if you could teach me how to get:

  • the Hosmer-Lemeshow statistic,
  • the odds ratios of the predictors,
  • the prediction success of the model.

I used the lme4 package in R. This is the script I used to fit the general model with all the independent variables:

nest.reg <- glmer(decision ~ age + education + children + (1|town), family = binomial, data = fish)

where:

  • fish — dataframe
  • decision — 1 or 0: whether the respondent exits or stays, respectively.
  • age, education and children — independent variables.
  • town — random effect (where our respondents are nested)
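
For reference, the quantities asked about can be pulled from a glmer fit like the one above. A sketch, assuming the model converges without warnings (the 0.5 cutoff for the classification table is an arbitrary choice, not part of the original script):

```r
library(lme4)

nest.reg <- glmer(decision ~ age + education + children + (1 | town),
                  family = binomial, data = fish)

## Odds ratios: exponentiate the fixed-effect coefficients (log-odds scale).
exp(fixef(nest.reg))
## Wald confidence intervals for the fixed effects, on the odds-ratio scale.
exp(confint(nest.reg, parm = "beta_", method = "Wald"))

## Crude "prediction success": classify at 0.5 and cross-tabulate with the outcome.
pred <- predict(nest.reg, type = "response")   # fitted probabilities
table(observed = fish$decision, predicted = as.numeric(pred > 0.5))
mean((pred > 0.5) == fish$decision)            # overall classification accuracy
```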

Now my problem is how to get the best model. I know how to do stepwise model selection, but only for linear regression: step(lm(decision ~ age + education + children, data = fish), direction = "both"). But this cannot be used for binary logistic regression, right? Also, when I add (1|town) to the formula to account for the effect of town, I get an error.

By the way… I'm very grateful to Manoel Galdino, who provided me with the script for running the nested logistic regression.

Thank you very much for your help.

Best Answer

I really appreciate the pointers to my book, papers, and R package. Briefly, stepwise regression is invalid: it destroys all statistical properties of the result and also fares poorly in predictive accuracy. There is no reason to use ROC curves to guide model selection (if model selection is even a good idea), because we have the optimal measure, the log-likelihood, and its variants such as AIC. Thresholds on the dependent variable should be handled with ordinal regression rather than a series of binary models. The Hosmer-Lemeshow test is now considered obsolete by many statisticians, as well as by its original authors. See the reference below (which proposes a better method, implemented in the rms package).
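
If candidate models must be compared at all, the log-likelihood-based measures mentioned above are available directly for glmer fits. A minimal sketch, assuming both models converge on the same data (the particular term dropped here is just for illustration):

```r
library(lme4)

full <- glmer(decision ~ age + education + children + (1 | town),
              family = binomial, data = fish)
reduced <- update(full, . ~ . - children)   # drop one fixed effect

logLik(full)            # the log-likelihood itself
AIC(full, reduced)      # smaller AIC is better by this criterion
anova(full, reduced)    # likelihood-ratio test of the nested pair
```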

@ARTICLE{hos97com,
  author  = {Hosmer, D. W. and Hosmer, T. and {le Cessie}, S. and Lemeshow, S.},
  year    = 1997,
  title   = {A comparison of goodness-of-fit tests for the logistic regression model},
  journal = {Statistics in Medicine},
  volume  = 16,
  pages   = {965-980},
  annote  = {goodness-of-fit for binary logistic model; difficulty with Hosmer-Lemeshow statistic being dependent on how groups are defined; sum of squares test (see cop89unw); cumulative sum test; invalidity of naive test based on deviance; goodness-of-link function; simulation setup; see sta09sim}
}
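
As a sketch of that better method: the rms package implements the le Cessie–van Houwelingen unweighted sum-of-squares goodness-of-fit test via residuals.lrm. Note that lrm() does not fit random effects, so the (1 | town) term is dropped here purely for illustration:

```r
library(rms)

## Ordinary logistic fit (no random effect; lrm() cannot fit (1 | town)).
## x = TRUE, y = TRUE keep the design matrix and response, which resid() needs.
fit <- lrm(decision ~ age + education + children, data = fish,
           x = TRUE, y = TRUE)

resid(fit, type = "gof")   # le Cessie–van Houwelingen global goodness-of-fit test
fit$stats["C"]             # c-index (concordance), a measure of discrimination
```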

See also

@ARTICLE{sta09sim,
  author  = {Stallard, Nigel},
  year    = 2009,
  title   = {Simple tests for the external validation of mortality prediction scores},
  journal = {Statistics in Medicine},
  volume  = 28,
  pages   = {377-388},
  annote  = {low power of older Hosmer-Lemeshow test; avoiding grouping of predicted risks; logarithmic and quadratic test; scaled $\chi^2$ approximation; simulation setup; best power seems to be for the logarithmic (deviance) statistic and for the chi-square statistic that is like the sum of squared errors statistic except that each observation is weighted by $p(1-p)$}
}