I really appreciate the pointers to my book, papers, and R package. Briefly, stepwise regression is invalid: it destroys the statistical properties of the result and also fares poorly in predictive accuracy. There is no reason to use ROC curves to guide model selection (if model selection is even a good idea), because we already have the optimal measure: the log-likelihood and its variants, such as AIC. Thresholds on the dependent variable should be handled with ordinal regression instead of a series of binary models. The Hosmer-Lemeshow test is now considered obsolete by many statisticians, as well as by its original authors. See the reference below, which proposes a better method, implemented in the rms package.
@Article{hos97com,
  author  = {Hosmer, D. W. and Hosmer, T. and {le Cessie}, S. and Lemeshow, S.},
  year    = 1997,
  title   = {A comparison of goodness-of-fit tests for the logistic regression model},
  journal = {Statistics in Medicine},
  volume  = 16,
  pages   = {965--980},
  annote  = {goodness-of-fit for binary logistic model; difficulty with Hosmer-Lemeshow statistic being dependent on how groups are defined; sum of squares test (see cop89unw); cumulative sum test; invalidity of naive test based on deviance; goodness-of-link function; simulation setup; see sta09sim}
}
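To make the AIC point above concrete, here is a minimal Python sketch of comparing two fitted models by AIC; the log-likelihood values and parameter counts are made up purely for illustration:

```python
import math

def aic(log_likelihood, n_params):
    """Akaike information criterion: -2*logLik + 2*k (lower is better)."""
    return -2.0 * log_likelihood + 2.0 * n_params

# Hypothetical fitted log-likelihoods for two logistic models
ll_small, k_small = -420.7, 3   # e.g. intercept + 2 predictors
ll_large, k_large = -415.2, 6   # adds 3 more terms

print(round(aic(ll_small, k_small), 1))  # 847.4
print(round(aic(ll_large, k_large), 1))  # 842.4 -> lower AIC, so preferred on this criterion
```

Unlike a ROC-based comparison, this uses the full log-likelihood, which is the optimal basis for judging fit, with the 2k term penalizing the extra parameters.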
See also
@Article{sta09sim,
  author  = {Stallard, Nigel},
  title   = {Simple tests for the external validation of mortality prediction scores},
  journal = {Statistics in Medicine},
  year    = 2009,
  volume  = 28,
  pages   = {377--388},
  annote  = {low power of older Hosmer-Lemeshow test; avoiding grouping of predicted risks; logarithmic and quadratic tests; scaled $\chi^2$ approximation; simulation setup; best power seems to be for the logarithmic (deviance) statistic and for the chi-square statistic that is like the sum of squared errors statistic except that each observation is weighted by $p(1-p)$}
}
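The dependence on how groups are defined, noted in the annotations above, is easy to demonstrate. Below is a minimal Python sketch of the grouped Hosmer-Lemeshow chi-square statistic, applied to simulated data that is well calibrated by construction; the function and the simulated data are illustrative assumptions, not any package's implementation:

```python
import random

def hosmer_lemeshow(y, p, g=10):
    """Hosmer-Lemeshow chi-square statistic with g equal-size risk groups:
    sum over groups of (observed - expected)^2 / (m * pbar * (1 - pbar))."""
    pairs = sorted(zip(p, y))          # sort subjects by predicted risk
    n = len(pairs)
    stat = 0.0
    for i in range(g):
        group = pairs[i * n // g:(i + 1) * n // g]
        obs = sum(yi for _, yi in group)   # observed events in group
        exp = sum(pi for pi, _ in group)   # expected events in group
        m = len(group)
        pbar = exp / m
        stat += (obs - exp) ** 2 / (m * pbar * (1 - pbar))
    return stat

random.seed(1)
p = [random.uniform(0.05, 0.95) for _ in range(500)]
y = [1 if random.random() < pi else 0 for pi in p]  # perfectly calibrated by construction

for g in (8, 10, 12):
    print(g, round(hosmer_lemeshow(y, p, g), 2))  # the statistic shifts as the grouping changes
```

Even with identical data and predictions, the statistic (and hence the p-value) changes with the arbitrary choice of g, which is one of the documented objections to the test.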
The Hosmer-Lemeshow test is to some extent obsolete because it requires arbitrary binning of predicted probabilities, has only modest power to detect lack of calibration, and does not fully penalize extreme overfitting of the model. Better methods are available, such as the one proposed in Hosmer, D. W.; Hosmer, T.; le Cessie, S. & Lemeshow, S. A comparison of goodness-of-fit tests for the logistic regression model. Statistics in Medicine, 1997, 16, 965-980. Their new measure is implemented in the R rms
package. More importantly, this kind of assessment addresses only overall model calibration (agreement between predicted and observed) and does not address lack of fit such as an improperly transformed predictor. For that matter, neither does AIC, unless you use AIC to compare two models where one is more flexible than the other being tested. I think you are interested in predictive discrimination, for which a generalized $R^2$ measure, supplemented by the $c$-index (ROC area), may be more appropriate.
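The $c$-index mentioned above is the probability that, for a randomly chosen (event, non-event) pair, the event received the higher predicted probability; it equals the ROC area. A minimal Python sketch of the pairwise definition, with made-up predictions for illustration:

```python
def c_index(y, p):
    """Concordance probability (equals ROC area): fraction of (event, non-event)
    pairs where the event has the higher prediction; ties count as 1/2."""
    events = [pi for pi, yi in zip(p, y) if yi == 1]
    nonevents = [pi for pi, yi in zip(p, y) if yi == 0]
    if not events or not nonevents:
        raise ValueError("need both outcome classes")
    concordant = sum((pe > pn) + 0.5 * (pe == pn)
                     for pe in events for pn in nonevents)
    return concordant / (len(events) * len(nonevents))

# Hypothetical outcomes and predicted probabilities
y = [0, 0, 1, 1, 0, 1]
p = [0.2, 0.4, 0.6, 0.8, 0.5, 0.5]
print(round(c_index(y, p), 3))  # 0.944 (8.5 concordant pair-score out of 9 pairs)
```

Note that the $c$-index measures discrimination only; it says nothing about calibration, which is why a generalized $R^2$ or a calibration assessment is a useful supplement rather than a substitute.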
Best Answer
Model 2 has the higher area under the ROC curve, so by that criterion it appears to be slightly better.