Goodness-of-fit test in logistic regression: which 'fit' do we want to test?

Tags: hosmer-lemeshow-test, hypothesis-testing, logistic, predictive-models, regression-strategies

I am referring to the question How to compare (probability) predictive ability of models developed from logistic regression? by @Clark Chong, with answers/comments by @Frank Harrell, and to the question Degrees of freedom of $\chi^2$ in Hosmer-Lemeshow test and its comments.

I have read the paper D.W. Hosmer, T. Hosmer, S. le Cessie and S. Lemeshow, "A comparison of goodness-of-fit tests for the logistic regression model", Statistics in Medicine, Vol. 16, pp. 965–980 (1997).

After reading it I was confused, because the question I referred to asks explicitly about "(probability) predictive ability", which is in my opinion not the same as what the goodness-of-fit tests in the paper supra aim at:

As most of us know, logistic regression assumes an S-shaped relation between the explanatory variables and the probability of success; the functional form of this S-shape is

$P(y=1 \mid \mathbf{x})=\frac{1}{1+e^{-(\beta_0+\sum_i \beta_i x_i)}}$
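
To make this functional form concrete, here is a minimal Python sketch; the coefficients and covariate values are made up purely for illustration:

```python
import numpy as np

def predicted_probability(x, beta_0, beta):
    """P(y=1 | x) = 1 / (1 + exp(-(beta_0 + sum_i beta_i * x_i)))."""
    linear_predictor = beta_0 + np.dot(beta, x)
    return 1.0 / (1.0 + np.exp(-linear_predictor))

# Hypothetical example with two explanatory variables.
print(predicted_probability(x=np.array([1.2, -0.5]),
                            beta_0=0.3,
                            beta=np.array([0.8, 1.1])))
```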

Without pretending that there are no shortcomings with the Hosmer–Lemeshow test, I think that we have to distinguish between tests for the (a) '(probability) predictive ability' and (b) 'goodness-of-fit'.

The former's goal is to test whether the probabilities are well predicted, while the goodness-of-fit tests test whether the S-shaped function above is the 'right' function. More formally:

  1. tests for 'probability predictive ability' have an $H_0$ stating that the success probabilities are well predicted by the model;
  2. while for goodness-of-fit tests $H_0$ is (see Hosmer et al.) that the S-shaped functional form supra is the correct one (a rough sketch of the Hosmer–Lemeshow statistic follows this list). Hosmer et al. perform simulations in which they estimate the power to detect two types of deviations from the null, namely that the link function is wrong or that the exponent in the denominator is not linear in the covariates.
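
For concreteness, here is a rough Python sketch of the decile-of-risk statistic on which the Hosmer–Lemeshow test is based (groups of roughly equal size formed by sorting on the fitted probabilities, compared against a $\chi^2$ distribution with the usual $g-2$ degrees of freedom); it is only an illustration under those standard conventions, not a vetted implementation:

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p_hat, g=10):
    """Hosmer-Lemeshow 'decile of risk' statistic and p-value (sketch).

    y     : array of 0/1 outcomes
    p_hat : fitted probabilities from the logistic model
    g     : number of risk groups (10 gives the usual deciles of risk)
    """
    order = np.argsort(p_hat)
    y, p_hat = np.asarray(y)[order], np.asarray(p_hat)[order]
    groups = np.array_split(np.arange(len(y)), g)  # roughly equal-sized groups

    C = 0.0
    for idx in groups:
        n_k = len(idx)
        o_k = y[idx].sum()        # observed successes in group k
        e_k = p_hat[idx].sum()    # expected successes in group k
        p_bar = e_k / n_k
        C += (o_k - e_k) ** 2 / (n_k * p_bar * (1 - p_bar))

    df = g - 2                    # usual reference distribution when the
    return C, chi2.sf(C, df)      # model was fitted on the same data
```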

Obviously, if the above function has the 'right' functional form (i.e. if the test concludes that we can accept $H_0$ of the goodness-of-fit test), then the predicted probabilities will be fine, …

First remark

…however, accepting the $H_0$ is a weak conclusion as explained in What follows if we fail to reject the null hypothesis?.

First question

The most important question/remark that I have: if the goodness-of-fit $H_0$ is rejected, then the conclusion of the test is that the functional form was not the 'right' one; however, does this imply that the probabilities are not well predicted?

Second question

Furthermore, I want to point to the conclusions of Hosmer et al. (I cite from the abstract):

''An examination of the performance of the tests when the correct model has a quadratic term but a model containing only the linear term has been fit shows that the Pearson chi-square, the unweighted sum-of-squares, the Hosmer–Lemeshow decile of risk, the smoothed residual sum-of-squares and Stukel's score test have power exceeding 50 per cent to detect moderate departures from linearity when the sample size is 100, and have power over 90 per cent for these same alternatives for samples of size 500. All tests had no power when the correct model had an interaction between a dichotomous and continuous covariate but only the continuous covariate model was fit. Power to detect an incorrectly specified link was poor for samples of size 100. For samples of size 500, Stukel's score test had the best power, but it only exceeded 50 per cent to detect an asymmetric link function. The power of the unweighted sum-of-squares test to detect an incorrectly specified link function was slightly less than Stukel's score test.''

Can I conclude from this which test has the most power, or that the Hosmer–Lemeshow test has less power (to detect these specific anomalies)?

Second remark

The paper by Hosmer et al. that I referred to supra computes (by simulation) the power to detect specific anomalies (the power can only be computed if an $H_1$ is specified). This does not, in my opinion, imply that these results can be generalised to ''all possible alternatives $H_1$'', does it?
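
To make explicit what such a power computation involves, here is a rough simulation sketch for one specific $H_1$ (the true model has a quadratic term but only the linear term is fitted, as in the quoted abstract). It reuses the hosmer_lemeshow sketch from earlier in this question; the coefficients, sample size, significance level and number of replications are arbitrary choices for illustration:

```python
import numpy as np
import statsmodels.api as sm

# Assumes the hosmer_lemeshow() sketch defined earlier is in scope.
rng = np.random.default_rng(0)

def one_replication(n=500):
    x = rng.uniform(-3, 3, size=n)
    eta = -0.5 + 0.8 * x + 0.6 * x ** 2          # true (quadratic) linear predictor
    y = rng.binomial(1, 1 / (1 + np.exp(-eta)))  # outcomes generated under H1
    X = sm.add_constant(x)                       # misspecified fit: linear term only
    p_hat = sm.Logit(y, X).fit(disp=0).predict(X)
    _, p_value = hosmer_lemeshow(y, p_hat)
    return p_value < 0.05                        # reject at the 5% level?

# Estimated power against this one alternative; it says nothing about others.
power = np.mean([one_replication() for _ in range(200)])
print(f"estimated power against this specific H1: {power:.2f}")
```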

Best Answer

"Goodness of fit" is sometimes used in one sense as the contrary of evident model mis-specification, "lack of fit"; & sometimes in another sense as a model's predictive performance—how well predictions match to observations. The Hosmer–Lemeshow test is for goodness of fit in the first sense, & although evidence of lack of fit suggests predictive performance (GoF in the second sense, measured by say Nagelkerke's $R^2$ or Brier scores) could be improved, you're none the wiser as to how or by how much until you try out specific improvements (typically by including interaction terms, or a spline or polynomial basis for representing continuous predictors to allow for a curvilinear relationship with the logit; sometimes by changing the link).

Goodness-of-fit tests are intended to have reasonable power against a variety of alternatives, rather than high power against a specific alternative; so people comparing the power of different tests tend to take the pragmatic approach of picking a few alternatives that are thought to be of particular interest to potential users (see for example the frequently cited Stephens (1974), "EDF statistics for goodness of fit & some comparisons", JASA, 69, 347). You can't conclude that one test is more powerful than another against all possible alternatives just because it's more powerful against some.