Solved – Evaluating logistic regression and interpretation of Hosmer-Lemeshow Goodness of Fit

goodness-of-fit, logistic, model-evaluation, r, regression-strategies

There are two common ways to evaluate a logistic regression model, and they test
very different things:

  1. Predictive power:

    A statistic that measures how well the dependent variable can be predicted
    from the independent variables. Well-known pseudo-R^2 measures are those of
    McFadden (1974) and Cox and Snell (1989).

  2. Goodness-of-fit statistics:

    These tests tell you whether you could do even better by making the model more
    complicated, that is, whether there are any non-linearities or interactions
    that you have missed.

I applied both approaches to my model, which already includes quadratic and
interaction terms:

    > summary(spec_q2)

    Call:
    glm(formula = result ~ Top + Right + Left + Bottom + I(Top^2) + 
     I(Left^2) + I(Bottom^2) + Top:Right + Top:Bottom + Right:Left, 
     family = binomial())

     Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
    (Intercept)  0.955431   8.838584   0.108   0.9139    
    Top          0.311891   0.189793   1.643   0.1003    
    Right       -1.015460   0.502736  -2.020   0.0434 *  
    Left        -0.962143   0.431534  -2.230   0.0258 *  
    Bottom       0.198631   0.157242   1.263   0.2065    
    I(Top^2)    -0.003213   0.002114  -1.520   0.1285    
    I(Left^2)   -0.054258   0.008768  -6.188 6.09e-10 ***
    I(Bottom^2)  0.003725   0.001782   2.091   0.0366 *  
    Top:Right    0.012290   0.007540   1.630   0.1031    
    Top:Bottom   0.004536   0.002880   1.575   0.1153    
    Right:Left  -0.044283   0.015983  -2.771   0.0056 ** 
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    (Dispersion parameter for binomial family taken to be 1)
    Null deviance: 3350.3  on 2799  degrees of freedom
    Residual deviance: 1984.6  on 2789  degrees of freedom
    AIC: 2006.6

The predictive power is shown below. McFadden's pseudo-R^2 is 0.4076 (adjusted: 0.4004), and values between 0.2 and 0.4 are taken to represent a very good fit of the model (Louviere et al. (2000), Domencich and McFadden (1975)):

    > PseudoR2(spec_q2)
            McFadden     Adj.McFadden        Cox.Snell       Nagelkerke
           0.4076315        0.4004680        0.3859918        0.5531859
    McKelvey.Zavoina           Effron            Count        Adj.Count
           0.6144487        0.4616466        0.8489286        0.4712500
                 AIC    Corrected.AIC
        2006.6179010     2006.7125925
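
As a cross-check, the likelihood-based values in this output can be reproduced by hand from the deviances in the summary above. This is a minimal sketch, assuming ungrouped 0/1 responses (so that the deviance equals -2 times the log-likelihood) and n = 2800 observations, inferred from the 2799 null degrees of freedom. The variable names are illustrative and not part of the original analysis.

```r
# Deviances as printed by summary(spec_q2); n is inferred as null df + 1.
null_dev  <- 3350.3
resid_dev <- 1984.6
n         <- 2799 + 1

# Assumption: ungrouped binary data, so deviance = -2 * logLik and
# deviance ratios can stand in for log-likelihood ratios.
mcfadden   <- 1 - resid_dev / null_dev
cox_snell  <- 1 - exp(-(null_dev - resid_dev) / n)
nagelkerke <- cox_snell / (1 - exp(-null_dev / n))

round(c(McFadden = mcfadden, Cox.Snell = cox_snell, Nagelkerke = nagelkerke), 4)
```

Up to the rounding of the printed deviances, these should agree with the McFadden, Cox.Snell, and Nagelkerke entries reported by PseudoR2.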

The goodness-of-fit test gives:

    > hoslem.test(result, phat, g = 8)

        Hosmer and Lemeshow goodness of fit (GOF) test

    data:  result, phat
    X-squared = 2800, df = 6, p-value < 2.2e-16
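
For reference, here is a minimal base-R sketch of what a Hosmer-Lemeshow-type statistic computes: observations are binned by quantiles of the fitted probabilities, and observed event counts are compared with expected counts in each bin. The data are simulated and the helper `hl_stat` is hypothetical (it is not the `hoslem.test` implementation); it assumes a numeric 0/1 response and fitted probabilities without ties at the bin boundaries.

```r
set.seed(1)
n <- 2800
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-0.5 + x))        # simulated 0/1 outcome
fit <- glm(y ~ x, family = binomial())
phat <- fitted(fit)                        # fitted probabilities

# Hypothetical helper: y must be numeric 0/1, phat the fitted probabilities.
hl_stat <- function(y, phat, g = 10) {
  breaks <- quantile(phat, probs = seq(0, 1, length.out = g + 1))
  grp  <- cut(phat, breaks = breaks, include.lowest = TRUE)
  obs1 <- tapply(y, grp, sum)              # observed events per bin
  exp1 <- tapply(phat, grp, sum)           # expected events per bin
  n_k  <- tapply(y, grp, length)           # bin sizes
  stat <- sum((obs1 - exp1)^2 / (exp1 * (1 - exp1 / n_k)))
  df   <- g - 2
  c(statistic = stat, df = df, p.value = pchisq(stat, df, lower.tail = FALSE))
}

res8  <- hl_stat(y, phat, g = 8)
res10 <- hl_stat(y, phat, g = 10)
print(rbind(g8 = res8, g10 = res10))
```

Running it with different values of `g` gives different statistics and p-values for the same fitted model, which is worth keeping in mind when reading a single result.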

As I understand it, the GOF test is actually testing the following null and alternative hypotheses:

  H0: The model does not need interaction and non-linear terms
  H1: The model needs interaction and non-linear terms

Since my model already includes interaction and non-linear terms and the p-value indicates that H0 should be rejected, I concluded that my model does indeed need interactions and non-linearity. I hope my interpretation is correct; thanks in advance for any advice.

Best Answer

There are several issues to address.

  • $R^2$ measures by themselves never measure goodness of fit; they mainly measure predictive discrimination. Goodness of fit only comes from comparing $R^2$ with the $R^2$ from a richer model.
  • The Hosmer-Lemeshow test is for overall calibration error, not for any particular lack of fit such as quadratic effects. It does not properly take overfitting into account, is arbitrary with respect to the choice of bins and the method of computing quantiles, and often has power that is too low.
  • For these reasons the Hosmer-Lemeshow test is no longer recommended. Hosmer et al. have a better one d.f. omnibus test of fit, implemented in the residuals.lrm function of the R rms package.
  • In your case, goodness of fit can be assessed by jointly testing (in a "chunk" test) the contribution of all the square and interaction terms.
  • But I recommend specifying the model to make it more likely to fit up front (especially with regard to relaxing linearity assumptions using regression splines) and using the bootstrap to estimate overfitting and to get an overfitting-corrected high-resolution smooth calibration curve to check absolute accuracy. These are done using the R rms package.
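
The "chunk" test mentioned above can be sketched in base R as a likelihood ratio test between the main-effects model and the model with all square and interaction terms. This is only an illustration: the data are simulated stand-ins for the question's variables, and the coefficients used to generate them are arbitrary assumptions.

```r
set.seed(42)
n <- 2800
# Simulated stand-ins for the question's predictors (hypothetical data).
d <- data.frame(Top = rnorm(n), Right = rnorm(n),
                Left = rnorm(n), Bottom = rnorm(n))
lp <- -0.5 + 0.5 * d$Top - 0.5 * d$Right - 0.4 * d$Left^2 +
  0.3 * d$Right * d$Left
d$result <- rbinom(n, 1, plogis(lp))

# Main-effects model, and the richer model with the whole "chunk" added.
reduced <- glm(result ~ Top + Right + Left + Bottom,
               family = binomial(), data = d)
full <- update(reduced, . ~ . + I(Top^2) + I(Left^2) + I(Bottom^2) +
                 Top:Right + Top:Bottom + Right:Left)

# Joint likelihood ratio test of all six added terms at once.
chunk <- anova(reduced, full, test = "Chisq")
print(chunk)
```

A single test with 6 d.f. replaces six separate Wald tests, so the question "do the non-linear and interaction terms matter as a group?" gets one answer rather than a set of individually noisy ones.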

On the last point, I prefer the philosophy that models be flexible (as limited by the sample size, anyway) and that we concentrate more on "fit" than "lack of fit".
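
A rough sketch of that workflow with the rms package follows. It assumes rms is installed, uses simulated data in place of the question's data, and picks 4 knots per spline and B = 100 bootstrap repetitions arbitrarily; treat it as a starting point under those assumptions rather than a recommended specification.

```r
library(rms)  # assumed to be installed: install.packages("rms")

set.seed(42)
n <- 2800
# Simulated stand-ins for the question's variables (hypothetical data).
d <- data.frame(Top = rnorm(n), Right = rnorm(n),
                Left = rnorm(n), Bottom = rnorm(n))
d$result <- rbinom(n, 1,
                   plogis(-0.5 + 0.5 * d$Top - 0.5 * d$Right - 0.4 * d$Left^2))

# Restricted cubic splines relax the linearity assumption up front.
# x = TRUE, y = TRUE store the design matrix and response for later steps.
fit <- lrm(result ~ rcs(Top, 4) + rcs(Right, 4) + rcs(Left, 4) + rcs(Bottom, 4),
           data = d, x = TRUE, y = TRUE)

# One d.f. omnibus goodness-of-fit test via residuals.lrm.
gof <- residuals(fit, type = "gof")
print(gof)

# Bootstrap estimate of overfitting (optimism-corrected indexes).
val <- validate(fit, B = 100)
print(val)

# Overfitting-corrected smooth calibration curve.
cal <- calibrate(fit, B = 100)
plot(cal)
```

In the validate output, the optimism and corrected-index columns indicate how much of the apparent performance is due to overfitting, and the calibration plot compares predicted probabilities with bias-corrected observed proportions.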