Solved – High p-value Based on Residual Deviance when Model Appears to have Poor Fit

Tags: deviance, goodness of fit, logistic, regression

I'm running a logistic regression with R using the glm() function with family = "binomial" and a very large number of observations (37208). Only a very small number of observations have a true result (approximately 1% of the 37208 observations).

In order to check model fit, I used pchisq() to generate a p-value based on the residual deviance (see output below). This reports 1, indicating that it is highly likely that the model is a good fit for the data.

However, looking at the predicted results, the fit doesn't seem all that 'good'. For example, only 53% of the observations predicted to have a true response (i.e. predicted probability > 0.5) actually had a true response (precision), and only 21% of the observations observed to have a true response were predicted (again, predicted probability > 0.5) to have a true response (recall). I experimented with lowering the probability threshold from 0.5 to 0.3 – this improved recall but degraded precision. Either way, this seems to indicate the fit isn't that great.

How should I interpret the p-value from the residual deviance in this situation? Why is the p-value reported to be 1? Do the large number of observations combined with the low numbers of true responses somehow make the chi-square fit test misleading? If so, why?

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 5112.9  on 37207  degrees of freedom
Residual deviance: 3329.5  on 37170  degrees of freedom
AIC: 3405.5

Number of Fisher Scoring iterations: 17

> 1 - pchisq(3329.5, 37170)
[1] 1

Best Answer

For a general GLM, the deviance $\Delta$ is defined as $\Delta:=2(\tilde{l}-\hat{l})$ where $\tilde{l}$ and $\hat{l}$ are the loglikelihood of the saturated and our model, respectively.

The general use of the deviance in goodness-of-fit testing for a GLM, with $n$ observations and $p$ parameters in the model, relies on the deviance being approximately chi-squared distributed with $n-p$ degrees of freedom, i.e. $\Delta \sim \chi^2_{n-p}$. A large deviance would indicate that our model is "far" from the perfectly fitting saturated one. Thus a large value of the deviance indicates a poorly fitting model, and the corresponding upper-tail p-value $P(\chi^2_{n-p} \ge \Delta)$ will then be small.
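In R this upper-tail p-value can be read straight off a fitted glm object; `fit` below is a placeholder for your fitted model:

```r
# fit is assumed to be your fitted model, e.g.
# fit <- glm(y ~ x1 + x2, family = binomial, data = dat)

# Upper-tail probability of the chi-square distribution:
# small values would indicate a deviance too large for a well-fitting model.
pchisq(deviance(fit), df = df.residual(fit), lower.tail = FALSE)
```

This is equivalent to `1 - pchisq(deviance(fit), df.residual(fit))` as in the question, but `lower.tail = FALSE` is numerically more accurate far out in the tail.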

However, for the binomial response distribution in a GLM, the deviance is not always a good measure of fit.

If your data are grouped, or can be grouped, the chi-square approximation will work if both $n_i \hat{\pi}_i > 5$ and $n_i(1-\hat{\pi}_i) > 5$ for each group $i$.

But if you have binary responses, i.e. each $y_i$ is either 0 or 1, the chi-square approximation will not be correct. Moreover, the deviance is then connected to the actual responses only through the fitted values. How can you assess goodness of fit with an expression for the deviance that contains only estimated values?
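To see this for the logit link (a short derivation, assuming the model contains an intercept): the saturated model fits each binary observation perfectly, so $\tilde{l}=0$ and

$$\Delta = -2\sum_{i=1}^{n}\left[y_i\log\hat{\pi}_i + (1-y_i)\log(1-\hat{\pi}_i)\right].$$

The score equations of the canonical logit link give $\sum_i y_i\,\mathrm{logit}(\hat{\pi}_i) = \sum_i \hat{\pi}_i\,\mathrm{logit}(\hat{\pi}_i)$, so the deviance simplifies to

$$\Delta = -2\sum_{i=1}^{n}\left[\hat{\pi}_i\log\frac{\hat{\pi}_i}{1-\hat{\pi}_i} + \log(1-\hat{\pi}_i)\right],$$

which depends on the data only through the fitted values $\hat{\pi}_i$.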

A good alternative if you need a test is the Hosmer-Lemeshow test.
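One way to run it in R is via the `ResourceSelection` package (a sketch, assuming that package is installed and `fit` is your fitted model):

```r
# install.packages("ResourceSelection")  # if not already installed
library(ResourceSelection)

# First argument: observed 0/1 responses; second: fitted probabilities.
# g = 10 groups the observations into deciles of predicted risk.
hoslem.test(fit$y, fitted(fit), g = 10)
```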

When it comes to measuring your model's actual predictive power, you could try making an ROC curve. It will give you an indication of how well your model is performing.

We can define sensitivity as the relative frequency of predicting an event when an event takes place, i.e. guessing right when $y_i=1$, and specificity as the relative frequency of predicting a non-event when there is no event, i.e. guessing right when $y_i=0$. Ideally both are close to 1.

If we estimate our model, calculate the probability for each observation, and classify each observation as an event or a non-event according to some threshold value, we can calculate the sensitivity and specificity of our model at this threshold. A threshold of 1 yields a sensitivity of 0 and a specificity of 1, while a threshold of 0 yields a sensitivity of 1 and a specificity of 0. So every ROC curve starts at (0,0) and ends at (1,1) (as we have 1-specificity on the x-axis). Plotted below is such a curve.
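As an illustration, sensitivity and specificity at one threshold can be computed by hand; `y` and `fit` are placeholders for your observed 0/1 response vector and fitted model:

```r
threshold <- 0.5
pred <- as.numeric(fitted(fit) > threshold)  # 1 = predicted event

sensitivity <- sum(pred == 1 & y == 1) / sum(y == 1)  # true positive rate
specificity <- sum(pred == 0 & y == 0) / sum(y == 0)  # true negative rate

# Sweeping the threshold from 1 down to 0 traces the ROC curve
# from (0, 0) to (1, 1) in (1 - specificity, sensitivity) coordinates.
```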

ROC-curve

A model that predicts well will have a sharply rising ROC curve, yielding high sensitivity together with high specificity; the further the curve bends toward the top left, the better its predictive power. A model whose ROC curve follows the $45^{\circ}$-line is no better than simply guessing.

In conclusion, the deviance is not a good measure of fit in a binary-response GLM. If you really need a test, use the Hosmer-Lemeshow test. If you're interested in the actual predictive capabilities of the model, use an ROC curve. There are several packages in R that will do this for you; pROC is one.
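For example, with pROC (again assuming `y` is the observed response and `fit` the fitted model):

```r
library(pROC)

roc_obj <- roc(response = y, predictor = fitted(fit))
auc(roc_obj)   # area under the curve; 0.5 = guessing, 1 = perfect
plot(roc_obj)  # draws the ROC curve
```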
