Solved – Residual deviance, residuals, and log-likelihood in [weighted] logistic regression


I would like to ask a question about the relationship between deviance, residuals, and log-likelihood in logistic regression.
I'm currently fitting a logistic regression on a moderately sized dataset (N > 300k). As far as I know, the residual deviance equals -2 times the log-likelihood, and it also equals the sum of squared residuals of the fitted model.

I observed a weird result from my data, here is my code:

library(splines)  # needed for ns()

xnam <- "ns(ym,11)+as.factor(sex)+as.factor(m_edu)+as.factor(mage)+as.factor(ges)+as.factor(parity)"
mlist.form <- as.formula(paste0('lbw ~ pm10_w + ', xnam))

mod0 <- glm(formula = mlist.form, data = data.used, family = binomial(link='logit'))

mod0$deviance  # 2704.049
sum(mod0$residuals ^2)  # 1866549
logLik(mod0)  # 'log Lik.' -1352.025 (df=24)

In my example, the sum of squared residuals is not the same as the residual deviance, but the residual deviance does equal -2 times the log-likelihood.

Even stranger, my previous understanding is confirmed on a small dataset like mtcars.

data(mtcars)  # mtcars is already a data frame, so no coercion is needed

m1 <- glm(am ~ hp + wt, data = mtcars, family = binomial("logit"))

m1$deviance        # 10.05911 (residual deviance) = -2 * log-likelihood (lnL)
m1$aic             # 16.05911 = -2*lnL + 2*k
m1$deviance + 2*3  # 16.05911 (k = 3 estimated parameters)
sum(resid(m1)^2)   # 10.05911

I am not aware of anything that describes a relationship between data size and model fit. Could anyone explain the reason for these strange results?

Thank you.

Best Answer

residual deviance equals -2 times the log-likelihood, and it also equals the sum of squared residuals of the fitted model.

The second statement is only true in general for ordinary least squares / linear regression / maximum likelihood with Gaussian residuals. For GLMs there are a variety of different ways of computing residuals.

The residuals stored as $residuals in the fitted object are the working residuals, which are the residuals on the response scale translated back to the link scale (deep within the guts of glm.fit() you'll find residuals <- (y - mu)/mu.eta(eta)). On the other hand, residuals() returns the deviance residuals, which are the signed square roots of each observation's contribution to the deviance; in particular, the deviance residuals are defined so that their sum of squares equals the overall deviance. (This is definitely one of the cases where the R documentation assumes that you're an expert in a particular field of statistics [GLMs in this case] and that all you need to know is how to run the machinery in R.)

d <- data.frame(y=c(1,0,0,0,0,1))
m <- glm(y~1, data=d, family=binomial)
m$residuals  ## working residuals
##    1    2    3    4    5    6 
##  3.0 -1.5 -1.5 -1.5 -1.5  3.0 
residuals(m)  ## deviance residuals
##          1          2          3          4          5          6 
##  1.4823038 -0.9005166 -0.9005166 -0.9005166 -0.9005166  1.4823038 
residuals(m,"working")
##    1    2    3    4    5    6 
##  3.0 -1.5 -1.5 -1.5 -1.5  3.0 
residuals(m,"deviance")
##          1          2          3          4          5          6 
##  1.4823038 -0.9005166 -0.9005166 -0.9005166 -0.9005166  1.4823038 
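
To make the relationship concrete, here is a quick check (a sketch; it assumes the logit link, for which mu.eta(eta) = mu*(1 - mu), and the Bernoulli deviance contribution -2*[y*log(mu) + (1-y)*log(1-mu)]) that reproduces both residual types by hand and verifies the sum-of-squares identity:

mu <- fitted(m)  ## fitted probabilities (1/3 for every observation here)
## working residuals by hand: (y - mu)/mu.eta(eta)
(d$y - mu) / (mu * (1 - mu))
##    1    2    3    4    5    6 
##  3.0 -1.5 -1.5 -1.5 -1.5  3.0 
## deviance residuals by hand: signed square root of each observation's
## contribution to the deviance
sign(d$y - mu) * sqrt(-2 * (d$y * log(mu) + (1 - d$y) * log(1 - mu)))
##          1          2          3          4          5          6 
##  1.4823038 -0.9005166 -0.9005166 -0.9005166 -0.9005166  1.4823038 
## their squares sum to the residual deviance
all.equal(sum(residuals(m)^2), deviance(m))  ## TRUE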

(You can also compute Pearson residuals, which are the difference between observed and expected scaled by the square root of the expected variance, i.e. $(y_i-\mu_i)/\sqrt{V(\mu_i)}$, or response residuals, which are just $y_i-\mu_i$.)
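
For completeness, here is a short sketch of those two types on the same toy model (reusing mu from the snippet above; for the binomial family the variance function is V(mu) = mu*(1 - mu)):

residuals(m, "pearson")
##          1          2          3          4          5          6 
##  1.4142136 -0.7071068 -0.7071068 -0.7071068 -0.7071068  1.4142136 
(d$y - mu) / sqrt(mu * (1 - mu))  ## same thing, computed by hand
residuals(m, "response")          ## just y - mu
##          1          2          3          4          5          6 
##  0.6666667 -0.3333333 -0.3333333 -0.3333333 -0.3333333  0.6666667 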

There are a couple of general lessons here:

  • it is much better to use accessor methods such as residuals() rather than extracting components of objects with $ or @: the accessor methods are documented and are more or less guaranteed not to change, whereas maintainers could change the internal structure of an object any time they want (admittedly very unlikely to happen for glm objects ...)
  • in general there is a one-to-many mapping from concepts and quantities defined for linear models (residuals, $R^2$, etc.) to the analogous quantities for GLMs. In particular, ?residuals.glm mentions that the options for the type argument are: deviance (the default), pearson (raw residuals scaled by the estimated standard deviation), working (see above), response ("raw" residuals, i.e. $y-\mu$), and partial ("a matrix of working residuals, with each column formed by omitting a term from the model"). A side-by-side comparison of the four vector-valued types is sketched below.
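
As a final illustration (a minimal sketch using the toy model from above), the four vector-valued residual types can be collected side by side via the accessor method:

sapply(c("deviance", "pearson", "working", "response"),
       function(type) residuals(m, type = type))
## returns a 6 x 4 matrix with one column per residual type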