Just stumbled across the problem that McFadden's pseudo-$R^2$ yields different values in logistic regression depending on the grouping of the data. When predictor values occur more than once, there are two ways to specify the model:
- as binary data with each response being 0 or 1
- as data grouped by predictor values with frequencies and weights
Both ways of specifying the model in R's `glm` yield exactly the same coefficients and confidence intervals, but different values for McFadden's pseudo-$R^2$.
Why is this so? Does this not make McFadden's pseudo-$R^2$ a very dubious $R^2$ candidate? Or is this not a bug, but a feature?
Let me illustrate the problem with an example using the `menarche` dataset from the MASS package in vanilla R:
```r
library(MASS)  # provides the menarche dataset

fit.grouped   <- glm(Menarche/Total ~ Age, binomial, weights = Total, data = menarche)
fit.grouped.0 <- glm(Menarche/Total ~ 1, binomial, weights = Total, data = menarche)

dat.ungrouped <- data.frame(
    Age = c(rep(menarche$Age, menarche$Menarche),
            rep(menarche$Age, menarche$Total - menarche$Menarche)),
    Menarche = c(rep(1, sum(menarche$Menarche)),
                 rep(0, sum(menarche$Total - menarche$Menarche))))

fit.ungrouped   <- glm(Menarche ~ Age, binomial, data = dat.ungrouped)
fit.ungrouped.0 <- glm(Menarche ~ 1, binomial, data = dat.ungrouped)

cat(sprintf("McFadden's R^2 grouped: %f\nMcFadden's R^2 ungrouped: %f\n",
            1 - logLik(fit.grouped)/logLik(fit.grouped.0),
            1 - logLik(fit.ungrouped)/logLik(fit.ungrouped.0)))
```

Output:

```
McFadden's R^2 grouped: 0.970684
McFadden's R^2 ungrouped: 0.691075
```
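As a quick check on the claim that the two specifications agree, the coefficients and Wald intervals of the fits above can be compared side by side (a small sketch reusing the `fit.grouped` and `fit.ungrouped` objects defined above):

```r
# Coefficient estimates of the two specifications;
# these agree up to numerical tolerance.
cbind(grouped = coef(fit.grouped), ungrouped = coef(fit.ungrouped))

# Wald confidence intervals for the Age coefficient likewise coincide.
rbind(grouped   = confint.default(fit.grouped)["Age", ],
      ungrouped = confint.default(fit.ungrouped)["Age", ])

# Yet the maximized log-likelihoods differ:
logLik(fit.grouped)
logLik(fit.ungrouped)
```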
Best Answer
In short, the grouped model computes its log-likelihood without treating the `Total` variable as a count of repeated observations. It treats each row of the data frame as a single, equally weighted observation, which effectively upweights the smaller cells and downweights the larger ones.

More broadly, the `logLik` function bases its computation for `glm` objects on the AIC. The AIC for a `glm` fit is in turn computed by the `aic` function supplied by the `binomial()` family, which does not treat the weights as counts of observations. This is noted in the documentation for `glm`: the weights here are 'prior weights', and there is no way for `glm` to treat them 'as observations' in the way that, for example, Stata's `[fweight=...]` does.

In the end, `glm` was not designed with frequency weights in mind. Although they work in the sense of producing the correct coefficients, the weights are not treated as surrogates for individual observations.

Below I printed a few results based on my reproduction of these results. In particular, see the AIC values and the `nobs` results. The ungrouped results are correct in this case, as they reflect the actual number of observations underlying the model.
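The comparison referred to above can be reproduced along these lines (a sketch reusing the fits from the question; `nobs`, `logLik`, and `AIC` are the standard extractors from the stats package):

```r
# The grouped fit counts one "observation" per row of menarche,
# while the ungrouped fit counts one per underlying Bernoulli trial:
nobs(fit.grouped)    # number of rows in menarche
nobs(fit.ungrouped)  # sum(menarche$Total)

# The log-likelihoods, and hence the AICs, differ between the two
# parameterizations, even though the coefficients are identical:
AIC(fit.grouped, fit.ungrouped)
logLik(fit.grouped.0)
logLik(fit.ungrouped.0)
```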