Pseudo-R-Squared – Why McFadden’s Values Differ by Data Grouping

logistic, pseudo-r-squared

Just stumbled across the problem that McFadden's pseudo-$R^2$ yields different values in logistic regression depending on how the data are grouped. When predictor values occur more than once, there are two ways to specify the model:

  1. as binary data with each response being 0 or 1
  2. as data grouped by predictor values with frequencies and weights

Both ways to specify the model in R's glm yield exactly the same coefficients and confidence intervals, but different values for McFadden's Pseudo-$R^2$.

Why is this so? Does this not make McFadden's Pseudo-$R^2$ a very dubious $R^2$ candidate? Or is this not a bug, but a feature?
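For reference, the quantity computed below is McFadden's $R^2 = 1 - \log\hat{L}_{\text{model}} / \log\hat{L}_{\text{null}}$, with both log-likelihoods taken from logLik() of the fitted model and the intercept-only model.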

Let me illustrate the problem with an example using the menarche dataset from the MASS package that ships with vanilla R:

library(MASS)   # for the menarche data

## grouped: proportion response with Total as prior weights
fit.grouped <- glm(Menarche/Total ~ Age, binomial, weights=Total, data=menarche)
fit.grouped.0 <- glm(Menarche/Total ~ 1, binomial, weights=Total, data=menarche)

## ungrouped: one 0/1 row per individual
dat.ungrouped <- data.frame(Age=c(rep(menarche$Age, menarche$Menarche),
                                  rep(menarche$Age, menarche$Total-menarche$Menarche)),
                            Menarche=c(rep(1, sum(menarche$Menarche)),
                                       rep(0, sum(menarche$Total-menarche$Menarche))))
fit.ungrouped <- glm(Menarche ~ Age, binomial, data=dat.ungrouped)
fit.ungrouped.0 <- glm(Menarche ~ 1, binomial, data=dat.ungrouped)

cat(sprintf("McFadden's R^2 grouped:    %f\nMcFadden's R^2 ungrouped:  %f\n",
            1 - logLik(fit.grouped)/logLik(fit.grouped.0),
            1 - logLik(fit.ungrouped)/logLik(fit.ungrouped.0)))

McFadden's R^2 grouped:    0.970684
McFadden's R^2 ungrouped:  0.691075
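A quick check, reusing the fits above, that the two specifications really do agree on the coefficients and their standard errors (just a sketch):

## coefficients and Wald standard errors side by side
cbind(grouped = coef(fit.grouped), ungrouped = coef(fit.ungrouped))
cbind(grouped   = sqrt(diag(vcov(fit.grouped))),
      ungrouped = sqrt(diag(vcov(fit.ungrouped))))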

Best Answer

In short, the grouped model computes its log-likelihood without treating the Total variable as a count of repeated observations. Each row of the data frame is treated as a single binomial observation rather than as Total separate Bernoulli trials, so the grouped and ungrouped log-likelihoods differ by an additive constant (the log binomial coefficients). That constant cancels in the likelihood-ratio test, but not in McFadden's ratio of log-likelihoods.
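This can be checked directly; a quick sketch, reusing the objects defined in the question's code:

## The likelihood-ratio statistic is identical under either grouping:
as.numeric(2 * (logLik(fit.grouped)   - logLik(fit.grouped.0)))
as.numeric(2 * (logLik(fit.ungrouped) - logLik(fit.ungrouped.0)))

## ... and the model log-likelihoods indeed differ by the sum of the
## log binomial coefficients, a constant that shifts numerator and
## denominator of McFadden's ratio by the same amount:
as.numeric(logLik(fit.grouped) - logLik(fit.ungrouped))
sum(lchoose(menarche$Total, menarche$Menarche))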

More broadly, for glm objects the logLik function derives the log-likelihood from the AIC stored in the fit.

The AIC, in turn, is computed by the aic function supplied by the binomial() family, which does not treat the weights as counts of observations. This is noted in the documentation for glm: weights are 'prior weights', and there is no way to make glm treat them as observations in the way that, for example, Stata's [fweight=...] option does.
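These pieces can be inspected directly; a quick sketch, again using the grouped fit from the question:

## logLik() for a binomial glm is recovered from the AIC as p - AIC/2,
## where p is the number of estimated coefficients:
fit.grouped$rank - AIC(fit.grouped)/2
logLik(fit.grouped)

## the family's aic function, which glm evaluates with the prior weights:
binomial()$aic

## the Total counts end up only as 25 prior weights, not 3918 observations:
length(fit.grouped$prior.weights)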

In the end, glm was not designed with "frequency weights" in mind: although specifying the counts as weights yields the correct coefficients, downstream functions such as logLik and AIC do not treat the weights as surrogates for individual observations.

Below are a few results from my reproduction of the example. In particular, note the AIC values and the nobs output.

The ungrouped results are the correct ones in this case, as they reflect the actual number of observations underlying the model.

> fit.grouped

Call:  glm(formula = Menarche/Total ~ Age, family = binomial, data = menarche, 
    weights = Total)

Coefficients:
(Intercept)          Age  
    -21.226        1.632  

Degrees of Freedom: 24 Total (i.e. Null);  23 Residual
Null Deviance:      3694 
Residual Deviance: 26.7     AIC: 114.8
> fit.ungrouped

Call:  glm(formula = Menarche ~ Age, family = binomial, data = dat.ungrouped)

Coefficients:
(Intercept)          Age  
    -21.226        1.632  

Degrees of Freedom: 3917 Total (i.e. Null);  3916 Residual
Null Deviance:      5306 
Residual Deviance: 1639     AIC: 1643

> nobs(fit.grouped)
[1] 25

> nobs(fit.ungrouped)
[1] 3918
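As a side note, if the per-observation (Bernoulli) version of McFadden's $R^2$ is wanted without expanding the data, the log-likelihoods can be recomputed by hand from the grouped fits; a rough sketch, reusing the objects from the question:

## Bernoulli log-likelihood evaluated at the fitted per-trial probabilities
ll <- function(mu) sum(menarche$Menarche * log(mu) +
                       (menarche$Total - menarche$Menarche) * log(1 - mu))
1 - ll(fitted(fit.grouped)) / ll(fitted(fit.grouped.0))   # ~0.691, the ungrouped value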