Solved – Generalized Linear Model in SPSS: why are common values among predictors treated as subpopulations?

aic, degrees of freedom, generalized linear model, logistic, spss

I am teaching a class on logistic regression with SPSS. The textbook supplies a sample data set with a binary response and two numeric predictors. The sample contains 1000 rows, and many of them share common values on both predictors: one predictor takes only 5 distinct values, for example, and the other takes around 20.

According to the SPSS documentation, when this happens SPSS treats the data as coming from subpopulations defined by the common covariate values. This produces a different likelihood, and different degrees of freedom for the AIC, than what you get if you ignore subpopulations.
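To see why the two likelihoods differ, here is a toy sketch (hypothetical counts and fitted probabilities, not the textbook data). The case-level Bernoulli log-likelihood and the aggregated binomial log-likelihood differ only by the sum of log binomial coefficients, a data-dependent constant that inflates the grouped log-likelihood and shrinks its AIC:

```python
import math

# Hypothetical aggregated data: three covariate patterns, each with n_j cases,
# y_j "successes", and a fitted probability p_j. The fitted p_j are the same
# under both views, since the coefficient estimates agree.
patterns = [(200, 40, 0.2), (300, 150, 0.5), (500, 400, 0.8)]  # (n_j, y_j, p_j)

# Case-level (Bernoulli) log-likelihood, as R's glm uses with 0/1 responses:
ll_bernoulli = sum(y * math.log(p) + (n - y) * math.log(1 - p)
                   for n, y, p in patterns)

# Grouped (binomial) log-likelihood over covariate patterns, as the
# subpopulation approach uses -- note the extra log C(n_j, y_j) term:
ll_binomial = sum(math.log(math.comb(n, y))
                  + y * math.log(p) + (n - y) * math.log(1 - p)
                  for n, y, p in patterns)

# The difference is a constant that does not depend on the coefficients:
constant = sum(math.log(math.comb(n, y)) for n, y, _ in patterns)
print(ll_bernoulli, ll_binomial, constant)
```

Because that constant is positive, the grouped log-likelihood is always larger (less negative), which is consistent with SPSS reporting a much smaller AIC than R for the same fitted coefficients.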

I ran the data set through glm in R: residual degrees of freedom were 997 and AIC = 508.93.

In SPSS I get 99 degrees of freedom (for goodness-of-fit purposes) and AIC = 181.341. The coefficient estimates are identical in both packages.
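The degrees-of-freedom gap has the same arithmetic shape. A sketch, assuming 3 estimated parameters (intercept plus two predictors) and working backwards from the reported 99 df to an assumed count of distinct covariate patterns:

```python
# Case-level residual df: number of cases minus number of parameters.
n_cases = 1000
n_params = 3  # intercept + 2 predictors (assumed)
df_case_level = n_cases - n_params  # 997, matching R's glm

# Aggregated goodness-of-fit df: number of distinct covariate patterns
# minus number of parameters. SPSS's reported 99 df would correspond to
# about 102 observed patterns (an inference, not a value from the post).
m_patterns = 102
df_aggregated = m_patterns - n_params  # 99

print(df_case_level, df_aggregated)
```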

To make matters worse, when I fit the model in SPSS with only 1 of the 2 predictors, the log-likelihood is LARGER than with the 2-predictor model: -87 for the 2-predictor model versus -47 for the 1-predictor model. The AIC is also dramatically smaller for the 1-predictor model, even though everything else suggests that both predictors are significant and necessary. So much for the AIC criterion.

I jittered the data in R, and sent it back to SPSS. I then got much the same results as in R with glm, since there were no phantom "subpopulations" for SPSS to cope with.
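The jittering trick works because adding tiny noise makes previously tied covariate values distinct, so every case becomes its own covariate pattern and nothing gets aggregated. A minimal sketch of the idea (the function name and noise amount are illustrative, not from the post):

```python
import random

random.seed(1)

def jitter(values, amount=1e-6):
    """Add tiny uniform noise so tied covariate values become distinct,
    preventing cases from being aggregated into covariate patterns."""
    return [v + random.uniform(-amount, amount) for v in values]

x = [1.0, 1.0, 2.0, 2.0, 2.0]  # 5 cases, only 2 distinct values
xj = jitter(x)

print(len(set(x)), len(set(xj)))  # far fewer distinct values before than after
```

The noise is small enough to leave the coefficient estimates essentially unchanged while changing how the likelihood and goodness-of-fit statistics are bookkept.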

Questions:

  1. Can someone supply a reference to justify treating the data as coming from subpopulations (which they actually don't in this case) when the predictors contain common value combinations?
  2. How am I supposed to teach model testing by comparing the deviance between two models, using SPSS and this data set, given what's going on?
  3. Can I make SPSS behave like R?

Best Answer

Apparently you are using the NOMREG procedure; the passage quoted below is from the SPSS NOMREG help. Note that you can also use the LOGISTIC REGRESSION procedure or the newer GENLIN procedure to fit a binary logistic model. All three will give the same coefficients and standard errors but may differ in other output.

> Binary logistic regression models can be fitted using either the Logistic Regression procedure or the Multinomial Logistic Regression procedure. Each procedure has options not available in the other.
>
> An important theoretical distinction is that the Logistic Regression procedure produces all predictions, residuals, influence statistics, and goodness-of-fit tests using data at the individual case level, regardless of how the data are entered and whether or not the number of covariate patterns is smaller than the total number of cases, while the Multinomial Logistic Regression procedure internally aggregates cases to form subpopulations with identical covariate patterns for the predictors, producing predictions, residuals, and goodness-of-fit tests based on these subpopulations.
>
> If all predictors are categorical or any continuous predictors take on only a limited number of values—so that there are several cases at each distinct covariate pattern—the subpopulation approach can produce valid goodness-of-fit tests and informative residuals, while the individual case level approach cannot.
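The internal aggregation the help describes amounts to grouping cases by their covariate pattern and keeping trial and success counts per group. A toy sketch of that bookkeeping (the data are made up; this is an illustration of the idea, not SPSS's actual implementation):

```python
from collections import Counter

# Toy case-level data: (x1, x2) covariate values and a 0/1 response.
rows = [((1, 10.0), 1), ((1, 10.0), 0), ((2, 10.0), 1),
        ((1, 10.0), 1), ((2, 10.0), 0), ((2, 12.5), 1)]

# Aggregate cases with identical covariate patterns into "subpopulations",
# tracking trials n_j and successes y_j for each pattern.
trials = Counter()
successes = Counter()
for pattern, y in rows:
    trials[pattern] += 1
    successes[pattern] += y

for pattern in sorted(trials):
    print(pattern, successes[pattern], "/", trials[pattern])
```

Six cases collapse to three subpopulations here; with the textbook data, 1000 cases collapse to roughly a hundred patterns, which is why the reported goodness-of-fit degrees of freedom drop so sharply.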