Solved – Generalized Linear Model in SPSS: why are common values among predictors treated as subpopulations?

aic, degrees of freedom, generalized linear model, logistic, spss

I am teaching a class on logistic regression with SPSS. The textbook supplies a sample data set with a binary response and two numeric predictors. The sample contains 1000 rows, and many of them share common values on both predictors: one predictor takes only 5 distinct values, for example, and the other takes around 20.

According to the SPSS documentation, when this happens SPSS treats the data as coming from subpopulations defined by the common covariate values. This produces a different likelihood, and different degrees of freedom for the AIC, than what you get if you ignore subpopulations.
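To see why the two likelihoods differ, here is a toy sketch (hypothetical counts and fitted probabilities, not the textbook data). The case-level Bernoulli log-likelihood and the aggregated binomial log-likelihood differ only by the sum of log binomial coefficients, a data-dependent constant that inflates the grouped log-likelihood and shrinks its AIC:

```python
import math

# Hypothetical aggregated data: three covariate patterns, each with n_j cases,
# y_j "successes", and a fitted probability p_j. The fitted p_j are the same
# under both views, since the coefficient estimates agree.
patterns = [(200, 40, 0.2), (300, 150, 0.5), (500, 400, 0.8)]  # (n_j, y_j, p_j)

# Case-level (Bernoulli) log-likelihood, as R's glm uses with 0/1 responses:
ll_bernoulli = sum(y * math.log(p) + (n - y) * math.log(1 - p)
                   for n, y, p in patterns)

# Grouped (binomial) log-likelihood over covariate patterns, as the
# subpopulation approach uses -- note the extra log C(n_j, y_j) term:
ll_binomial = sum(math.log(math.comb(n, y))
                  + y * math.log(p) + (n - y) * math.log(1 - p)
                  for n, y, p in patterns)

# The difference is a constant that does not depend on the coefficients:
constant = sum(math.log(math.comb(n, y)) for n, y, _ in patterns)
print(ll_bernoulli, ll_binomial, constant)
```

Because that constant is positive, the grouped log-likelihood is always larger (less negative), which is consistent with SPSS reporting a much smaller AIC than R for the same fitted coefficients.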

I ran the data set through glm in R: residual degrees of freedom were 997 and AIC = 508.93.

In SPSS I get 99 degrees of freedom (for goodness-of-fit purposes) and AIC = 181.341. The coefficient estimates are identical in both packages.
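The degrees-of-freedom gap has the same arithmetic shape. A sketch, assuming 3 estimated parameters (intercept plus two predictors) and working backwards from the reported 99 df to an assumed count of distinct covariate patterns:

```python
# Case-level residual df: number of cases minus number of parameters.
n_cases = 1000
n_params = 3  # intercept + 2 predictors (assumed)
df_case_level = n_cases - n_params  # 997, matching R's glm

# Aggregated goodness-of-fit df: number of distinct covariate patterns
# minus number of parameters. SPSS's reported 99 df would correspond to
# about 102 observed patterns (an inference, not a value from the post).
m_patterns = 102
df_aggregated = m_patterns - n_params  # 99

print(df_case_level, df_aggregated)
```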

To make matters worse, when I fit the model in SPSS with only 1 of the 2 predictors, the log-likelihood is LARGER than with the 2-predictor model: -87 for the 2-predictor model versus -47 for the 1-predictor model. The AIC is also dramatically smaller for the 1-predictor model, even though everything else suggests that both predictors are significant and necessary. So much for the AIC criterion.

I jittered the data in R, and sent it back to SPSS. I then got much the same results as in R with glm, since there were no phantom "subpopulations" for SPSS to cope with.
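The jittering trick works because adding tiny noise makes previously tied covariate values distinct, so every case becomes its own covariate pattern and nothing gets aggregated. A minimal sketch of the idea (the function name and noise amount are illustrative, not from the post):

```python
import random

random.seed(1)

def jitter(values, amount=1e-6):
    """Add tiny uniform noise so tied covariate values become distinct,
    preventing cases from being aggregated into covariate patterns."""
    return [v + random.uniform(-amount, amount) for v in values]

x = [1.0, 1.0, 2.0, 2.0, 2.0]  # 5 cases, only 2 distinct values
xj = jitter(x)

print(len(set(x)), len(set(xj)))  # far fewer distinct values before than after
```

The noise is small enough to leave the coefficient estimates essentially unchanged while changing how the likelihood and goodness-of-fit statistics are bookkept.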

Questions:

  1. Can someone supply a reference to justify treating the data as coming from subpopulations (which they actually don't in this case) when the predictors contain common value combinations?
  2. How am I supposed to teach model testing by comparing the deviance between two models, using SPSS and this data set, given what's going on?
  3. Can I make SPSS behave like R?

Best Answer

Apparently you are using the NOMREG procedure; the passage quoted below is from the SPSS NOMREG help. Note that you can also use the LOGISTIC REGRESSION procedure or the newer GENLIN procedure to fit a binary logistic model. All three will give the same coefficients and standard errors but may differ in other output.

> Binary logistic regression models can be fitted using either the Logistic Regression procedure or the Multinomial Logistic Regression procedure. Each procedure has options not available in the other.
>
> An important theoretical distinction is that the Logistic Regression procedure produces all predictions, residuals, influence statistics, and goodness-of-fit tests using data at the individual case level, regardless of how the data are entered and whether or not the number of covariate patterns is smaller than the total number of cases, while the Multinomial Logistic Regression procedure internally aggregates cases to form subpopulations with identical covariate patterns for the predictors, producing predictions, residuals, and goodness-of-fit tests based on these subpopulations.
>
> If all predictors are categorical or any continuous predictors take on only a limited number of values—so that there are several cases at each distinct covariate pattern—the subpopulation approach can produce valid goodness-of-fit tests and informative residuals, while the individual case level approach cannot.
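The internal aggregation the help describes amounts to grouping cases by their covariate pattern and keeping trial and success counts per group. A toy sketch of that bookkeeping (the data are made up; this is an illustration of the idea, not SPSS's actual implementation):

```python
from collections import Counter

# Toy case-level data: (x1, x2) covariate values and a 0/1 response.
rows = [((1, 10.0), 1), ((1, 10.0), 0), ((2, 10.0), 1),
        ((1, 10.0), 1), ((2, 10.0), 0), ((2, 12.5), 1)]

# Aggregate cases with identical covariate patterns into "subpopulations",
# tracking trials n_j and successes y_j for each pattern.
trials = Counter()
successes = Counter()
for pattern, y in rows:
    trials[pattern] += 1
    successes[pattern] += y

for pattern in sorted(trials):
    print(pattern, successes[pattern], "/", trials[pattern])
```

Six cases collapse to three subpopulations here; with the textbook data, 1000 cases collapse to roughly a hundred patterns, which is why the reported goodness-of-fit degrees of freedom drop so sharply.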