Solved – Comparing logistic regression coefficients across models

logistic regression, regression coefficients, spss

I've developed a logit model to be applied to six different sets of cross-sectional data. What I'm trying to uncover is whether the substantive effect of a given independent variable (IV) on the dependent variable (DV), controlling for other explanations, changes across these time points.

My questions are:

  • How do I assess whether the size of the association between the IV and the DV has increased or decreased?

  • Can I simply compare the magnitudes (sizes) of the coefficients across the models, or do I need to go through some other process?

  • If I need to do something else, what is it, and can it be done in SPSS? If so, how?

    Also, within a single model,

  • Can I compare the relative sizes of the effects of the independent variables using unstandardised coefficients if all the variables are coded 0–1, or do I need to convert them to standardised scores?

  • Are there problems involved with standardised scores?

Best Answer

I will mainly focus on your first three questions. The short answers are: (1) you need to compare the effect of the IV on the DV across the time periods, but (2) simply comparing the magnitudes of the coefficients can lead to wrong conclusions, and (3) there are several proposed ways of making the comparison, but no consensus on which one is correct.

Below I describe why you cannot simply compare coefficient magnitudes and point you to some of the solutions that have been proposed so far.

According to Allison (1999), unlike OLS coefficients, logistic regression coefficients are affected by unobserved heterogeneity even when that heterogeneity is unrelated to the variable of interest.
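To see what this claim means in practice, here is a small simulation sketch in R (all variable names and numbers are illustrative, not taken from the sources): an unobserved variable u is generated independently of x, yet leaving it out of the model shrinks the estimated logit coefficient on x toward zero.

```r
# Illustrative simulation: omitting a covariate that is INDEPENDENT of x
# still attenuates the logit coefficient on x (this would not happen in OLS).
set.seed(42)
n <- 100000
x <- rnorm(n)
u <- rnorm(n)                               # unobserved heterogeneity, independent of x
y <- rbinom(n, 1, plogis(1 * x + 2 * u))    # true coefficient on x is 1

coef(glm(y ~ x + u, family = binomial))["x"]  # close to the true value of 1
coef(glm(y ~ x,     family = binomial))["x"]  # noticeably smaller than 1
```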

When you fit a logistic regression such as the following, where $p_i$ is the probability that observation $i$ takes the value $1$:

(1)$$ \ln\bigg(\frac{p_i}{1-p_i}\bigg) = \beta_{0} + \beta_{1}x_{1i} $$

You are in fact fitting an equation for a latent variable $y^*_i$ that represents each observation's underlying propensity to take the value $1$ on the binary dependent variable: the observed outcome is $1$ whenever $y^*_i$ exceeds a certain threshold. The latent-variable equation is (Williams, 2009):

(2)$$ y^*_i = \alpha_{0} + \alpha_{1}x_{1i} + \sigma \varepsilon_i $$

The term $\varepsilon_i$ is assumed to be independent of the other terms and to follow a standard logistic distribution (a standard normal distribution in the case of probit, an extreme-value distribution in the case of complementary log-log, and a Cauchy distribution in the case of cauchit).

According to Williams (2009), the $\alpha$ coefficients in equation 2 are related to the $\beta$ coefficients in equation 1 through:

(3)$$ \beta_{j} = \frac{\alpha_{j}}{\sigma}\;\;j=1,...,J. $$

In equations 2 and 3, $\sigma$ is the scale of the unobserved variation, and the size of the estimated $\beta$ coefficients therefore depends on $\sigma$, which is not observed. Based on that, Allison (1999), Williams (2009), and Mood (2010), among others, argue that you cannot naively compare coefficients between logistic models estimated for different groups, countries or periods.

This is because the comparison will be misleading whenever the unobserved variation differs between the groups, countries or periods being compared. Both comparisons across separately estimated models and comparisons using group interaction terms within a single model suffer from this problem, as the simulation sketch below illustrates. Besides logit, the problem also affects its cousins probit, complementary log-log and cauchit and, by extension, discrete-time hazard models estimated with these link functions. Ordered logit models are affected as well.
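The following sketch makes the scaling problem concrete (hypothetical numbers again): the true effect $\alpha_1$ of $x$ is identical in two groups, but the group with larger unobserved variation $\sigma$ yields a smaller fitted logit coefficient, so a naive comparison would wrongly conclude that the effects differ.

```r
# Two groups with the SAME true alpha_1 but different sigma: by equation 3,
# the fitted logit coefficients are beta = alpha_1 / sigma, not alpha_1.
set.seed(1)
n <- 100000
x <- rnorm(n)

y_a <- rbinom(n, 1, plogis(1 * x / 1))  # group A: alpha_1 = 1, sigma = 1
y_b <- rbinom(n, 1, plogis(1 * x / 2))  # group B: alpha_1 = 1, sigma = 2

coef(glm(y_a ~ x, family = binomial))["x"]  # about 1.0
coef(glm(y_b ~ x, family = binomial))["x"]  # about 0.5, despite the same alpha_1
```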

Williams (2009) argues that the solution is to model the unobserved variation explicitly through a heterogeneous choice model (a.k.a. a location–scale model), and provides a Stata add-on called oglm for that purpose (Williams, 2010). In R, heterogeneous choice models can be fit with the hetglm() function of the glmx package, which is available on CRAN. Both programs are very easy to use. Lastly, Williams (2009) mentions SPSS's PLUM routine for fitting these models, but I have never used it and cannot comment on how easy it is to use.
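As an illustration, here is a hedged sketch of how such a model could be fit with glmx::hetglm(); the data frame dat and the variables y, x and group are hypothetical. In the two-part formula, the terms before | form the mean equation and the terms after it form the variance (scale) equation:

```r
# Sketch only: assumes the glmx package is installed and `dat` contains a
# binary outcome y, a predictor x, and a grouping factor group.
library(glmx)

fit <- hetglm(y ~ x * group | group,      # mean equation | scale equation
              data   = dat,
              family = binomial(link = "logit"))
summary(fit)  # the scale coefficients show how residual variation differs by group
```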

However, there is at least one working paper showing that comparisons based on heterogeneous choice models can be even more biased if the variance equation is misspecified or the variables are measured with error.

Mood (2010) lists other solutions that do not involve modelling the variance but instead compare changes in predicted probabilities across models.
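One such quantity is the average marginal effect (AME) of the IV, which can be computed for each period's model and then compared; a minimal sketch in base R, reusing the hypothetical dat from above:

```r
# Average marginal effect of x in a logit model: the mean over observations
# of the derivative of the predicted probability with respect to x.
fit   <- glm(y ~ x, family = binomial, data = dat)
eta   <- predict(fit, type = "link")          # linear predictor for each case
ame_x <- mean(dlogis(eta)) * coef(fit)["x"]   # average dP/dx
ame_x  # average change in predicted probability per unit change in x
```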

This issue is apparently not settled: I often see conference papers in my field (sociology) proposing different solutions to it. I would advise you to look at what people in your field do and then decide how to deal with it.

References

Allison, P. D. (1999). Comparing logit and probit coefficients across groups. Sociological Methods & Research, 28(2), 186–208.

Mood, C. (2010). Logistic regression: Why we cannot do what we think we can do, and what we can do about it. European Sociological Review, 26(1), 67–82.

Williams, R. (2009). Using heterogeneous choice models to compare logit and probit coefficients across groups. Sociological Methods & Research, 37(4), 531–559.

Williams, R. (2010). Fitting heterogeneous choice models with oglm. The Stata Journal, 10(4), 540–567.
