Solved – Understanding Simpson’s paradox: Andrew Gelman’s example with regressing income on sex and height

Tags: interaction, regression, simpsons-paradox

In one of his recent blog posts, Andrew Gelman says:

  1. I do not think counterfactuals or potential outcomes are necessary for
     Simpson’s paradox. I say this because one can set up Simpson’s paradox
     with variables that cannot be manipulated, or for which manipulations
     are not directly of interest.

  2. Simpson’s paradox is part of a more general issue that regression
     coefs change if you add more predictors; the flipping of sign is not
     really necessary.

Here’s an example that I use in my teaching that illustrates both
points:

I can run a regression predicting income from sex and height. I find
that the coef of sex is \$10,000 (i.e., comparing a man and woman of
the same height, on average the man will make \$10,000 more) and the
coefficient of height is \$500 (i.e., comparing two men or two women
of different heights, on average the taller person will make \$500
more per inch of height).

How can I interpret these coefs? I feel that the coef of height is
easy to interpret (it’s easy to imagine comparing two people of the
same sex with different heights), indeed it would seem somehow “wrong”
to regress on height without controlling for sex, as much of the raw
difference between short and tall people can be “explained” by being
differences between men and women. But the coef of sex in the above
model seems very difficult to interpret: why compare a man and a woman
who are both 66 inches tall, for example? That would be a comparison
of a short man with a tall woman. All this reasoning seems vaguely
causal but I don’t think it makes sense to think about it using
potential outcomes.

I pondered over it (and even commented on the post), and I think there is something here that deserves to be understood more clearly.

Everything is fine up to the part about interpreting the gender coefficient. But I do not see the problem with comparing a short man and a tall woman. Here is my point: it in fact makes even greater sense (given the assumption that men are taller on average). You cannot compare a 'short' man and a 'short' woman for exactly the same reason: part of the difference in income is explained by the difference in heights. The same goes for tall men and tall women, and even more so for short women and tall men (which is further out of the question, so to speak). So the effect of height is eliminated only when short men and tall women are compared, and this is what helps in interpreting the coefficient on gender. Doesn't this echo the ideas underlying the popular matching models?

The idea behind Simpson's paradox is that the population-level effect may differ from the subgroup-level effects. This relates to his point 2 and to his acknowledgment that height should not be the only control (what we would call omitted variable bias). But I could not connect this to the controversy over the coefficient on gender.

Perhaps you can express it more clearly, or comment on my understanding?

Best Answer

I'm not totally sure what your question is, but I can remark on his claims and on your confusion about the example model.

Andrew is not quite clear about whether the scientific interest lies in the height-adjusted sex–income association or the sex-adjusted height–income association. In a causal-model framework, sex causes height but height does not cause sex. So if we want the effect of sex, adjusting for height would introduce mediator bias (and possibly collider bias too, since rich people are taller!). I find it confusing and funny when applied research interprets the other "covariates" (confounders and precision variables) included in a model. Those interpretations are nonsense; the covariates simply provide adequate stratification to make the comparison of interest. Adjusting for height, if you are interested in inference on sex-based differences in income, is the wrong thing to do.
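To make the mediator point concrete, here is a minimal sketch (all numbers invented, the data-generating process is hypothetical): income has a direct sex effect plus a height effect, and height itself depends on sex. Regressing income on sex alone recovers the *total* effect of sex; adding height strips out the mediated part and shrinks the sex coefficient to the direct effect.

```python
# Hypothetical data-generating process (all numbers invented):
#   height = 61 + 5*sex + offset        (men ~5 inches taller on average)
#   income = 20000 + 500*height + 10000*sex
# Sex affects income directly (+10000) and through height (+500 * 5).

def ols(X, y):
    """Least squares via normal equations and Gaussian elimination."""
    k = len(X[0])
    A = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    for c in range(k):                      # forward elimination with pivoting
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p], b[c], b[p] = A[p], A[c], b[p], b[c]
        for r in range(c + 1, k):
            f = A[r][c] / A[c][c]
            A[r] = [arj - f * acj for arj, acj in zip(A[r], A[c])]
            b[r] -= f * b[c]
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):          # back substitution
        beta[r] = (b[r] - sum(A[r][j] * beta[j] for j in range(r + 1, k))) / A[r][r]
    return beta

rows, income = [], []
for sex in (0, 1):                          # 0 = woman, 1 = man
    for offset in (-2, -1, 0, 1, 2):        # identical height spread per sex
        height = 61 + 5 * sex + offset
        rows.append([1.0, sex, height])
        income.append(20000 + 500 * height + 10000 * sex)

total = ols([r[:2] for r in rows], income)[1]   # sex alone: total effect
direct = ols(rows, income)[1]                   # sex, adjusting for height
print(round(total), round(direct))              # 12500 10000
```

The sex-only coefficient (12500) is the direct effect plus the mediated piece (10000 + 500 × 5); whether that or the height-adjusted 10000 is "right" depends entirely on which estimand you want.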

I agree that counterfactuals are not necessary to explain Simpson's paradox; it can simply be a trait intrinsic to the data. I think both the crude and the adjusted RRs are in some sense correct without being causal. It is more problematic, of course, when the objective is causal analysis, and overadjustment reveals problems of non-collapsibility (which inflates an OR) and insufficient sample size.

As a reminder for readers: Simpson's paradox is a very specific phenomenon in which an association flips direction after controlling for a confounding variable. The Berkeley admissions data were the motivating example. There, the crude RRs showed that women were less likely to be accepted to Berkeley. Once stratified by department, however, the RRs showed that women were more likely to be accepted in most departments; they were simply more likely to apply to the difficult departments that rejected many applicants.
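A toy version of that pattern, with invented counts for two hypothetical departments, can be tabulated directly: women do better *within* each department yet worse overall, because they disproportionately apply to the department that rejects most applicants.

```python
# Invented counts illustrating the Berkeley pattern: (applied, admitted).
data = {
    "easy": {"men": (80, 60), "women": (20, 16)},
    "hard": {"men": (20, 4),  "women": (80, 20)},
}

def rate(applied, admitted):
    return admitted / applied

# Department-specific acceptance rates: women higher in BOTH departments.
dept_rates = {d: {s: rate(*by[s]) for s in by} for d, by in data.items()}

# Crude (pooled) acceptance rates: men higher overall.
crude = {
    s: sum(data[d][s][1] for d in data) / sum(data[d][s][0] for d in data)
    for s in ("men", "women")
}

for d, r in dept_rates.items():
    print(f"{d}: men {r['men']:.0%} vs women {r['women']:.0%}")
print(f"overall: men {crude['men']:.0%} vs women {crude['women']:.0%}")
# easy: men 75% vs women 80%
# hard: men 20% vs women 25%
# overall: men 64% vs women 36%
```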

Now, in causal-inference theory, we would be befuddled by the idea that the department one applies to causes gender. Gender is intrinsic, right? Well, yes and no. Miettinen argues for a "study base" approach to such problems: who is the population? It is not all eligible students; it is those who specifically applied to Berkeley. The more competitive departments attracted women to apply to Berkeley who would not have applied otherwise. To expand: a profoundly intelligent woman wants to get into the best, say, engineering program. If Berkeley had not had a great engineering program, she would not have applied to Berkeley at all; she would have applied to MIT or Cal Poly. In that light, within the "applying student" population, department causes gender and is a confounder. (Caveat: I'm a first-generation college student, so I don't know much about which programs are renowned for what.)

So how do we summarize these data? It is true that Berkeley was more likely to admit a man who applied than a woman. And it is true that the individual departments of Berkeley tended to be more likely to admit women than men. The crude and stratified RRs are both sensible measures even though they are non-causal. This underscores how important it is for statisticians to be precise with wording (the humble author does not presume himself to be remotely precise).

Confounding is a phenomenon distinct from non-collapsibility, another consequence of omitting a variable, but one known to produce milder effects on estimates. Unlike in logistic regression, non-collapsibility does not cause bias in linear regression, and the fact that Gelman's example involves a continuous outcome in a linear model should have been discussed more thoroughly.
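To see non-collapsibility of the odds ratio without any confounding at all, here is a small sketch (model and coefficients invented): a covariate Z is independent of the exposure X, yet the marginal OR, obtained by averaging risks over Z, sits closer to the null than the common conditional OR within each stratum of Z.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Hypothetical model: logit P(Y=1 | X, Z) = -1 + 2*X + 2*Z,
# with Z ~ Bernoulli(0.5) independent of X, so Z is NOT a confounder.
conditional_or = math.exp(2)            # e^2 ≈ 7.39 in both strata of Z

def p_marginal(x):
    # risk of Y=1 given X=x, averaged over the distribution of Z
    return 0.5 * sigmoid(-1 + 2 * x) + 0.5 * sigmoid(1 + 2 * x)

def odds(p):
    return p / (1 - p)

marginal_or = odds(p_marginal(1)) / odds(p_marginal(0))
print(round(conditional_or, 2), round(marginal_or, 2))   # 7.39 5.32
```

The conditional OR (≈7.39) is farther from 1 than the marginal OR (≈5.32) even though nothing is confounded, which is why adjusting in a logistic model "inflates" the OR relative to the crude one. A risk ratio or a linear-model coefficient would collapse cleanly here.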

Andrew's interpretation of the sex coefficient in his sex- and height-adjusted income model reveals the nature of the model's assumptions, namely linearity. Indeed, in the linear model such comparisons between men and women are possible because, for a specific woman, we can predict what a male of similar height might have earned, even if he was not observed. This remains the case if one allows for effect modification, so that the slope of the trend in women differs from that in men. On the other hand, I don't think it's so crazy to conceive of a man and a woman of the same height; 66 inches would indeed be a tall woman and a short man. That seems a mild projection to me rather than gross extrapolation. Furthermore, since the model assumptions can be stated clearly, they help readers understand that the sex-stratified income–height association carries information which is borrowed across, or averaged between, the samples of males and females. If such an association were the object of inference, the earnest statistician would of course consider the possibility of effect modification.
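As a sketch of that last point (all heights and incomes invented), allowing effect modification just means fitting a separate income–height line within each sex and reading both fits at the same height, say 66 inches. For the women here 66 inches lies inside the observed range; for the men it sits one inch below it, a mild projection rather than gross extrapolation.

```python
# Invented, exactly linear data: women's slope 400 $/inch, men's 600 $/inch.
def fit_line(xs, ys):
    """Simple least-squares slope and intercept for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return my - slope * mx, slope

women_h = [61, 63, 65, 67]
women_inc = [48000 + 400 * h for h in women_h]
men_h = [67, 69, 71, 73]                 # 66 is just below the men's range
men_inc = [52000 + 600 * h for h in men_h]

preds = {}
for label, xs, ys in [("women", women_h, women_inc), ("men", men_h, men_inc)]:
    a, b = fit_line(xs, ys)
    preds[label] = a + b * 66            # both fits evaluated at 66 inches
    print(label, round(preds[label]))    # women 74400, men 91600
```

The man–woman comparison at 66 inches is then just the gap between the two fitted lines at that height; the linearity (or per-sex linearity) assumption is what licenses reading the men's line slightly outside its observed support.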