I'm not totally sure I understand your question, but I can remark on his claims and on your confusion about the example model.
Andrew is not quite clear about whether the scientific interest lies in the height-adjusted sex-income association or the sex-adjusted height-income association. In a causal model framework, sex causes height but height does not cause sex. So if we want the effect of sex, adjusting for height would introduce mediator bias (and possibly collider bias too, since rich people are taller!). I find it confusing, and a little funny, when applied research interprets the other "covariates" (confounders and precision variables) included in a model. Those coefficients are nonsense; the variables simply provide adequate stratification to make the comparison of interest. Adjusting for height, if you are interested in inference on sex-based differences in income, is the wrong thing to do.
I agree counterfactuals are not necessary to explain Simpson's paradox. It can simply be a trait intrinsic to the data. I think both crude and adjusted RRs are in some sense correct without being causal. It is more problematic, of course, when the objective is causal analysis, and overadjustment reveals problems of non-collapsibility (which pushes an OR away from the null) and insufficient sample size.
As a reminder for the readers: Simpson's paradox is a very specific phenomenon in which an association flips direction after stratifying on a third variable. The Berkeley admissions data were the motivating example. There, crude RRs showed that women were less likely to be accepted to Berkeley. However, once stratified by department, the RRs showed that women were more likely to be accepted in every single department; they were simply more likely to apply to the competitive departments that rejected many applicants.
Now, in causal inference theory, we would be befuddled by the claim that the department one applies to causes gender. Gender is intrinsic, right? Well, yes and no. Miettinen argues for a "study base" approach to such problems: who is the population? It is not all eligible students; it is those who specifically applied to Berkeley. The more competitive departments attracted women who would not otherwise have applied to Berkeley at all. To expand: a profoundly intelligent woman wants to get into the best, say, engineering program. If Berkeley had not had a great engineering program, she would not have applied to Berkeley at all; she would have applied to MIT or CalPoly. In that light, within the "applying student" population, department causes gender and is a confounder. (Caveat: I'm a first-generation college student, so I don't know much about which programs are renowned for what.)
So how do we summarize these data? It is true that Berkeley was more likely to admit a man who applied than a woman. And it is true that each department of Berkeley was more likely to admit a woman who applied than a man. Crude and stratified RRs are both sensible measures even if they are non-causal. This underscores how important it is for statisticians to be precise with wording (the humble author does not presume himself to be remotely precise).
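To make the reversal concrete, here is a small sketch with invented counts (not the actual Berkeley numbers) in which the crude RR favors men while every department-specific RR favors women:

```python
# Toy counts, invented for illustration; each tuple is (applicants, admits).
depts = {
    "easy": {"men": (100, 80), "women": (20, 18)},
    "hard": {"men": (20, 4),   "women": (100, 25)},
}

def risk(applied, admitted):
    return admitted / applied

# Crude RR (women vs. men), pooling over departments.
men_n = sum(d["men"][0] for d in depts.values())
men_a = sum(d["men"][1] for d in depts.values())
women_n = sum(d["women"][0] for d in depts.values())
women_a = sum(d["women"][1] for d in depts.values())
crude_rr = risk(women_n, women_a) / risk(men_n, men_a)

# Department-stratified RRs.
strat_rr = {name: risk(*d["women"]) / risk(*d["men"])
            for name, d in depts.items()}

print(f"crude RR (women/men): {crude_rr:.2f}")  # below 1: men favored
for name, rr in strat_rr.items():
    print(f"  {name}: RR = {rr:.2f}")           # both above 1: women favored
```

The flip occurs because the women disproportionately apply to the "hard" department, exactly as in the original data.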
Confounding is a phenomenon distinct from non-collapsibility, another form of omitted-variable bias, but one known to produce milder effects on estimates. Unlike in logistic regression, non-collapsibility does not cause bias in linear regression, and the consideration of a continuous outcome in Gelman's example should have been described more thoroughly.
Andrew's interpretation of the sex coefficient in his sex/height-adjusted income model reveals the nature of the model's assumptions: the assumption of linearity. Indeed, in the linear model such comparisons between men and women are enabled because, for a specific woman, we can predict what a male of similar height might have earned, even if he was not observed. This also holds if one allows for effect modification, so that the slope of the trend for women differs from that for men. On the other hand, I don't think it is so crazy to conceive of men and women of the same height; 66 inches would indeed be a tall woman and a short man. It seems a mild projection to me, rather than gross extrapolation. Furthermore, since the model assumptions can be stated clearly, readers can understand that the sex-stratified income-height association bears information which is borrowed across, or averaged between, the samples of males and females. If such an association were the object of inference, the earnest statistician would of course consider the possibility of effect modification.
"Accepting H$_{0}$" is always a logical fallacy (i.e. lack of significance always means "failed to reject"). Interpretively, this means you did not find evidence of the interaction grade*sex.
The reason you can only state that you did not find evidence of X with tests for difference is that these tests only tell you how likely you are to see $\hat{\beta}_{\text{grade}\times\text{sex}}$ if H$_{0}$ is true, and your test achieves its desired power only for effects at least as large as the one it was designed to detect, not for all (smaller) possible values under H$_{\text{A}}$.
If you want to state that you found (or did not find) evidence of an absence of X, then you need, for example, tests for equivalence (say, using two one-sided tests), where H$_{0}$ no longer takes the form H$_{0}^{+}\text{: }\theta=0$ but rather H$_{0}^{-}\text{: }|\theta| \ge \Delta$, where $\Delta$ is a researcher-specified value meaning "too small a difference to care about". (The '$+$' and '$-$' superscripts indicate null hypotheses for difference and for equivalence, respectively.)
To perform an equivalence test on grade*sex (i.e. to provide evidence that there is no interaction), you will need a few things:
- $\theta$: the effect you are estimating for grade*sex (i.e. the coefficient $\hat{\beta}_{\text{grade}\times\text{sex}}$)
- $\Delta$: an effect size that is too small to care about (e.g. we do not care about $-0.1 \leq \beta_{\text{grade}\times\text{sex}} \leq 0.1$). A $\Delta=0.1$ is not magical; I only use it here as an imaginary value, and you need to decide on one yourself.
- $s_{\theta}$: the standard error of your estimate (i.e. the standard error of $\hat{\beta}_{\text{grade}\times\text{sex}}$)
Given that, then:
H$_{0}^{-}\text{: }|\beta_{\text{grade}\times\text{sex}}| \ge \Delta$, which gives two one-sided null hypotheses:
H$_{01}\text{: }\beta_{\text{grade}\times\text{sex}} \ge \Delta$, and
H$_{02}\text{: }\beta_{\text{grade}\times\text{sex}} \le -\Delta$
The test statistics corresponding to both of these are:
$$t_{1} = \frac{\Delta - \hat{\beta}_{\text{grade}\times\text{sex}}}{s_{\hat{\beta}_{\text{grade}\times\text{sex}}}}$$
$$t_{2} = \frac{\hat{\beta}_{\text{grade}\times\text{sex}}+ \Delta}{s_{\hat{\beta}_{\text{grade}\times\text{sex}}}}$$
These are both right-side/upper tail tests, so you get the p-values:
$p_{1}=\text{P}\left(T_{df} \ge t_{1}\right)$, and
$p_{2}=\text{P}\left(T_{df} \ge t_{2}\right)$
If both H$_{01}$ and H$_{02}$ are rejected with $p\le\alpha$ (not $p \le \alpha/2$), then, taken together with the failure to reject H$_{0}^{+}$, you can conclude you found evidence that $\beta_{\text{grade}\times\text{sex}}$ is equivalent to zero, given $\alpha$ and $\Delta$.
However, if you reject only one, or neither, of H$_{01}$ and H$_{02}$, then, taken together with the failure to reject H$_{0}^{+}$, you cannot conclude anything: your results are indeterminate because your data are underpowered.
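The whole procedure above can be sketched in a few lines. The numbers below (estimate, standard error, $\Delta$, degrees of freedom) are invented for illustration, and `tost` is a hypothetical helper, not a library function:

```python
from scipy import stats

def tost(beta_hat, se, delta, df, alpha=0.05):
    """Two one-sided tests (TOST) for |beta| < delta.
    Hypothetical helper, not a library function."""
    t1 = (delta - beta_hat) / se   # tests H01: beta >= delta
    t2 = (beta_hat + delta) / se   # tests H02: beta <= -delta
    p1 = stats.t.sf(t1, df)        # both are upper-tail p-values
    p2 = stats.t.sf(t2, df)
    # Equivalence is concluded only if BOTH nulls are rejected,
    # each at level alpha (no alpha/2 splitting).
    return p1, p2, (p1 <= alpha) and (p2 <= alpha)

# Invented numbers for illustration (not from any real fit):
p1, p2, equivalent = tost(beta_hat=0.02, se=0.03, delta=0.1, df=96)
print(p1, p2, equivalent)  # both p-values small -> evidence of equivalence
```

Note that reporting $\max(p_1, p_2)$ as the single TOST p-value is equivalent to requiring both rejections.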
If you are conducting studies to deepen your theoretical understanding of some topic, this is a great thing to wonder about. Fortunately, there are well-developed statistical methods for assessing this question. What you do is fit both a full model that encompasses the possibility that the relationship differs between men and women, and a reduced model that assumes there is no such difference. Then you perform a nested model test.
The way to make a model that allows for there to be a differing relationship by sex is to include an interaction term in addition to variables for income and sex. Here is what such a model would look like:
$$ \text{Happiness}=\beta_0 + \beta_1\text{Income} + \beta_2\text{Sex} + \beta_3\text{Income}\times\text{Sex} + \varepsilon $$ Note that sex would be represented by a dummy code, that is, a vector of $1$s and $0$s, where the $1$s indicate, e.g., that the person is a man. The reduced model would look like this:
$$ \text{Happiness}=\beta_0 + \beta_1\text{Income} + \varepsilon $$ Thus, the models differ in two parameters, and the larger model 'reduces' to the smaller one if $\beta_2=\beta_3=0$. To simultaneously test whether both parameters are 0, you perform a nested model test. (I have discussed such tests here: Testing for moderation with continuous vs categorical moderators, albeit in a different context.)
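As a sketch of the nested model test, here is a hand-rolled $F$-test on simulated data. The data-generating values are invented; in practice you would use your own data and a packaged routine such as `anova()` in R or `anova_lm` in statsmodels:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
income = rng.normal(size=n)
sex = rng.integers(0, 2, size=n)  # dummy code: 1 = man, 0 = woman
# Simulated truth with a genuine sex difference (values are invented).
happiness = (1.0 + 0.5 * income + 2.0 * sex
             + 1.0 * income * sex + rng.normal(size=n))

def rss(X, y):
    """Residual sum of squares from an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return resid @ resid

ones = np.ones(n)
X_full = np.column_stack([ones, income, sex, income * sex])
X_red = np.column_stack([ones, income])

rss_full, rss_red = rss(X_full, happiness), rss(X_red, happiness)
df_num = X_full.shape[1] - X_red.shape[1]  # the two restricted parameters
df_den = n - X_full.shape[1]
F = ((rss_red - rss_full) / df_num) / (rss_full / df_den)
p = stats.f.sf(F, df_num, df_den)
print(f"F({df_num}, {df_den}) = {F:.2f}, p = {p:.3g}")
```

A small $p$ here is evidence against the reduced model, i.e. evidence that $\beta_2$ and $\beta_3$ are not both zero.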
If you decide to keep the larger model, the implication is that the relationship between income and happiness for women is:
$$ \text{Happiness}=\beta_0 + \beta_1\text{Income} + \varepsilon $$ And the relationship for men is:
$$ \text{Happiness}=\underbrace{(\beta_0 + \beta_2)}_{\text{intercept}} + \underbrace{(\beta_1+\beta_3)}_{\text{slope}}\text{Income} + \varepsilon $$ (Again, this assumes that men are $1$, and women are $0$.)
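To see that the interaction model really does encode a separate line per sex, the sketch below (simulated, illustrative data) checks that the implied intercept and slope for men, $\beta_0+\beta_2$ and $\beta_1+\beta_3$, exactly match a regression fit to the men alone:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
income = rng.normal(size=n)
sex = rng.integers(0, 2, size=n)  # dummy code: 1 = man, 0 = woman
happiness = (1.0 + 0.5 * income + 0.8 * sex
             - 0.3 * income * sex + rng.normal(size=n))

def ols(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Full interaction model: intercept, income, sex, income x sex.
X = np.column_stack([np.ones(n), income, sex, income * sex])
b0, b1, b2, b3 = ols(X, happiness)

# A simple regression fit to the men alone.
m = sex == 1
Xm = np.column_stack([np.ones(m.sum()), income[m]])
a_men, s_men = ols(Xm, happiness[m])

# The implied line for men matches the men-only fit exactly.
print(np.allclose([b0 + b2, b1 + b3], [a_men, s_men]))  # True
```

This equivalence holds because the interaction model is a reparameterization of fitting each sex separately; the joint fit adds value through the pooled error variance and the ability to test $\beta_2=\beta_3=0$.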