It is best to separate the estimation problem from identification of the parameter of interest.
When we use diff-in-diff, we want to estimate an average effect.
The first step is to show that this average effect is identified (that is, calculable from data that we observe).
The second is to construct an estimator that estimates the average effect without bias.
I present an identification argument that spells out the assumptions needed for diff-in-diff to be unbiased.
Take causal inference in one time period.
The goal is to identify the average treatment effect (ATE) or the average treatment effect on the treated (ATT).
In the potential outcomes notation,
$T_i \in \{0, 1\}$ is unit $i$'s treatment assignment,
$Y_i(1)$ is unit $i$'s outcome under treatment,
$Y_i(0)$ is its outcome without treatment,
and $Y_i(T_i)$ is its observed outcome.
The treatment effect on unit $i$ is $Y_i(1) - Y_i(0) \equiv \delta_i$.
The ATE is $\mathbb{E}(\delta_i)$
and the ATT is $\mathbb{E}(\delta_i \mid T_i = 1)$.
The ATT is identified if $Y_i(0) \perp T_i$.
That is, unit $i$'s untreated outcome needs to be independent of its treatment assignment, so that the untreated units' mean outcome stands in for the treated units' unobserved $\mathbb{E}(Y_i(0) \mid T_i = 1)$.
If we also assume $Y_i(1) \perp T_i$, then the ATE is identified too, but in this case it is equal to the ATT.
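As a quick sanity check, a simulation along these lines (all distributions and parameter values below are invented for illustration) shows that under random assignment, which makes both potential outcomes independent of treatment, the simple difference in means recovers the average effect:

```python
import numpy as np

# Hypothetical one-period setup; all numbers are invented for illustration.
rng = np.random.default_rng(0)
n = 200_000
y0 = rng.normal(0.0, 1.0, n)          # potential outcome Y(0)
delta = rng.normal(2.0, 0.5, n)       # unit-level effects, mean 2
y1 = y0 + delta                       # potential outcome Y(1)
t = rng.integers(0, 2, n)             # random assignment: Y(0), Y(1) independent of T
y = np.where(t == 1, y1, y0)          # observed outcome

ate = delta.mean()
diff_in_means = y[t == 1].mean() - y[t == 0].mean()
print(round(ate, 2), round(diff_in_means, 2))  # both close to 2
```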
Diff-in-diff relaxes this assumption by working in a two-period setting.
With two periods, the potential outcomes are $Y_{it}(1)$ and $Y_{it}(0)$ for time periods $t \in \{0, 1\}$, and the observed outcomes are $Y_{it} \equiv Y_{it}(T_i)$.
Let the change in the outcome from the first to the second period be
$\theta_i(1) \equiv Y_{i1}(1) - Y_{i0}(1)$ if unit $i$ is assigned the treatment,
and $\theta_i(0) \equiv Y_{i1}(0) - Y_{i0}(0)$ if it is not.
The treatment effect on unit $i$ is
$\delta_i \equiv \theta_i(1) - \theta_i(0)$.
Diff-in-diff assumes that $\theta_i(0) \perp T_i$
so that the change in the absence of treatment is the same for the treated units as it is for the untreated.
This is called the parallel trends assumption.
Parallel trends are sufficient to identify the ATT
because we can calculate $\mathbb{E}(Y_{i1} - Y_{i0} \mid T_i = 1) - \mathbb{E}(Y_{i1} - Y_{i0} \mid T_i = 0)$ from the data and it is equal to
\begin{align*}
&\mathbb{E}(Y_{i1}(1) - Y_{i0}(1) \mid T_i = 1) - \mathbb{E}(Y_{i1}(0) - Y_{i0}(0) \mid T_i = 0) \\
&\quad = \mathbb{E}(\theta_i(1) \mid T_i = 1) - \underbrace{\mathbb{E}(\theta_i(0) \mid T_i = 0)}_{= \mathbb{E}(\theta_i(0) \mid T_i = 1) \;\text{by parallel trends}} \\
&\quad = \mathbb{E}(\theta_i(1) - \theta_i(0) \mid T_i = 1) \\
&\quad = \mathbb{E}(\delta_i \mid T_i = 1).
\end{align*}
Note that diff-in-diff does allow $\theta_i(1) \not\perp T_i$, so the change under treatment can be different for treated units than for the untreated.
It also allows $Y_{i1}(1) - Y_{i1}(0) \not\perp T_i$, so the treatment effect in the second period can be different for treated units than for the untreated.
If we also assumed $\theta_i(1) \perp T_i$, then the ATE, $\mathbb{E}(\delta_i)$, would also be identified.
However, in this case the ATE and the ATT would again be equal.
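The identification argument can be illustrated with a small simulation. The data-generating process below is hypothetical (all parameter values invented): it imposes parallel trends, $\theta_i(0) \perp T_i$, while deliberately letting the change under treatment depend on assignment, as the argument permits. It also assumes no anticipation, so both groups share the same pre-period outcome.

```python
import numpy as np

# Hypothetical two-period DGP; all parameter values are invented.
rng = np.random.default_rng(1)
n = 200_000
t = rng.integers(0, 2, n)                   # treatment assignment
alpha = rng.normal(0.0, 1.0, n) + 2.0 * t   # levels differ by group
theta0 = rng.normal(1.0, 0.5, n)            # untreated change, drawn independently of t
delta = rng.normal(3.0, 1.0, n) + 1.0 * t   # effect depends on t: theta(1) not independent of T
# No anticipation: nobody is treated in period 0.
y_pre = alpha
y0_post = alpha + theta0                    # parallel trends: theta0 independent of t
y_post = np.where(t == 1, y0_post + delta, y0_post)

att = delta[t == 1].mean()                  # E(delta | T = 1)
did = (y_post[t == 1] - y_pre[t == 1]).mean() \
    - (y_post[t == 0] - y_pre[t == 0]).mean()
print(round(att, 2), round(did, 2))         # both close to 4
```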
Finding an omitted variable that explains both the outcome and the treatment assignment is a sign that parallel trends might be violated.
But this need not be the case.
For example, consider $W_{it}$ as the omitted variable and suppose $W_{i0} \not\perp T_i$ and $W_{i0} \not\perp Y_{i0}(0)$.
Then, in general, $W_{i0} \not\perp Y_{i1}(0) - Y_{i0}(0) = \theta_i(0)$ either.
However, $W_{i0} \not\perp T_i$ and $W_{i0} \not\perp \theta_i(0)$ together do not imply that $T_i \not\perp \theta_i(0)$,
so the ATT might still be identified and diff-in-diff might still be unbiased.
This would be the case if the direction of causality ran from $Y_{it}(0)$ to $W_{i0}$ and from $T_i$ to $W_{i0}$, and not the other way round.
In practice we would think about what our theory tells us about the direction of causality.
If the omitted variable were a consequence of the treatment, we would not be concerned.
But we would be if the omitted variable influenced treatment assignment instead,
or if a third factor influenced both the omitted variable and treatment assignment.
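A hypothetical simulation can make this concrete. Below, $W_{i0}$ is caused by both the pre-period outcome and the treatment, so it is correlated with both, yet parallel trends still holds and diff-in-diff remains unbiased (all parameter values are invented):

```python
import numpy as np

# Hypothetical DGP in which W is a *consequence* of the outcome and the
# treatment; all parameter values are invented.
rng = np.random.default_rng(2)
n = 200_000
t = rng.integers(0, 2, n)
y_pre = rng.normal(0.0, 1.0, n) + 1.5 * t     # untreated levels differ by group
theta0 = rng.normal(1.0, 0.5, n)              # untreated change, independent of t
delta = rng.normal(2.0, 1.0, n)               # treatment effects
y0_post = y_pre + theta0                      # parallel trends holds
y_post = np.where(t == 1, y0_post + delta, y0_post)
w = y_pre + t + rng.normal(0.0, 1.0, n)       # W caused by Y(0) and T

# W is correlated with both the treatment and the pre-period outcome ...
print(round(np.corrcoef(w, t)[0, 1], 2), round(np.corrcoef(w, y_pre)[0, 1], 2))
# ... yet diff-in-diff is still unbiased for the ATT.
att = delta[t == 1].mean()
did = (y_post[t == 1] - y_pre[t == 1]).mean() \
    - (y_post[t == 0] - y_pre[t == 0]).mean()
print(round(att, 2), round(did, 2))           # both close to 2
```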
The ATE is an estimand involving unseen potential outcomes and is defined as $E[Y^1-Y^0]$, where $Y^1$ and $Y^0$ are the potential outcomes under treatment and control. Under the usual causal identification assumptions (conditional exchangeability given $V$, positivity, and consistency), the ATE is equal to $E[E[Y|A = 1, V]-E[Y|A=0, V]]$, where $V$ is a valid adjustment set. Let's call $E[E[Y|A = 1, V]-E[Y|A=0, V]]$ the average marginal effect (AME); it has no causal interpretation except when the assumptions that make the ATE equal to the AME are satisfied. The AME is also an estimand, but estimating it doesn't require specific causal assumptions to be true. It is possible there are multiple sets $V$ that make the AME with respect to $V$ equal to the ATE.
When a model is parameterized in a certain way, it is possible for a parameter in that model to correspond to the AME under certain assumptions that link the model parameter to the estimand.
Consider the following estimands:
- $AME_{12} = E[E[Y|A = 1, X_1, X_2]-E[Y|A=0, X_1, X_2]]$
- $AME_2 = E[E[Y|A = 1, X_2]-E[Y|A=0, X_2]]$
Under DAG 1, $AME_{12}$ is equal to the ATE, and $AME_2$ is a confounded association between $A$ and $Y$. Under DAG 2, $AME_2$ is equal to the ATE, and $AME_{12}$ is the direct effect of $A$ on $Y$ not through $X_1$.
Suppose the true outcome model is linear in the covariates and treatment, with no interaction between the treatment and covariates (i.e., your first model perfectly describes the data-generating process, which is consistent with both DAG 1 and DAG 2). Under this assumption, in your first model, $\beta_{11}$ is equal to $AME_{12}$, and in your second model, $\beta_{21}$ is equal to $AME_2$.
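This link can be checked numerically. In the sketch below (simulated data, invented coefficients), the plug-in AME computed from a fitted linear no-interaction model coincides with the fitted coefficient on $A$, because the predicted treatment contrast is the same constant for every unit:

```python
import numpy as np

# Simulated data; all coefficients are invented. The true model is linear
# in (a, x1, x2) with no treatment-covariate interaction.
rng = np.random.default_rng(3)
n = 5_000
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
a = rng.integers(0, 2, n).astype(float)
y = 1.0 + 2.0 * a + 0.5 * x1 - 1.0 * x2 + rng.normal(size=n)

# OLS for the "first model": y ~ 1 + a + x1 + x2
X = np.column_stack([np.ones(n), a, x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# Plug-in AME_{12}: average prediction with a set to 1 minus a set to 0
X1 = X.copy(); X1[:, 1] = 1.0
X0 = X.copy(); X0[:, 1] = 0.0
ame12 = (X1 @ beta - X0 @ beta).mean()
print(round(beta[1], 6), round(ame12, 6))  # identical up to floating-point error
```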
So, under certain assumptions, a $\beta$ is equal to an AME, and under additional assumptions, the AME is equal to the ATE. So what quantity does $\hat{\beta}_{21}$, the OLS estimate from your second model, estimate? It estimates $\beta_{21}$. How you interpret that with respect to an estimand depends on the assumptions you make that link $\beta_{21}$ to the estimand you desire.
It is possible to estimate the AME using a different method, e.g., inverse probability weighting (IPW). IPW does not involve specifying a regression model for the outcome; therefore, the IPW estimand does not necessarily correspond to $\beta$ in any regression model. In this way, even if we aren't willing to make the assumptions that would link $\beta$ in some regression model to the AME, we can still use IPW to estimate the AME. This is important because we can describe the AME as an estimand separate from $\beta$, which hopefully clarifies that $\beta$ and the AME are not the same estimand except when specific assumptions link them. Similarly, IPW does not target $\beta$ except when $\beta$ is equal to the AME by virtue of the linking assumptions.
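A minimal sketch of the IPW idea, assuming for simplicity that the true propensity score is known (in practice it would be estimated, e.g., by logistic regression); the data-generating process and coefficients are invented:

```python
import numpy as np

# Hypothetical DGP with one confounder x; the true effect of A on Y is 2.
rng = np.random.default_rng(4)
n = 200_000
x = rng.normal(size=n)                        # confounder
p = 1.0 / (1.0 + np.exp(-x))                  # true propensity P(A = 1 | x)
a = (rng.uniform(size=n) < p).astype(float)
y = 2.0 * a + x + rng.normal(size=n)

# Horvitz-Thompson IPW estimate of the AME with V = {x};
# no outcome regression model is specified anywhere.
ipw = np.mean(a * y / p) - np.mean((1 - a) * y / (1 - p))
print(round(ipw, 2))                          # close to 2
```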
Let's wrap it up: the ATE, $AME_{12}$, $AME_2$, $\beta_{11}$ and $\beta_{21}$ are all potential estimands. The OLS estimator $\hat{\beta}_{21}$ is generally unbiased for $\beta_{21}$. Under certain assumptions, $\beta_{21}$ may be equal to $AME_2$. Under additional assumptions, $AME_2$ might be equal to the ATE. If these assumptions are all true, then you can say $\hat{\beta}_{21}$ is an unbiased estimator of the ATE. But, again, whether that is true depends on the assumptions linking each quantity to the next; some of those assumptions are encoded in the DAG and others in the form of the outcome model.
Omitted variable bias (OVB) is agnostic to the causal relationship between $X$ and $Z$. It concerns only the ability to estimate $\tau$ in the structural model for $Y$. The joint distribution of $Y$, $X$, and $Z$ is compatible both with a data-generating process in which $Z$ is a confounder of the $X \rightarrow Y$ relationship, so that $\tau$ represents the total effect of $X$ on $Y$, and with a data-generating process in which $Z$ is a mediator of the $X \rightarrow Y$ relationship, so that $\tau$ represents the direct effect of $X$ on $Y$.
In the confounding model, the data-generating process for $X$ and $Z$ is: $$ Z := \epsilon_Z \\ X := \gamma Z + \epsilon_X $$ In the mediation model, the data-generating process for $X$ and $Z$ is: $$ Z := \alpha X + \epsilon_Z \\ X := \epsilon_X $$
For the confounding process, omitting $Z$ from the model for $Y$ yields a biased estimate of $\tau$, the total effect of $X$ on $Y$. This is the classic bias due to an omitted confounder.
For the mediation process, the $X \rightarrow Y$ relationship is not confounded. The estimated coefficient $\hat \tau$ in the model omitting $Z$ is unbiased for the total causal effect of $X$ on $Y$. However, it is biased for $\tau$, the direct effect of $X$ on $Y$.
This is all to say that it's possible to have OVB without confounding if the coefficient you are trying to estimate is a direct effect, in which case omitting the mediator yields a biased estimate of this quantity. In the absence of confounding, the model omitting the mediator yields the total effect. The formula for the bias is the same regardless of the data-generating process of $X$ and $Z$, but the interpretation of the biased parameter depends on the causal relationship between $X$ and $Z$.
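A simulation can illustrate this last point (all structural coefficients invented; $\beta_Z$, the coefficient on $Z$ in the model for $Y$, is introduced here for illustration): in both data-generating processes, the short regression's slope matches $\tau$ plus the same omitted-variable-bias term $\beta_Z \operatorname{Cov}(X, Z)/\operatorname{Var}(X)$, even though what that biased coefficient means differs.

```python
import numpy as np

# Invented structural coefficients; beta_z is the coefficient on Z in the
# model for Y, and alpha in the mediation DGP is set equal to gamma.
rng = np.random.default_rng(5)
n = 200_000
tau, gamma, beta_z = 1.0, 0.8, 0.5

def ols_slope(x, y):
    """Coefficient on x in a simple regression of y on (1, x)."""
    X = np.column_stack([np.ones_like(x), x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Confounding DGP: Z -> X and Z -> Y
z_c = rng.normal(size=n)
x_c = gamma * z_c + rng.normal(size=n)
y_c = tau * x_c + beta_z * z_c + rng.normal(size=n)

# Mediation DGP: X -> Z and Z -> Y
x_m = rng.normal(size=n)
z_m = gamma * x_m + rng.normal(size=n)
y_m = tau * x_m + beta_z * z_m + rng.normal(size=n)

# In both cases the short regression's slope equals tau plus the same
# OVB term, beta_z * Cov(X, Z) / Var(X); only the interpretation differs.
results = {}
for name, x, z, y in [("confounding", x_c, z_c, y_c), ("mediation", x_m, z_m, y_m)]:
    slope = ols_slope(x, y)
    ovb = beta_z * np.cov(x, z)[0, 1] / np.var(x)
    results[name] = (slope, tau + ovb)
    print(name, round(slope, 2), round(tau + ovb, 2))
# In the mediation DGP the biased slope is the *total* effect, tau + beta_z * gamma.
```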