The main issue here is the nature of omitted-variable bias. Wikipedia states:
Two conditions must hold true for omitted-variable bias to exist in linear regression:
- the omitted variable must be a determinant of the dependent variable (i.e., its true regression coefficient is not zero); and
- the omitted variable must be correlated with one or more of the included independent variables (i.e. cov(z,x) is not equal to zero).
It's important to carefully note the second criterion. Your betas will only be biased under certain circumstances. Specifically, if there are two variables that contribute to the response that are correlated with each other, but you only include one of them, then (in essence) the effects of both will be attributed to the included variable, causing bias in the estimation of that parameter. So perhaps only some of your betas are biased, not necessarily all of them.
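To make this concrete, here is a minimal simulation sketch (the variable names and coefficients are invented for illustration): $Z$ contributes to the response and is correlated with the included $X$, so omitting $Z$ inflates the estimated coefficient on $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

z = rng.normal(size=n)
x = 0.7 * z + rng.normal(size=n)              # x is correlated with z
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)

# Full model: y ~ 1 + x + z
X_full = np.column_stack([np.ones(n), x, z])
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)

# Omitting z: y ~ 1 + x
X_omit = np.column_stack([np.ones(n), x])
beta_omit, *_ = np.linalg.lstsq(X_omit, y, rcond=None)

print(beta_full)   # roughly [1, 2, 3]
print(beta_omit)   # slope on x is pulled well above 2, because x proxies for z
```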
Another disturbing possibility is that if your sample is not representative of the population (which it rarely really is), and you omit a relevant variable, even if it's uncorrelated with the other variables, this could cause a vertical shift which biases your estimate of the intercept. For example, imagine a variable, $Z$, increases the level of the response, and that your sample is drawn from the upper half of the $Z$ distribution, but $Z$ is not included in your model. Then, your estimate of the population mean response (and the intercept) will be biased high despite the fact that $Z$ is uncorrelated with the other variables. Additionally, there is the possibility that there is an interaction between $Z$ and variables in your model. This can also cause bias without $Z$ being correlated with your variables (I discuss this idea in my answer here.)
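Here is a similarly minimal sketch of the intercept point (again with invented numbers): $Z$ is independent of $X$, but the sample is drawn only from the upper half of the $Z$ distribution and $Z$ is left out, so the fitted intercept absorbs $Z$'s contribution while the slope on $X$ is essentially unaffected.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

x = rng.normal(size=n)
z = rng.normal(size=n)                         # independent of x
y = 1.0 + 2.0 * x + 3.0 * z + rng.normal(size=n)

# Sample only units from the upper half of the z distribution, then omit z.
keep = z > 0
X_omit = np.column_stack([np.ones(keep.sum()), x[keep]])
beta_omit, *_ = np.linalg.lstsq(X_omit, y[keep], rcond=None)

# The slope on x is still about 2, but the intercept is about
# 1 + 3 * E[z | z > 0] = 1 + 3 * sqrt(2/pi), roughly 3.4, i.e. biased high.
print(beta_omit)
```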
Now, given that everything in the world is, at some level, ultimately correlated with everything else, we might find this all very troubling. Indeed, when doing observational research, it is best to always assume that every variable is endogenous.
There are, however, limits to this (cf. Cornfield's Inequality). First, conducting true experiments breaks the correlation between a focal variable (the treatment) and any otherwise relevant, but unobserved, explanatory variables. Second, there are statistical techniques that can be used with observational data to account for such unobserved confounds (prototypically, instrumental variables regression, but others as well).
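As a sketch of the first point only (randomization; instrumental variables are not shown), with invented coefficients: when treatment uptake depends on an unobserved $Z$, regressing the outcome on treatment alone is badly biased, whereas a randomized treatment is independent of $Z$ by construction and the same regression recovers the effect.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
z = rng.normal(size=n)                               # unobserved confounder

def treatment_slope(t, y):
    """Coefficient on t from a regression of y on [1, t], ignoring z."""
    X = np.column_stack([np.ones(len(t)), t])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

# Observational setting: z drives both treatment uptake and the outcome.
t_obs = (z + rng.normal(size=n) > 0).astype(float)
y_obs = 2.0 * t_obs + 3.0 * z + rng.normal(size=n)

# Experiment: treatment is randomized, hence independent of z.
t_exp = rng.integers(0, 2, size=n).astype(float)
y_exp = 2.0 * t_exp + 3.0 * z + rng.normal(size=n)

print(treatment_slope(t_obs, y_obs))   # well above the true effect of 2
print(treatment_slope(t_exp, y_exp))   # approximately 2
```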
Setting these possibilities aside (they probably do represent a minority of modeling approaches), what is the long-run prospect for science? This depends on the magnitude of the bias, and on the volume of exploratory research that gets done. Even if the numbers are somewhat off, they may often be in the neighborhood, and sufficiently close that relationships can be discovered. Then, in the long run, researchers can become clearer on which variables are relevant. Indeed, modelers sometimes explicitly trade off increased bias for decreased variance in the sampling distributions of their parameters (cf. my answer here). In the short run, it's worth always remembering the famous quote from Box:
All models are wrong, but some are useful.
There is also a potentially deeper philosophical question here: What does it mean to say that the estimate is biased? What is supposed to be the 'correct' answer? If you gather some observational data about the association between two variables (call them $X$ & $Y$), what you are getting is ultimately the marginal correlation between those two variables. This is only the 'wrong' number if you think you are doing something else, and getting the direct association instead. Likewise, in a study to develop a predictive model, what you care about is whether, in the future, you will be able to accurately guess the value of an unknown $Y$ from a known $X$. If you can, it doesn't matter if that's (in part) because $X$ is correlated with $Z$, which is contributing to the resulting value of $Y$. You wanted to be able to predict $Y$, and you can.
To prove this, suppose the true model is $Y = b_0 + b_1 X_1 + b_2 X_2 + b_3 X_3 + u$ but $X_3$ is omitted, and start from the probability limit of the OLS estimator. Let $X$ denote the matrix of included regressors, $[1,X_1,X_2]$, and let $e \equiv u + b_3 X_3$ be the error term of the misspecified regression. Also, let $b$ be the parameters we are trying to estimate, i.e. $b = (b_0,b_1,b_2)$.
\begin{align*}
p\lim \hat{\beta} &= p\lim \left[ (X'X)^{-1}X'Y \right]
\\ &= p\lim \left[ (X'X)^{-1}X'(Xb + e) \right]
\\ &= p\lim \left[ (X'X)^{-1}X'Xb \right] + p\lim \left[ (X'X)^{-1}X'e \right]
\\ &= p\lim \left[ (X'X)^{-1}X'X \right] b + p\lim \left[ (X'X)^{-1}X'(b_3 X_3 + u) \right]
\\ &= b + b_3 \, p\lim \left[ (X'X)^{-1}X' X_3 \right] + p\lim \left[ (X'X)^{-1}X'u \right]
\\ &= b + b_3 \, p\lim \left[ (X'X)^{-1}X' X_3 \right]
\\ &= b + b_3 \, [\mathbb{E}(X'X)]^{-1} \mathbb{E}(X' X_3)
\end{align*}
Above, a key step is of course that $p\lim \left[ (X'X)^{-1}X'u \right] = 0$, which happens because
$$ p\lim \left[ (X'X)^{-1}X'u \right] = \left[ p\lim (X'X) \right]^{-1} p\lim (X'u) = [\mathbb{E}(X'X)]^{-1} \, \mathbb{E}(X'u) = 0, $$
since the original assumption is that each of the included regressors is uncorrelated with $u$ (though not necessarily with $e$), so that $\mathbb{E}(X'u) = 0$.
Now we see that $p\lim \hat{\beta} \ne b$ whenever $\mathbb{E}(X'X_3) \ne 0$, that is, whenever $X_3$ is correlated with $X_1$ or with $X_2$ (if $X_3$ is uncorrelated with both but merely has a nonzero mean, only the intercept estimate is shifted).
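As a numerical sanity check of this formula (numbers purely illustrative), note that the sample analogue of $[\mathbb{E}(X'X)]^{-1}\mathbb{E}(X'X_3)$ is simply the coefficient vector from the auxiliary regression of $X_3$ on the included regressors:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500_000
b = np.array([1.0, 2.0, -1.5])              # (b0, b1, b2)
b3 = 0.8

x1 = rng.normal(size=n)
x3 = 0.6 * x1 + rng.normal(size=n)          # X3 is correlated with X1
x2 = rng.normal(size=n)                     # X2 is uncorrelated with X3
u = rng.normal(size=n)
y = b[0] + b[1] * x1 + b[2] * x2 + b3 * x3 + u

X = np.column_stack([np.ones(n), x1, x2])   # X3 is omitted

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Sample analogue of b + b3 * [E(X'X)]^{-1} E(X'X3): the auxiliary
# regression of X3 on the included regressors.
gamma, *_ = np.linalg.lstsq(X, x3, rcond=None)
predicted = b + b3 * gamma

print(beta_hat)    # roughly [1, 2.48, -1.5]: only the coefficient on X1 is distorted
print(predicted)   # matches beta_hat closely
```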
It is best to separate the estimation problem from identification of the parameter of interest. When we use diff-in-diff, we want to estimate an average effect. The first step is to show that this average effect is identified (that is, calculable from data that we observe). The second is to construct an estimator that estimates the average effect without bias. I present an identification argument that spells out the assumptions needed for diff-in-diff to be unbiased.
Take causal inference in one time period. The goal is to identify the average treatment effect (ATE) or the average treatment effect on the treated (ATT). In the potential outcomes notation, $T_i \in \{0, 1\}$ is unit $i$'s treatment assignment, $Y_i(1)$ is unit $i$'s outcome under treatment, $Y_i(0)$ is its outcome without treatment, and $Y_i(T_i)$ is its observed outcome. The treatment effect on unit $i$ is $Y_i(1) - Y_i(0) \equiv \delta_i$. The ATE is $\mathbb{E}(\delta_i)$ and the ATT is $\mathbb{E}(\delta_i \mid T_i = 1)$.
For the difference in mean observed outcomes to identify these parameters, the potential outcomes need to be independent of treatment assignment: the ATT is identified if $Y_i(0) \perp T_i$, and the ATE is identified if, in addition, $Y_i(1) \perp T_i$. Under these assumptions the ATE is also equal to the ATT.
Diff-in-diff relaxes the independence assumption on $Y_i(0)$ by working in a two-period setting. With two periods, the potential outcomes are $Y_{it}(1)$ and $Y_{it}(0)$ for time periods $t \in \{0, 1\}$, and the observed outcomes are $Y_{it} \equiv Y_{it}(T_i)$. Let the change in the outcome from the first to the second period be $\theta_i(1) \equiv Y_{i1}(1) - Y_{i0}(1)$ if unit $i$ is assigned the treatment, and $\theta_i(0) \equiv Y_{i1}(0) - Y_{i0}(0)$ if it is not. The treatment effect on unit $i$ is $\delta_i \equiv \theta_i(1) - \theta_i(0)$. Diff-in-diff assumes that $\theta_i(0) \perp T_i$ so that the change in the absence of treatment is the same for the treated units as it is for the untreated. This is called the parallel trends assumption.
Parallel trends are sufficient to identify the ATT because we can calculate $\mathbb{E}(Y_{i1} - Y_{i0} \mid T_i = 1) - \mathbb{E}(Y_{i1} - Y_{i0} \mid T_i = 0)$ from the data and it is equal to
\begin{align*}
&\mathbb{E}(Y_{i1}(1) - Y_{i0}(1) \mid T_i = 1) - \mathbb{E}(Y_{i1}(0) - Y_{i0}(0) \mid T_i = 0) \\
&\quad = \mathbb{E}(\theta_i(1) \mid T_i = 1) - \underbrace{\mathbb{E}(\theta_i(0) \mid T_i = 0)}_{= \, \mathbb{E}(\theta_i(0) \mid T_i = 1) \;\text{by parallel trends}} \\
&\quad = \mathbb{E}(\theta_i(1) - \theta_i(0) \mid T_i = 1) \\
&\quad = \mathbb{E}(\delta_i \mid T_i = 1).
\end{align*}
Note that diff-in-diff does allow $\theta_i(1) \not\perp T_i$, so the change under treatment can be different for treated units than for the untreated. It also allows $Y_{i1}(1) - Y_{i1}(0) \not\perp T_i$, so the treatment effect in the second period can be different for treated units than for the untreated.
If we also assumed $\theta_i(1) \perp T_i$, then the ATE, $\mathbb{E}(\delta_i)$, would be identified as well, though in that case the ATE and the ATT would again be equal.
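Here is a small simulation sketch of this identification argument (all parameters invented): treated units are selected on their outcome levels and have a larger treatment effect, so $\theta_i(1) \not\perp T_i$, but the untreated change is independent of $T_i$. The diff-in-diff contrast then recovers the ATT, while a naive comparison of second-period outcomes does not.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000

alpha = rng.normal(size=n)                          # unit-specific level
t = (alpha + rng.normal(size=n) > 0).astype(float)  # selection on levels
theta0 = 1.0 + rng.normal(size=n)                   # untreated change, independent of t
delta = 2.0 + 0.5 * t                               # heterogeneous effect: ATT = 2.5

y0 = alpha + rng.normal(size=n)                     # period-0 outcome
y1 = y0 + theta0 + delta * t                        # period-1 outcome

did = (np.mean(y1[t == 1] - y0[t == 1])
       - np.mean(y1[t == 0] - y0[t == 0]))
naive = np.mean(y1[t == 1]) - np.mean(y1[t == 0])   # ignores selection on levels

print(did)     # approximately 2.5 = ATT
print(naive)   # far from 2.5, because treated units differ in levels
```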
Finding an omitted variable that explains both the outcome and the treatment assignment is a sign that parallel trends might be violated. But this need not be the case. For example, take $W_{it}$ as the omitted variable and suppose that $W_{i0} \not\perp T_i$ and $W_{i0} \not\perp Y_{i0}(0)$, and even that $W_{i0} \not\perp \theta_i(0)$. These dependencies still do not imply that $T_i \not\perp \theta_i(0)$, so the ATT might still be identified and diff-in-diff might still be unbiased. This would be the case, for instance, if the direction of causality ran from $Y_{it}(0)$ to $W_{i0}$ and from $T_i$ to $W_{i0}$, and not the other way round.
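And a constructed example of that last scenario (everything here is hypothetical): treatment assignment depends on the baseline outcome, and $W_{i0}$ is generated from $Y_{i0}(0)$ and $T_i$, so it is correlated with both; yet $\theta_i(0) \perp T_i$ holds and diff-in-diff remains unbiased.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 200_000

y0 = rng.normal(size=n)                              # baseline outcome
t = (y0 + rng.normal(size=n) > 0).astype(float)      # assignment depends on the level
theta0 = 1.0 + rng.normal(size=n)                    # untreated change, independent of t
delta = 2.0
y1 = y0 + theta0 + delta * t

# The omitted variable is a *consequence* of the baseline outcome and the treatment.
w = y0 + t + rng.normal(size=n)

did = (np.mean(y1[t == 1] - y0[t == 1])
       - np.mean(y1[t == 0] - y0[t == 0]))
print(did)                       # approximately 2: parallel trends is intact
print(np.corrcoef(w, t)[0, 1])   # clearly nonzero
print(np.corrcoef(w, y0)[0, 1])  # clearly nonzero
```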
In practice we would think about what our theory tells us about the direction of causality. If the omitted variable were a consequence of the treatment, we would not be concerned. But we would be concerned if the omitted variable influenced treatment assignment instead, or if a third factor influenced both the omitted variable and treatment assignment.