When treatment status depends on (fully!) observed covariates, the parallel trends assumption concerns the conditional rather than the unconditional pre-treatment trends. If you were to state the identifying assumption in a paper, you would write something like:
> The identifying assumption is that pre-treatment trends in the outcome
> for treated and untreated households are parallel conditional on
> observables that determine treatment status. This means that the
> outcomes of treated and control households with similar characteristics
> would have continued to evolve in parallel in the post-treatment period
> in the absence of the treatment.
With only one pre-treatment period, though, it is somewhat odd to speak of pre-treatment trends: you cannot infer a trend from a single data point. The required condition is still the one quoted above. In your case you are conditioning on the baseline-period ($t=0$) characteristics.
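To make the difference between conditional and unconditional parallel trends concrete, here is a minimal simulation sketch in Python. Everything in it is an assumption made for illustration and not taken from your data: a single binary baseline characteristic `x`, treatment probabilities of 0.7 and 0.3, an untreated trend that depends only on `x`, and a true effect of 2.0.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 400_000

# Hypothetical binary baseline (t=0) characteristic that drives treatment status.
x = rng.integers(0, 2, size=n)
a = rng.binomial(1, np.where(x == 1, 0.7, 0.3))

# The untreated trend depends on x only, so trends are parallel
# conditional on x but not unconditionally.
y_pre = 1.0 + x + rng.normal(size=n)
trend = 0.5 + 1.0 * x
y_post = y_pre + trend + 2.0 * a + rng.normal(size=n)  # assumed true effect: 2.0


def did(mask):
    """Difference in mean outcome changes, treated minus control, within mask."""
    change = y_post - y_pre
    return change[mask & (a == 1)].mean() - change[mask & (a == 0)].mean()


# Unconditional DiD ignores x and absorbs the difference in trends.
unconditional = did(np.ones(n, dtype=bool))

# Conditional DiD: one DiD per stratum of x, weighted by the
# distribution of x among the treated.
strata = (0, 1)
weights = [np.mean(x[a == 1] == v) for v in strata]
conditional = sum(w * did(x == v) for w, v in zip(weights, strata))

print(f"unconditional DiD: {unconditional:.3f}")
print(f"conditional DiD:   {conditional:.3f}")
```

In this setup the unconditional estimate is off by about 0.4 (the trend difference of 1.0 times the gap of 0.7 - 0.3 in the share with `x = 1`), while the stratified estimate is close to the assumed effect. With continuous or many covariates you would use regression adjustment or matching instead of exact strata.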
Typically it is very difficult to sell a difference-in-differences framework if you cannot show these trends, but if this is for an assignment or thesis work, it should be sufficient to demonstrate that you are aware of the problem. Even though the question does not ask for it, I want to point out one or two more things that you might want to think about:
- migration is a decision of the household and I highly doubt that this decision is fully captured by observables; in this case you might want to instrument the treatment status with some policy change (or another valid instrument) that affects the migration decision but not the educational outcomes
- depending on the country you are looking at, education is bounded below by minimum school leaving ages. If your treatment has a negative effect on years of education, you may not be able to detect this causal effect, because observed educational outcomes cannot go any lower in response to treatment under compulsory schooling laws
As a minor point you also may want to revisit the subscripts for mother's and father's education. Typically those do not change anymore after people have completed their formal education, i.e. they shouldn't have a time subscript.
The contrast (i.e., estimand) of interest in diff-in-diff is $\color{red}{E[Y^1_{post}|A=1]} - \color{blue}{E[Y^0_{post}|A=1]}$, which relies on the unobserved quantity $\color{blue}{E[Y^0_{post}|A=1]}$. How can we get this quantity if it is unobserved?
The parallel trends assumption is a counterfactual assumption about $\color{blue}{E[Y^0_{post}|A=1]}$, the mean potential outcome in the post-period for the treated units had they instead received control. The assumption can be stated as follows:
$$\color{blue}{E[Y^0_{post}|A=1]}-\color{green}{E[Y^0_{pre}|A=1]} = \color{darkorange}{E[Y^0_{post}|A=0]}-\color{brown}{E[Y^0_{pre}|A=0]}$$
The quantity on the left is the trend in the potential outcomes under control (i.e., difference between outcomes post and pre) for the treated units, and the right side is the trend in the potential outcomes under control for the control units. The parallel trend assumption states that these two trends are equal (i.e., parallel if plotted). See the graph below, which colors the dots corresponding to the quantities they represent:
The dotted line represents the counterfactual trend under control for the treated units. The solid lines represent the observed trends. The parallel trends assumption is that the dotted line is parallel with the bottom solid line.
The assumption is fundamentally untestable because there is no data for $\color{blue}{E[Y^0_{post}|A=1]}$; for the treated units in the post-period, we only observe their potential outcomes under treatment (i.e., $\color{red}{E[Y^1_{post}|A=1]} = \color{red}{E[Y_{post}|A=1]}$).
It is important to note that the terms on the right side are observed; they are simply the observed outcome means in the control group before and after treatment. We still don't have $\color{green}{E[Y^0_{pre}|A=1]}$; to get this, we need the assumption
$$
\color{green}{E[Y^0_{pre}|A=1]} = \color{green}{E[Y^1_{pre}|A=1]}
$$
That is, the pre-period outcomes don't depend on the treatment you end up receiving (i.e., because the future can't affect the past). This quantity is also observed; it's just the average outcome in the treated group in the pre-period.
So now, thanks to the parallel trends assumption, we can write
\begin{align}
\color{blue}{E[Y^0_{post}|A=1]} &= \color{green}{E[Y^0_{pre}|A=1]} + \color{darkorange}{E[Y^0_{post}|A=0]} - \color{brown}{E[Y^0_{pre}|A=0]} \\
&= \color{green}{E[Y^1_{pre}|A=1]} + \color{darkorange}{E[Y^0_{post}|A=0]} - \color{brown}{E[Y^0_{pre}|A=0]} \\
&= \color{green}{E[Y_{pre}|A=1]}+\color{darkorange}{E[Y_{post}|A=0]}-\color{brown}{E[Y_{pre}|A=0]}
\end{align}
where the last line is made up solely of observed quantities.
Finally, we can write the counterfactual estimand as
\begin{align*}
\color{red}{E[Y^1_{post}|A=1]} - \color{blue}{E[Y^0_{post}|A=1]} &= \color{red}{E[Y_{post}|A=1]} - \\
& \qquad (\color{green}{E[Y_{pre}|A=1]} + \color{darkorange}{E[Y_{post}|A=0]} - \color{brown}{E[Y_{pre}|A=0]}) \\
&= (\color{red}{E[Y_{post}|A=1]} - \color{green}{E[Y_{pre}|A=1]})- \\
& \qquad (\color{darkorange}{E[Y_{post}|A=0]}-\color{brown}{E[Y_{pre}|A=0]})
\end{align*}
which is precisely the diff-in-diff observed variables estimand. That is, to be able to write the counterfactual estimand as a contrast among observed quantities, we need the parallel trends assumption because it links the counterfactual quantities to the observed quantities. It is an essential assumption for diff-in-diff and the whole motivation behind the methodology. In theory it's a much more plausible assumption than strong ignorability or the exclusion restriction for instrumental variables, which is why diff-in-diff is such a powerful method.
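The derivation can be checked numerically. The following is a minimal sketch in Python under assumed values (a level gap of 3.0 between groups, a common untreated trend of 1.5, and a constant effect of 2.0, none of which come from the question). Because the data are simulated, both potential outcomes are available, so the counterfactual estimand can be compared directly with the observed-means contrast.

```python
import numpy as np


def simulate_did(n=200_000, att=2.0, seed=0):
    """Simulate potential outcomes under parallel trends and no anticipation."""
    rng = np.random.default_rng(seed)
    a = rng.integers(0, 2, size=n)  # treatment group indicator A

    # Untreated potential outcomes: the groups differ in level (3.0 * a)
    # but share the same expected change of 1.5 between pre and post.
    y0_pre = 1.0 + 3.0 * a + rng.normal(size=n)
    y0_post = y0_pre + 1.5 + rng.normal(size=n)
    y1_post = y0_post + att  # treated potential outcome in the post-period

    # Observed outcomes: no anticipation in the pre-period,
    # treated units reveal Y^1 in the post-period, controls reveal Y^0.
    y_pre = y0_pre
    y_post = np.where(a == 1, y1_post, y0_post)

    # Counterfactual estimand E[Y^1_post | A=1] - E[Y^0_post | A=1].
    true_att = (y1_post - y0_post)[a == 1].mean()

    # Diff-in-diff of observed means.
    did = (y_post[a == 1].mean() - y_pre[a == 1].mean()) - (
        y_post[a == 0].mean() - y_pre[a == 0].mean()
    )

    # Naive post-period comparison, which picks up the level gap as well.
    naive = y_post[a == 1].mean() - y_post[a == 0].mean()
    return true_att, did, naive


true_att, did, naive = simulate_did()
print(f"ATT: {true_att:.3f}  DiD: {did:.3f}  naive: {naive:.3f}")
```

The DiD contrast lands close to the ATT even though the groups differ in levels, whereas the naive post-period difference is off by roughly the level gap. If you change the simulation so that the untreated trend differs by group, the DiD contrast becomes biased by exactly that trend difference.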
For your first point, plotting the average of the outcome for the treatment and control groups over time is the right thing to do in order to see the unconditional evolution of the outcomes in both groups. Your statement that you are essentially using between-player variation is not correct, though. When using difference-in-differences (DiD) you are essentially comparing group averages. In a simple setting with one pre- and one post-treatment period you can compute the DiD coefficient as
$$ \delta_{did} = \left[ E(y_{it}|g=1,t=1)-E(y_{it}|g=1,t=0) \right] - \left[ E(y_{it}|g=0,t=1)-E(y_{it}|g=0,t=0) \right] $$
where $g=1$ is the treatment group and $t=1$ is the post-treatment period (see here for further explanation). So, given that the computation of the treatment effect ultimately happens at the group level, plotting the group averages over time is a good indication of whether the parallel trends assumption holds.
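As a quick check that the DiD coefficient really is just a contrast of four group averages, here is a small Python sketch on simulated data (the coefficients 0.5, 1.0, 0.7 and 1.2 are made up for illustration). It computes $\delta_{did}$ once from the cell means and once as the interaction coefficient in an OLS regression of $y$ on $g$, $t$ and $g \cdot t$.

```python
import numpy as np


def did_from_means(y, g, t):
    """DiD computed from the four group-by-period sample means."""
    def cell_mean(gi, ti):
        return y[(g == gi) & (t == ti)].mean()

    return (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))


def did_from_ols(y, g, t):
    """DiD as the coefficient on g*t in y = b0 + b1*g + b2*t + b3*g*t + e."""
    X = np.column_stack([np.ones_like(y), g, t, g * t])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[3]


# Simulated repeated cross-section with an assumed DiD effect of 1.2.
rng = np.random.default_rng(1)
n = 4000
g = rng.integers(0, 2, size=n)  # 1 = treatment group
t = rng.integers(0, 2, size=n)  # 1 = post-treatment period
y = 0.5 + 1.0 * g + 0.7 * t + 1.2 * g * t + rng.normal(size=n)

print(did_from_means(y, g, t), did_from_ols(y, g, t))
```

The two numbers agree up to floating-point error because the regression is saturated in group and period, so its fitted values are exactly the four cell means.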
For your second question, the procedure in the answer you linked is actually very similar to including placebo treatments. Suppose you have time periods $t = 1, 2,...,k,...,T$, where the treatment happens between $k$ and $k+1$ (so time $k$ is your last pre-treatment period). In your setting, you could run the following regression: $$ y_{it} = \text{individual FE}_i + \text{time FE}_t + \sum_{j\neq k} \delta_j \left( \text{treatmentgroup}_i \cdot I(t=j) \right) + X_{it}'\gamma + \epsilon_{it} $$
i.e. you are interacting the treatment group indicator with time dummies for all periods except for period $k$ (because you need to leave one interaction out as otherwise there will be perfect multicollinearity) which are the $I(\cdot)$ terms in the regression equation.
Then all the $\delta_j$ with $j<k$ are placebo tests for whether the treatment had an effect on the outcome gap between the two groups before it took place. These coefficients should be close to zero: if the treatment appears to have an effect before it even occurs, this casts doubt on the parallel trends assumption. Plotting these coefficients is basically the "conditional" outcome distribution plot, as compared to plotting the unconditional outcome evolution over time as discussed for point 1.
The nice thing is that the $\delta_j$ coefficients with $j>k$ then show you how the treatment effect evolves over time, i.e. how long it takes to fade away or whether it is persistent. An example of what such a coefficient plot looks like is shown below. There is a nice command available in Stata (and I think also in R) called `coefplot` which does this for you. Here the omitted time period is 1940, which is the last pre-treatment period. None of the coefficients before it are statistically significant, and afterwards you see a permanent effect of the treatment (in this particular case).
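If you want to see the mechanics of the regression with the treatment-group-by-period interactions without Stata, here is a minimal Python sketch using only NumPy. It is a sketch under several assumptions: a balanced panel, no covariates $X$, periods indexed from 0 instead of 1, and simulated data with a made-up permanent effect of 1.0. The function name `event_study` is hypothetical. Individual fixed effects are absorbed by demeaning within units, and no standard errors are computed; for those you would want clustered standard errors from a proper regression package.

```python
import numpy as np


def event_study(y, treated, k):
    """Estimate delta_j for all periods j != k in a balanced panel.

    y: array of shape (n_units, n_periods); treated: 0/1 array of shape
    (n_units,); k: index of the omitted last pre-treatment period.
    """
    n_units, n_periods = y.shape
    periods = [j for j in range(n_periods) if j != k]

    cols = []
    for j in periods:  # time fixed effects, period k is the base
        dummy = np.zeros((n_units, n_periods))
        dummy[:, j] = 1.0
        cols.append(dummy)
    for j in periods:  # treatment group x period interactions
        inter = np.zeros((n_units, n_periods))
        inter[:, j] = treated
        cols.append(inter)
    X = np.stack(cols, axis=-1)  # shape (n_units, n_periods, 2 * (T - 1))

    # Within transformation: subtracting unit means absorbs individual FE.
    Xw = X - X.mean(axis=1, keepdims=True)
    yw = y - y.mean(axis=1, keepdims=True)

    beta, *_ = np.linalg.lstsq(Xw.reshape(-1, X.shape[-1]), yw.reshape(-1), rcond=None)
    return dict(zip(periods, beta[len(periods):]))


# Simulated panel: treatment starts after period k = 3, assumed effect 1.0.
rng = np.random.default_rng(2)
N, T, k = 2000, 8, 3
treated = rng.integers(0, 2, size=N)
unit_fe = rng.normal(size=(N, 1)) + 2.0 * treated[:, None]
time_fe = np.linspace(0.0, 1.4, T)[None, :]
effect = np.where(np.arange(T) > k, 1.0, 0.0)[None, :] * treated[:, None]
y = unit_fe + time_fe + effect + rng.normal(scale=0.5, size=(N, T))

deltas = event_study(y, treated, k)
for j, d in deltas.items():
    print(f"period {j}: delta = {d: .3f}")
```

In this simulation the coefficients for periods before $k$ are close to zero (the placebo tests pass by construction) and those after $k$ are close to 1.0. Plotting `deltas` against the period index, with confidence intervals added, gives the kind of coefficient plot described above.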