You basically have to create a wide-format dataset with all the characteristics that are relevant for the matching procedure, perform the matching on this cross-sectional dataset, and then use the ID to identify the matched pairs in the panel dataset. Here are some more details:
1. Use `reshape` to create a wide-format dataset. Format the pre-treatment variables the way you want to use them in the matching procedure. You can simply take the average of your variables if you have multiple observations per individual, but you can also come up with other ways (for example, keep multiple observations of the same variable, such as health1 and health2, and use all of them in the matching). The goal is a dataset with one observation per individual.
2. Using this dataset, perform the matching procedure with `psmatch2`.
3. Merge the information about the matched cases back into the original panel dataset, drop unmatched cases, etc. I am not sure about the details here because I don't really know Stata and `psmatch2`, but I think you get the idea.
Using these steps, you can match cases based on all pre-treatment information and you only have one match per treatment unit.
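A minimal Python/pandas sketch of the three steps above (the answer is about Stata's `reshape` and `psmatch2`; everything here is an illustrative assumption — the column names `id`, `period`, `health`, the synthetic data, and matching on the raw covariate instead of an estimated propensity score):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical panel: 40 individuals, 3 pre-treatment periods each.
n = 40
panel = pd.DataFrame({
    "id": np.repeat(np.arange(n), 3),
    "period": np.tile([1, 2, 3], n),
})
panel["health"] = rng.normal(size=len(panel))
treat = rng.integers(0, 2, size=n)          # individual-level treatment flag
panel["treated"] = treat[panel["id"]]

# Step 1: reshape to one row per individual (here: average the
# pre-treatment variable; keeping health1/health2/health3 would also work).
wide = panel.groupby("id").agg(health=("health", "mean"),
                               treated=("treated", "first")).reset_index()

# Step 2: matching on the cross-section. For simplicity, 1-nearest-neighbor
# on the covariate itself rather than an estimated propensity score.
treated = wide[wide["treated"] == 1]
controls = wide[wide["treated"] == 0]
dist = np.abs(treated["health"].values[:, None]
              - controls["health"].values[None, :])
match_idx = controls["id"].values[dist.argmin(axis=1)]
pairs = pd.DataFrame({"id": treated["id"].values, "match_id": match_idx})

# Step 3: merge the match information back into the panel via the ID
# and drop unmatched controls.
matched_ids = set(pairs["id"]) | set(pairs["match_id"])
matched_panel = panel[panel["id"].isin(matched_ids)]
```

In `psmatch2` the same bookkeeping is done for you via the `_id` and `_n1` variables it generates, which you can then merge back into the panel.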
All matching estimators for the treatment on the treated effect can be written in the form
$$ \frac{1}{n_T} \sum_{i \in \{d_i=1\}} \left[ y_{1i} - \sum_{j \in \{d_j = 0 \}} w_{ij} \cdot y_{0j} \right] ,$$
where $w_{ij}$ is the weight placed on the $j$th untreated observation as a counterfactual for the $i$th treated observation, and $n_T$ is the number of treated persons. The weights satisfy $\sum_j w_{ij}=1$ for all $i$.
Effectively, from each treated observation $i$, you subtract a weighted average of the control observations. Then you take the average of these differences. These weights are specific to observation $i$. Different matching estimators differ in how they construct the weights.
For example, nearest-neighbor matching sets the weight to 1 for the single untreated observation closest to $i$ in terms of the propensity score and to 0 for all others. $k$-NN matching instead uses the $k$ closest neighbors, each with weight $1/k$.
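A numeric sketch of the general formula above with nearest-neighbor weights (all propensity scores and outcomes here are made-up numbers):

```python
import numpy as np

ps_t = np.array([0.6, 0.7])         # propensity scores, treated
y1 = np.array([5.0, 6.0])           # outcomes, treated
ps_c = np.array([0.55, 0.72, 0.4])  # propensity scores, controls
y0 = np.array([3.0, 4.0, 2.0])      # outcomes, controls

att = 0.0
for i in range(len(ps_t)):
    w = np.zeros(len(ps_c))
    w[np.abs(ps_c - ps_t[i]).argmin()] = 1.0  # nearest control gets weight 1
    att += y1[i] - (w * y0).sum()             # y_1i minus weighted control outcome
att /= len(ps_t)                              # average over the n_T treated units
print(att)  # → 2.0 (treated 1 matches control 1, treated 2 matches control 2)
```

Swapping in a different weight rule (uniform over $k$ neighbors, kernel weights, etc.) gives the other estimators without changing anything else.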
Interval matching consists of dividing the range of propensity scores into a fixed number of intervals (which need not be of equal length). An interval-specific estimate is obtained by taking the difference between the mean outcomes of the treated and untreated units in each interval.
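Interval matching can be sketched as follows (two intervals, made-up data; the interval-specific differences are averaged with weights proportional to the number of treated units per interval, one common convention for the treatment-on-the-treated effect):

```python
import numpy as np

ps = np.array([0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85])  # propensity scores
d  = np.array([1, 0, 1, 0, 1, 0, 1, 0])                          # treatment indicator
y  = np.array([5.0, 3.0, 6.0, 4.0, 7.0, 5.0, 8.0, 6.0])          # outcomes

edges = np.array([0.0, 0.5, 1.0])  # two intervals covering the PS range
att, n_treated = 0.0, 0
for lo, hi in zip(edges[:-1], edges[1:]):
    in_bin = (ps >= lo) & (ps < hi)
    t, c = in_bin & (d == 1), in_bin & (d == 0)
    if t.any() and c.any():  # skip intervals without both groups
        att += t.sum() * (y[t].mean() - y[c].mean())
        n_treated += t.sum()
att /= n_treated
print(att)  # → 2.0
```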
Radius/caliper matching takes the mean outcome of the untreated units whose propensity scores lie within a fixed radius of each treated unit's as the estimated expected counterfactual. You pick the radius (the caliper).
Kernel matching uses weights that decline with the propensity-score distance. You can think of kernel matching as running, for each treated observation, a weighted regression of the outcome on just an intercept using the comparison-group data. Here you have to pick the kernel and the bandwidth. A larger bandwidth means that more distant observations get larger weights.
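A small sketch of kernel-matching weights for a single treated unit (Gaussian kernel; the bandwidth `h` is the tuning choice mentioned above, and all numbers are made up):

```python
import numpy as np

def kernel_weights(ps_i, ps_controls, h):
    # Gaussian kernel in propensity-score distance, normalized to sum to 1.
    k = np.exp(-0.5 * ((ps_controls - ps_i) / h) ** 2)
    return k / k.sum()

ps_c = np.array([0.2, 0.5, 0.8])            # control propensity scores
w_narrow = kernel_weights(0.5, ps_c, h=0.1)  # small bandwidth
w_wide = kernel_weights(0.5, ps_c, h=1.0)    # large bandwidth
# With the larger bandwidth, the distant controls (0.2 and 0.8)
# receive relatively larger weights:
print(w_narrow, w_wide)
```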
Local linear matching is very similar, but also includes a linear term in the propensity score. Some people also include higher-order polynomial terms.
Finally, you have inverse probability weighting. The basic idea is that you can recover the expected untreated outcome (in either the treated population or the full population) by reweighting the observed control outcomes using the treatment probabilities.
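A simulation sketch of that reweighting idea: the mean untreated outcome *in the treated population* is estimated from the controls alone by weighting each control by the odds $p/(1-p)$ of its propensity score. The propensity scores are assumed known here (in practice they are estimated), and the data-generating process is made up:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-x))      # propensity score P(D=1 | x)
d = rng.random(n) < p         # treatment assignment
y0 = x + rng.normal(size=n)   # untreated outcome, depends on the confounder x

# IPW estimate of E[Y0 | D=1] using only the controls:
w = p[~d] / (1 - p[~d])
ipw = np.sum(w * y0[~d]) / np.sum(w)

# Simulation-only benchmark: the treated units' own (normally unobserved) Y0.
direct = y0[d].mean()
print(ipw, direct)  # the two should be close
```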
There are some guidelines about how to pick a method here.
There is a list of software and packages that can do matching here. Stata now also has native PSM estimators. In my experience, replicating the output by hand is often very hard once you go past the simplest estimators. However, you can also find worked examples with output for all of these online, so even if you don't have the software, they will give you a useful benchmark, since you can usually track down the data.
Best Answer
Saying the propensity score matching estimated quantity is a counterfactual does not make it counterfactual.
According to Pearl, the ATT is a counterfactual estimand. This means you cannot find it without knowing the ATE (the causal effect) first. Pearl's way to find a counterfactual is to find the ATE in an SCM setting, replace the treatment with the alternative treatment (e.g., control) in the equations, and recalculate the outcome from the equations to find the counterfactual effect; i.e., you find the residuals from the causal estimation and use them in the counterfactual.
I think predicting the ATT directly from the data using propensity score methods amounts to arbitrarily choosing controls that are not identical to the treated units; they are just similar, and we do not know how similar. They are similar in the confounding causes, but not in the other, unknown causes of the outcome.
A safer method is to find the ATE using propensity score matching and then use it to find the ATT in an independent step using an SCM. To me this is more logical. See Pearl's Primer for more details.