This is a complicated question. Simple nearest neighbor matching pairs each observation in the treatment group with a single person in the control group who has a similar propensity score. You then compute the difference in outcome $Y$ for each pair and calculate the mean difference across pairs: that's your treatment effect. However, it is also possible to match each treated person with multiple untreated folks. Matching on additional nearest neighbors increases the bias, because the next best matches are necessarily worse matches, but decreases the variance, because more information is being used to construct the counterfactual for each treated person. Different matching estimators differ in how they weight the neighbor(s) when calculating this difference.
One important question is whether you can pair the same control group person with more than one treated person, essentially recycling them. Matching without replacement can yield very bad matches if there are few comparison observations comparable to the treated ones. It keeps the variance low at the cost of potential bias, while matching with replacement keeps the bias low at the cost of a larger variance, since you are using the same information over and over. That is another trade-off.
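The single-nearest-neighbor estimator described above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes the propensity scores have already been estimated, and the function name and toy data are made up.

```python
import numpy as np

def nn_match_att(ps_treated, ps_control, y_treated, y_control, replacement=True):
    """ATT by 1-nearest-neighbor propensity score matching: pair each
    treated unit with the control whose score is closest, then average
    the pairwise outcome differences."""
    ps_treated, y_treated = np.asarray(ps_treated), np.asarray(y_treated)
    ps_control, y_control = np.asarray(ps_control), np.asarray(y_control)
    available = np.ones(len(ps_control), dtype=bool)  # used only without replacement
    diffs = []
    for ps_i, y_i in zip(ps_treated, y_treated):
        dist = np.abs(ps_control - ps_i)
        if not replacement:
            dist = np.where(available, dist, np.inf)  # exclude controls already used
        j = int(np.argmin(dist))
        available[j] = False
        diffs.append(y_i - y_control[j])
    return float(np.mean(diffs))
```

Switching `replacement` to `False` is all it takes to move along the bias–variance trade-off discussed above.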
But I digress. Here are some ways to do propensity score matching, in increasing order of complexity:
- The simplest form of matching uses only the one control observation with the closest propensity score (with or without replacement) and calculates the mean difference across all pairs.
- Another strategy is to divide $ps(X)$ into $S$ buckets or intervals. For example, say you have some treated observations with $ps(X)$ between 0.3 and 0.4. Then you take all the control group folks with scores between 0.3 and 0.4 and use their average $Y$ as the counterfactual. The total treatment effect is $\sum_{s}(\bar{Y}_{s,T=1}-\bar{Y}_{s,T=0})\cdot w_{s}$, where $\bar{Y}_{s,T=1}$ and $\bar{Y}_{s,T=0}$ are the mean outcomes of the treated and control units in bucket $s$, and $w_{s}$ is the fraction of all treated folks in bucket $s$. You might start with 10 $PS$ buckets, and they don't need to have the same width. Note that some treated observations may not have any matches! This is known as the common support problem.
- Yet another way would be to grab all control group members within a fixed radius of treated unit $i$ and use them as the counterfactuals. Call them group $J_{i}$. The treatment effect is $\frac{1}{N_{T}}\sum_{i}(Y_{i}-\bar{Y}_{J_{i}})$, where $N_{T}$ is the number of treated units and $\bar{Y}_{J_{i}}$ is the mean outcome of the controls within the radius around unit $i$. The bandwidth problem here takes the form of picking the radius.
- Kernel matching. Here you weight the control group observations that are further away in PS less heavily, possibly not at all.
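The bucket/interval estimator from the list above can be sketched like this (a toy illustration; the function name `stratified_att` and the bucket edges are hypothetical choices):

```python
import numpy as np

def stratified_att(ps, y, treated, edges):
    """Stratification estimator: split units into propensity score buckets,
    take the treated-vs-control mean outcome difference within each bucket,
    and weight by the share of treated units falling in that bucket."""
    ps, y, treated = map(np.asarray, (ps, y, treated))
    att, n_treated = 0.0, treated.sum()
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_s = (ps >= lo) & (ps < hi)
        t, c = in_s & (treated == 1), in_s & (treated == 0)
        if t.any() and c.any():               # skip buckets lacking common support
            w_s = t.sum() / n_treated         # fraction of treated folks in bucket s
            att += (y[t].mean() - y[c].mean()) * w_s
    return float(att)
```

Note that the `if t.any() and c.any()` guard is exactly where the common support problem shows up: treated units in a bucket with no controls simply contribute nothing here.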
How do you pick a method? All matching estimators are consistent, because as the sample gets arbitrarily large, the units being compared get arbitrarily close to one another in terms of their characteristics. In finite samples, which one you choose can make a difference. If comparison observations are few, single nearest neighbor matching without replacement is a bad idea. If comparison observations are many and are evenly distributed, multiple nearest neighbor matching will make use of the rich comparison group data.
If comparison observations are many but unevenly distributed (check the PS kernel densities for the two groups), kernel matching is helpful because it will use the additional data where it exists, but not take bad matches where it does not exist.
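As a rough sketch of kernel matching, here with a Gaussian kernel (one common choice; the function name and the bandwidth value are illustrative assumptions):

```python
import numpy as np

def kernel_att(ps_treated, ps_control, y_treated, y_control, bandwidth=0.06):
    """Kernel matching: each treated unit's counterfactual is a weighted
    average of ALL control outcomes, with weights shrinking as the
    control's propensity score gets further away."""
    ps_control = np.asarray(ps_control, dtype=float)
    y_control = np.asarray(y_control, dtype=float)
    diffs = []
    for ps_i, y_i in zip(ps_treated, y_treated):
        # Gaussian kernel weights: distant controls get weight near zero
        w = np.exp(-0.5 * ((ps_control - ps_i) / bandwidth) ** 2)
        diffs.append(y_i - np.average(y_control, weights=w))
    return float(np.mean(diffs))
```

This makes the "use the data where it exists" property concrete: controls far from a treated unit's score still enter the average, but with essentially zero weight. The bandwidth plays the same tuning role the radius plays above.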
One complication is that the standard errors don't take into account that you estimated the propensity score (since the real thing is not observed), so they are too small. People either ignore this or bootstrap, which may or may not be a bad idea.
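If you do bootstrap, the key point is to resample units and redo the whole procedure, including re-estimating the propensity score, inside each replication. A bare-bones sketch, where the `estimator` callback is a placeholder for your full estimate-score-then-match pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_se(X, y, treated, estimator, n_boot=200):
    """Bootstrap SE for a matching estimator. `estimator(X, y, treated)`
    must re-run the ENTIRE pipeline (propensity score estimation AND
    matching) on the resampled data, so the SE reflects the fact that
    the score itself is estimated."""
    n = len(y)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)           # resample units with replacement
        stats.append(estimator(X[idx], y[idx], treated[idx]))
    return float(np.std(stats, ddof=1))
```

Whether this is valid for your particular matching estimator is a separate question (hence "may or may not be a bad idea" above); this only shows the mechanics.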
You basically have to create a wide format dataset with all the characteristics that are relevant for the matching procedure, perform the matching on this cross-sectional dataset, and then use the ID to identify the matched pairs in the panel dataset. Here are some more details:
- Use reshape to create a wide format dataset. Format the pre-treatment variables in the way you want to use them in the matching procedure. You can just take the average of your variables if you have multiple observations for one individual, but you can also come up with other ways (you can also keep multiple observations of the same variables, such as health1 and health2, and use all of them in the matching). The goal is to have a dataset with one observation per individual.
- Using this dataset, perform the matching procedure with psmatch2.
- Merge the information about the matched cases with the original dataset. Drop cases that are not matched, etc. I am not sure about the details here because I don't really know Stata and psmatch2, but I think you get the idea.
Using these steps, you can match cases based on all pre-treatment information and you only have one match per treatment unit.
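For what it's worth, the same three steps can be sketched in Python/pandas if you are not tied to Stata. All column names and the toy data here are made-up stand-ins, and the matching rule is a crude 1-NN on a single covariate where psmatch2 would do proper propensity score matching:

```python
import pandas as pd

# Hypothetical long-format panel: one row per (id, period).
panel = pd.DataFrame({
    "id":      [1, 1, 2, 2, 3, 3, 4, 4],
    "period":  [1, 2] * 4,
    "treated": [1, 1, 1, 1, 0, 0, 0, 0],
    "health":  [3.0, 4.0, 2.0, 2.5, 3.5, 3.0, 2.2, 2.4],
})

# Step 1: collapse to one row per individual (here: averaging the
# pre-treatment variable), the analogue of Stata's reshape/collapse.
wide = panel.groupby("id").agg(treated=("treated", "first"),
                               health=("health", "mean")).reset_index()

# Step 2: match on the cross-section (1-NN on `health` as a stand-in
# for the estimated propensity score; psmatch2 would go here in Stata).
t = wide[wide.treated == 1]
c = wide[wide.treated == 0]
matches = {row.id: c.loc[(c.health - row.health).abs().idxmin(), "id"]
           for row in t.itertuples()}

# Step 3: carry the matched-pair IDs back into the panel via the id.
wide["match_id"] = wide["id"].map(matches)
panel = panel.merge(wide[["id", "match_id"]], on="id", how="left")
```

After the merge, every treated row in the panel carries the ID of its matched control, so you can line up their outcome paths over time.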
I do not know of a reference that discusses ANCOVA, but the following paper discusses regression modeling applied to matched data:
Ho, Daniel, Kosuke Imai, Gary King, and Elizabeth Stuart. "Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference." Political Analysis 15 (2007): 199-236.
Chapters 9 and (more particularly) 10 of Gelman & Hill (Data Analysis Using Regression and Multilevel/Hierarchical Models) also discuss the topic.