When matching based on the Mahalanobis distance (MD), are there guidelines for selecting the caliper? For example, if the propensity score is used as the distance metric, the literature supports starting with 0.2 standard deviations of the logit of the propensity score. Does something similar exist for the MD?
Solved – selecting caliper for Mahalanobis distance matching
causality, distance, matching
Related Solutions
The procedure you described is not propensity score matching but rather propensity score subclassification. In propensity score matching, pairs of units are selected based on the difference between their propensity scores, and unpaired units are dropped. Both methods are popular ways of using propensity scores to reduce imbalance that causes confounding bias in observational studies.
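To make the distinction concrete, here is a minimal sketch of propensity score subclassification on synthetic data. It is in Python rather than R, all names and the data-generating process are my own, and the quintile cut and stratum-size weighting are just one common choice:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
x = rng.normal(size=n)                    # a single measured confounder
ps = 1 / (1 + np.exp(-x))                 # true propensity score
treat = rng.random(n) < ps                # treatment depends on x -> confounding
y = 2.0 * treat + x + rng.normal(size=n)  # outcome; true treatment effect = 2

# Subclassification: cut the propensity score at its quintiles, take the
# treated-control mean difference within each stratum, and average those
# differences weighted by stratum size (an ATE-style estimate).
cuts = np.quantile(ps, [0.2, 0.4, 0.6, 0.8])
strata = np.digitize(ps, cuts)            # stratum index 0..4 for each unit
est = sum(
    (strata == s).mean()
    * (y[(strata == s) & treat].mean() - y[(strata == s) & ~treat].mean())
    for s in range(5)
)
# est should land near the true effect of 2, whereas the raw difference
# y[treat].mean() - y[~treat].mean() is biased upward by the confounder.
```

Matching would instead pair individual treated and control units and drop the rest; subclassification keeps every unit and adjusts by stratifying.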
In propensity score matching, the distance between two units is the difference between their propensity scores, and propensity scores are computed from the covariates, so by propensity score matching, you are matching based on a distance measure and covariate values. There are other distance measures that don't involve the propensity score that are frequently used in matching, like the Mahalanobis distance. Some studies show the Mahalanobis distance works better than the propensity score difference as a distance measure and some studies show it doesn't. The relative performance of each depends on the unique characteristics of the dataset; there is no way to provide a single rule that is always true about which method is better. Both should be tried. You can also include the propensity score as a covariate in the Mahalanobis distance.
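As an illustration of the Mahalanobis distance as a matching metric, here is a hedged sketch in Python (the answers below point to R's MatchIt, which does this properly; the greedy nearest-neighbor rule and the toy data here are just the simplest possible choices):

```python
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X_t = rng.normal(0.5, 1.0, size=(5, 2))   # covariates of 5 treated units
X_c = rng.normal(0.0, 1.0, size=(20, 2))  # covariates of 20 control units

# The Mahalanobis metric is defined by the inverse covariance matrix of the
# pooled covariates, so correlated or high-variance covariates are not
# over-counted the way they would be with plain Euclidean distance.
VI = np.linalg.inv(np.cov(np.vstack([X_t, X_c]), rowvar=False))
D = cdist(X_t, X_c, metric="mahalanobis", VI=VI)  # 5 x 20 distance matrix

# Greedy 1:1 nearest-neighbor matching without replacement: each treated
# unit takes the closest control that has not already been used.
available = set(range(X_c.shape[0]))
matches = {}
for i in range(X_t.shape[0]):
    j = min(available, key=lambda k: D[i, k])
    matches[i] = j
    available.discard(j)
```

Including the propensity score in the distance, as mentioned above, would simply mean appending it as an extra column of the covariate matrices before computing `VI` and `D`.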
If your question is more about why we would ever do propensity score subclassification when we could do propensity score matching, there are a few considerations. As before, you should always use whichever method yields the best balance in your sample. Propensity score subclassification may do a better job at achieving balance in some datasets and propensity score matching in others. There is no reason to unilaterally decide to use one method over another. Subclassification allows you to estimate the ATT or ATE, whereas most matching methods only allow the ATT. Subclassification is closely related to propensity score weighting when used in certain ways, whereas matching typically doesn't assign nonuniform weights to individuals. With matching, you can customize the specification more (e.g., by using a caliper, by changing the ratio of controls to treated, etc.), whereas with subclassification the opportunities for customization are more limited. The distinction between matching and subclassification is blurred in the face of full matching, which is a hybrid between the two that often performs better than each. Some papers have compared the performance of the two methods, but as I mentioned before, it is important not to rely on general results and instead try both methods in your sample.
Check out the documentation for the MatchIt R package, which goes into detail on several matching methods and discusses some of their relative merits and methods of customization.
Propensity score methods are one type of method used to adjust for confounding. There are several others that rely on different assumptions. Some of the most popular include difference-in-differences, which relies on an assumption about stability over time, and instrumental variable analysis, which relies on an assumption about the randomization of some other variable. A third class comprises methods that assume all confounding variables have been measured. I highly recommend the 2020 article by Matthay et al. for a comparison of these methods.
Propensity score methods fall into the latter class. Other methods also fall into this class, including regression adjustment, g-methods, and doubly robust methods. These are all different ways of adjusting for confounding by measured covariates by conditioning on them in certain ways. They differ primarily in their statistical performance under various assumptions about the functional form of the treatment and outcome processes.
There are several ways to use propensity scores, including matching (which you described), weighting, subclassification, and regression adjustment, and there are ways to perform each of these methods without propensity scores. I mention all of this so that you see propensity scores as one particular implementation of methods that themselves are members of a broad class of methods that is one of several classes of methods one can use to adjust for confounding. Propensity score methods are not necessarily superior to any of them, and their ubiquity is likely a cultural artifact rather than truly justified by their statistical performance.
Here are a few reasons (and rebuttals) why propensity scores may be popular:
- They are easy to implement (but only in their most basic, poorest performing way; to use them well requires extensive knowledge)
- They are easy to explain to lay audiences (but so are many methods that don't involve propensity scores, like other matching methods)
- They tend to be effective at removing bias due to confounding (but several methods are demonstrably better, especially better than propensity score methods as most commonly used)
- They separate the design and analysis phase, leading to more replicable research and decreasing model dependence (but when used poorly can increase model dependence and are not immune to snooping and nefarious or misguided use)
- They are implemented in most statistical software (but so are many other methods, and they are implemented differently in each software)
- They are a form of dimension reduction in high-dimensional datasets (but there are other ways to reduce dimensionality, and still propensity scores are used even to adjust for a few covariates)
- They rely less on modeling assumptions than regression-based methods (but there are many other methods that also allow for extreme flexibility with often improved performance)
- They sound fancy and make the analyst look sophisticated (but experienced statisticians can easily point out the errors amateur users constantly make)
(You might think I am biased against propensity scores, but check the propensity-scores tag and see my involvement. I'm also the author of several R packages to facilitate the use of propensity score methods.)
In my opinion, propensity scores are overused (or, at best, under-justified) in the medical literature. There are so many better performing and more sophisticated methods that rely on the same assumptions as propensity score methods do that are under-appreciated in medical research, often because the analysts and reviewers in medical research are not familiar with them. I hope to encourage people to consider propensity scores as one option in a vast sea of options, each of which has its own advantages and disadvantages that make it more or less suitable for a given problem. To decide which option is the best for a given problem requires the assistance of a statistician specially trained in the area of causal effect estimation.
Best Answer
As with all matching, pick the caliper that yields the best balance after matching, while also considering the number and covariate range of the treated units that remain. The right caliper depends on the characteristics of your data set. I know this doesn't really answer your question, but you should know that the ubiquitous 0.2 standard deviations of the logit of the propensity score is itself arbitrary.
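The "try several calipers and compare balance" advice can be sketched as follows. This is a hedged toy example in Python with a single covariate and matching with replacement; the `smd` helper and candidate caliper grid are my own choices, and real analyses would use dedicated software (e.g., MatchIt and cobalt in R) and all covariates:

```python
import numpy as np

def smd(a, b):
    """Standardized mean difference between two groups on one covariate."""
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

rng = np.random.default_rng(1)
x_t = rng.normal(0.4, 1.0, 50)   # treated covariate, shifted -> imbalance
x_c = rng.normal(0.0, 1.0, 500)  # control covariate

# For each candidate caliper (in pooled-SD units), match each treated unit
# to its nearest control (with replacement, for simplicity), discard pairs
# farther apart than the caliper, and record balance and sample size.
pooled_sd = np.sqrt((x_t.var(ddof=1) + x_c.var(ddof=1)) / 2)
results = {}
for caliper in (0.05, 0.2, 0.5):
    pairs = [(t, x_c[np.abs(x_c - t).argmin()]) for t in x_t]
    kept = [(t, c) for t, c in pairs if abs(t - c) <= caliper * pooled_sd]
    mt = np.array([t for t, _ in kept])
    mc = np.array([c for _, c in kept])
    results[caliper] = (len(kept), smd(mt, mc))
# A tighter caliper tends to improve balance (smaller SMD) at the cost of
# dropping more treated units; choose the one with the best trade-off.
```

The same loop applies unchanged when the distance is the Mahalanobis distance rather than a difference on one covariate: the caliper is still just a threshold on the chosen distance, and balance after matching is still the criterion.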