Feature Selection – Applying Feature Selection and Propensity Score Matching

Tags: causality, difference-in-differences, feature-selection, matching, propensity-scores

After reading the OHDSI section on variable selection for population-level effect estimation, I set out to add additional covariates to my process. As suggested, I began implementing regularized regression (elastic net) for what I eventually realized was basic feature selection.

To clarify, the goal of my PSM implementation is to control for confounding in the estimation of the ATT (average treatment effect in the treated) and, eventually, ITEs (individual treatment effects).

When I've approached this topic before, the guidance seemed unclear (to give a single example), and the discussion ultimately descends into a larger debate about using the preprocessing and model-evaluation techniques common in prediction tasks versus what I'm actually trying to do: control for confounding for causal inference. Even Stack suggested I start there.

But my question is: what strikes the best balance? Does increased PS "accuracy" better control for confounding? More specifically:

  1. Should highly correlated features be removed prior to generating the propensity score? After all, PSM is, at its core, a logistic regression. Yet that step does not appear to be common.
  2. If we are, or should be, performing variable selection, are all methodologies on the table? It would seem we'd want to use the highest-performing method, be it RFE with decision trees or regularized regression.
  3. Per the above, does this require including goodness-of-fit / discrimination metrics for the resulting PS in subsequent reports to "validate" the selected features? I can't imagine this being helpful to any but the most esoteric among us.
  4. If the above are true, shouldn't we also consider more advanced algorithms such as neural nets* for PS specification? This would also alleviate some of the correlation / selection issues from above, but it does not seem to be a popular method.

Ultimately, if the resulting population is prognostically balanced, does a better PS specification matter? PS matching seems to occupy a gray space with respect to its use of regression. I've been content to explain away some of this incongruity as "prediction is not the goal; controlling for confounding is" and put it to bed with the larger "explain vs. predict" conversation. I'm simply trying to identify the most robust process without being superfluous.

*Please don't mistake me for someone trying to throw a neural net at a simple regression problem.

Best Answer

The goal of propensity score matching (PSM) is to adjust for confounding by achieving covariate balance on a sufficient set of covariates required to nonparametrically identify the causal effect. Covariate balance is the degree to which the treatment is independent of the covariates, or, equivalently, how similar the covariate distributions are between the treatment groups. The set of variables required to nonparametrically identify the causal effect (i.e., a sufficient adjustment set) is a theoretical matter that cannot be decided by statistical modeling; it requires substantive beliefs about the relationships among the treatment, outcome, and covariates. I discuss some of that here.
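
For concreteness, one common scalar summary of (im)balance on a single covariate, discussed in Austin (2009), is the standardized mean difference (SMD):

$$ \mathrm{SMD} = \frac{\bar{X}_{\text{treated}} - \bar{X}_{\text{control}}}{s} $$

where $s$ is a standardization factor, such as the standard deviation in the treated group or a pooled standard deviation. SMDs near zero after matching indicate balance on that covariate's mean, though balance should be assessed on more than just means, as discussed in point 3 below.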

Propensity scores and the models used to estimate them are not meant to be interpreted, and therefore those models need not be parsimonious. Unlike most prediction tasks, propensity score estimation is not about achieving the best accuracy in predicting the probability of class membership; rather, it is about finding propensity scores that yield the best balance. Therefore, propensity score models should not be evaluated on their predictive performance but rather on their ability to achieve balance.

There are many ways to estimate propensity scores, and none can be known to be superior at the outset; the best one is the one that achieves the best balance, so many should be tried. It may be that an elastic net propensity score model yields the best balance, but the fact that such a model performs variable selection is irrelevant. The selection does not tell you anything about which variables need to be controlled for and balanced by the matching; it is solely one of many possible propensity score models. A variable selected out of the final propensity score model is not a variable that no longer needs to be adjusted for; it is just a variable that, when removed, yields the best-performing model. When there are very many covariates and a small treatment group, it is often the case that the best-performing models will involve regularization or variable selection. But the results of such models do not speak to any substantive issue in the problem at hand.
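
To make this concrete, here is a minimal sketch, assuming a recent version of MatchIt (4.x, which bundles the lalonde dataset and the "elasticnet" distance) and cobalt, of fitting two candidate propensity score models and comparing them on balance rather than on predictive fit:

```r
# Fit several candidate propensity score models and judge them by the
# balance they produce, not by their predictive accuracy.
library(MatchIt)
library(cobalt)

data("lalonde", package = "MatchIt")
f <- treat ~ age + educ + race + married + nodegree + re74 + re75

# Same matching method, different propensity score models
m_glm <- matchit(f, data = lalonde, method = "nearest", distance = "glm")
m_en  <- matchit(f, data = lalonde, method = "nearest",
                 distance = "elasticnet")  # fits via glmnet; requires glmnet installed

# Compare the models on covariate balance (standardized mean differences
# against a conventional 0.1 threshold), not on AUC or log loss
bal.tab(m_glm, stats = "mean.diffs", thresholds = c(m = .1))
bal.tab(m_en,  stats = "mean.diffs", thresholds = c(m = .1))
```

Whichever model yields the smaller imbalances is the better propensity score model for this purpose, regardless of which covariates it selected or how well it discriminates.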

With this in mind, I will answer your four questions:

  1. Typically, there is no reason to remove correlated covariates prior to estimating the propensity score. Highly correlated variables don't really affect the predicted values, which are all that are used in PSM. It may happen to be the case that a model without the correlated covariates yields the best balance, but there is neither a rule nor a suggestion to remove correlated covariates.
  2. All prediction methods are on the table for estimating propensity scores, whether they involve variable selection or not. Random forests, boosted trees, neural nets, regularized regression, and many other methods are commonly used to estimate propensity scores and are implemented in several of the major propensity score analysis R packages. Again, the way to judge the performance of a propensity score model is to assess the degree to which matching on the resulting propensity score yields covariate balance (as in the sketch above), which differs from the usual model-evaluation procedures in predictive modeling outside causal effect estimation. See my answers here and here for a more thorough discussion.
  3. The measures that should be reported are balance measures. Balance should be assessed and reported broadly by comparing many features of the covariate distributions between the treatment groups after matching. Again, the variables that need to be balanced are those required to identify the causal effect, not those that happen to be selected by a given propensity score model. See Austin (2009) and the cobalt documentation for good balance metrics to report (a minimal reporting sketch follows this list).
  4. Neural nets can absolutely be used. They are less commonly used than other methods because they are more complicated and require some degree of expertise; Collier et al. (2021) explain some of the relevant complexities. Their use in PSM has also been written about less than that of generalized boosted models and logistic regression, which is why those methods are more popular. PSM with a neural-net propensity score is implemented in the R package MatchIt, so it is available (see the second sketch after this list). It is also possible to use stacking methods like SuperLearner to compute propensity scores (Alam et al., 2019). For some problems, simple models like logistic regression may perform well; for others, more complex models or models that involve regularization or variable selection may perform better. Some matching methods don't even involve the propensity score, like cardinality matching and coarsened exact matching (both implemented in MatchIt).
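
For question 3, here is a minimal reporting sketch with cobalt, reusing the m_glm object from the earlier sketch; bal.tab() tabulates several balance statistics and love.plot() produces a figure commonly included in reports:

```r
# Report balance broadly: mean differences, variance ratios, and
# Kolmogorov-Smirnov statistics, before (un = TRUE) and after matching
library(cobalt)

bal.tab(m_glm,
        stats = c("mean.diffs", "variance.ratios", "ks.statistics"),
        un = TRUE, thresholds = c(m = .1))

# A Love plot of absolute standardized mean differences for the report
love.plot(m_glm, stats = "mean.diffs", abs = TRUE,
          thresholds = c(m = .1), var.order = "unadjusted")
```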
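
And for question 4, a sketch of PSM with a neural-network propensity score, assuming MatchIt 4.x, where distance = "nnet" estimates the score with nnet::nnet() and a hidden-layer size must be passed through distance.options:

```r
# Nearest-neighbor matching on a propensity score estimated by a
# single-hidden-layer neural network (via the nnet package)
library(MatchIt)

data("lalonde", package = "MatchIt")
m_nn <- matchit(treat ~ age + educ + race + married + nodegree + re74 + re75,
                data = lalonde, method = "nearest", distance = "nnet",
                distance.options = list(size = 3))  # size is a tuning choice

# As throughout: evaluate the result by balance, not predictive accuracy
summary(m_nn)
```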

Alam, S., Moodie, E. E. M., & Stephens, D. A. (2019). Should a propensity score model be super? The utility of ensemble procedures for causal adjustment. Statistics in Medicine, 38(9), 1690–1702. https://doi.org/10.1002/sim.8075

Austin, P. C. (2009). Balance diagnostics for comparing the distribution of baseline covariates between treatment groups in propensity-score matched samples. Statistics in Medicine, 28(25), 3083–3107. https://doi.org/10.1002/sim.3697

Collier, Z. K., Leite, W. L., & Zhang, H. (2021). Estimating propensity scores using neural networks and traditional methods: A comparative simulation study. Communications in Statistics - Simulation and Computation, 0(0), 1–16. https://doi.org/10.1080/03610918.2021.1963455
