Econometrics Regression – Best Practices for Post-Double Selection LASSO (pdslasso)

econometrics, feature selection, lasso, regression

I'd like to have a clearer idea of the optimal approach to the post-double selection LASSO (paper, webpage). Take data on an RCT with 2 treatment arm dummies $D_1, D_2$ and a potential driver of heterogeneous treatment effects $Z$.

One possibility is to run the PDS lasso on our outcome variable $Y$ and the pooled treatment dummy $D$, and then reuse the selected controls in all other regressions, including those with different specifications, such as heterogeneous effects with respect to $Z$.

On the other hand, if we run PDS lasso for each different specification:

  1. Imagine I want to test the coefficient of the pooled treatment $D$ against the coefficient on treatment 2 $D_2$. Should I run the pds lasso using $D$ as exogenous first to get the coefficient on $D$, and then another pds lasso using $D_1, D_2$ as exogenous to get the coefficient on $D_2$, and then run the test? I feel this is a bit strange, since we're using potentially different controls in each of the regressions instead of testing the difference based on the exact same specification.

  2. Imagine I want to run a regression with heterogeneous treatment effects, such as

$$ Y_i = \beta^\prime X_i + \beta_Z Z_i + \rho_{D} D_i + \rho_{D\cdot Z} D_i \cdot Z_i + u_i$$

Should I also have $D_i \cdot Z_i$ as an exogenous variable to be used in the first step of the variable selection? Also, if I don't use it as an exogenous variable, the standard errors of this coefficient will not be valid in the PDS lasso output. Would I then need to reestimate it in an OLS with the selected variables?

  3. Is it preferable to add a small-sample correction to the standard errors obtained in pds lasso?

I feel that the first option, with one single pds lasso selection of controls for each outcome variable, which then also selects controls for any additional specification we might want to try, seems to make more sense, creating a comparable framework throughout the analysis. Am I missing something?

Best Answer

Let me first briefly summarize the setting: We have a scalar treatment variable $D_i$, a grouping variable $Z_i$ (the driver of heterogeneity) and controls $X_i$, which can be high-dimensional (i.e. many controls relative to the sample size).

If we ignore treatment effect heterogeneity, our model is simply $Y_i = \alpha D_i + X_i'\beta + \epsilon_i$.

The model has two parts: a low-dimensional part ($D_i$) and a high-dimensional part comprising all the controls. The aim of the analysis is to estimate the treatment effect $\alpha$; we don't really care about the $\beta$ parameter. On the other hand, ignoring $X$ would lead to omitted variable bias.

The Post Double Selection Lasso approach involves two auxiliary Lasso regressions: $Y$ against $X$, and $D$ against $X$. The union of selected controls gives us our full set of controls, which we will use in the final OLS regression. You can obtain asymptotically valid standard errors for the treatment effect. (This is not so easy for the high-dimensional parameters.)
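The two-step logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the pdslasso implementation: the helper names (`lasso_cd`, `post_double_selection`), the simple plug-in penalty `c * sd * sqrt(2 log(p) / n)` and the simulated data are all hypothetical choices made here, whereas pdslasso uses its own theory-driven penalty. In the simulation the treatment is deliberately made to depend on one control (unlike a clean RCT) so that the $D$-on-$X$ step has something to select. The final OLS reports an HC1 standard error, i.e. a robust variance scaled by $n/(n-k)$.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate-descent Lasso on standardized columns (illustrative helper).

    Minimizes (1/2n)*||y - Xb||^2 + lam*||b||_1 after centering y and
    standardizing X; returns coefficients on the standardized scale.
    """
    n, p = X.shape
    Xs = X - X.mean(axis=0)
    sd = Xs.std(axis=0)
    sd[sd == 0] = 1.0
    Xs = Xs / sd
    r = y - y.mean()                 # residual when all coefficients are zero
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r += Xs[:, j] * b[j]     # add back coordinate j's contribution
            rho = Xs[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0)  # soft-threshold
            r -= Xs[:, j] * b[j]
    return b

def post_double_selection(y, d, X, c=1.1):
    """Select controls from Y~X and D~X, then run OLS of Y on D and the union."""
    n, p = X.shape
    base = c * np.sqrt(2 * np.log(p) / n)   # assumed penalty rule, not pdslasso's
    sel_y = np.flatnonzero(lasso_cd(X, y, base * y.std()))
    sel_d = np.flatnonzero(lasso_cd(X, d, base * d.std()))
    selected = np.union1d(sel_y, sel_d)

    # Final OLS: constant, treatment, union of selected controls
    W = np.column_stack([np.ones(n), d, X[:, selected]])
    coef, *_ = np.linalg.lstsq(W, y, rcond=None)
    resid = y - W @ coef

    # HC1 robust variance: sandwich estimator scaled by n / (n - k)
    k = W.shape[1]
    bread = np.linalg.inv(W.T @ W)
    meat = (W * resid[:, None] ** 2).T @ W
    V = bread @ meat @ bread * n / (n - k)
    return coef[1], np.sqrt(V[1, 1]), selected

# Simulated data (hypothetical): true treatment effect is 1.0
rng = np.random.default_rng(0)
n, p = 2000, 50
X = rng.standard_normal((n, p))
d = (X[:, 0] + rng.standard_normal(n) > 0).astype(float)
y = 1.0 * d + 2.0 * X[:, 0] + 1.5 * X[:, 1] + rng.standard_normal(n)

alpha_hat, se_hat, selected = post_double_selection(y, d, X)
print("alpha_hat:", alpha_hat, "HC1 se:", se_hat, "controls:", selected)
```

Note that the treatment itself is never penalized here: Lasso is applied only to the controls, and $D$ enters the final OLS unconditionally.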

To your question, which I summarize as: how can we accommodate a grouping variable $Z$?

For simplicity, say we have only two groups (male/female) and $Z_i$ is a dummy for female. Our model becomes $Y_i = \alpha D_i + \alpha_F (D_i Z_i) + X_i'\beta + \epsilon_i$.

Our low-dimensional part now includes two variables. That's perfectly fine, as long as our low-dimensional part doesn't get "too" large relative to the sample size. The PDS algorithm now has three auxiliary Lasso regressions: $Y\rightarrow X$, $D\rightarrow X$, $(DZ)\rightarrow X$. Again, our final OLS regression includes the union of the selected controls. The pdslasso package in Stata allows for multiple treatment/low-dimensional variables. So not much to worry about.
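In symbols, the selection step for this two-variable case can be written as follows. The coefficient vectors $\hat\gamma_V$ and index sets $\hat S_V$ are notation introduced here for illustration: $\hat\gamma_V$ denotes the Lasso coefficients from regressing $V \in \{Y, D, DZ\}$ on $X$.

$$ \hat S_V = \{\, j : \hat\gamma_{V,j} \neq 0 \,\}, \qquad \hat S = \hat S_Y \cup \hat S_D \cup \hat S_{DZ} $$

The final step is then an unpenalized OLS regression that keeps only the controls in $\hat S$:

$$ Y_i = \alpha D_i + \alpha_F (D_i Z_i) + X_{i,\hat S}'\beta_{\hat S} + \epsilon_i $$

Inference is on $\alpha$ and $\alpha_F$ only; $D_i$ and $D_i Z_i$ are always included and never subject to selection.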

Additional comments:

  1. As you say, an alternative, valid approach would be to estimate your model on sub-samples of your data (one estimation for female, one for male). That's more flexible, but also more costly.
  2. One rationale for using Lasso approaches is to allow for non-linear effects. So, depending on the dimension of $X$, I would recommend interacting your controls with one another. Also consider higher-order polynomials, splines etc.
  3. Related to the two previous points: If you go for the full sample approach, you should also consider interacting your controls with $Z$. You assume that the treatment effect varies with $Z$. Hence, it also seems plausible that the role of $X$ varies with $Z$.
  4. An alternative valid approach to PDS-Lasso relies on orthogonalization. You would run the same auxiliary Lasso regressions, but use the residuals in the final OLS regression. (This is also implemented in pdslasso and referred to as "CHS", after Chernozhukov, Hansen and Spindler 2015.) Check the pdslasso help file for more information.
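  5. You seem to conflate "exogeneity" and "low vs high-dimensionality". These are not the same thing: the distinction that matters here is between the low-dimensional variables you want inference on and the high-dimensional controls from which Lasso selects.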
  6. Addendum: If you have two treatments ($D_1$ and $D_2$) nothing changes. Again, the main constraint is that the low-dimensional part has to be finite and small relative to the sample size.
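The orthogonalization idea from point 4 can be sketched as follows. This is a hedged illustration, not what pdslasso does internally: the helper names (`lasso_cd`, `post_lasso_residual`), the simple plug-in penalty and the simulated data are assumptions made here, and the residuals are taken from an OLS refit on the Lasso-selected controls (a post-Lasso variant) rather than from the Lasso fit itself.

```python
import numpy as np

def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate-descent Lasso on standardized columns (illustrative helper)."""
    n, p = X.shape
    Xs = X - X.mean(axis=0)
    sd = Xs.std(axis=0)
    sd[sd == 0] = 1.0
    Xs = Xs / sd
    r = y - y.mean()
    b = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r += Xs[:, j] * b[j]
            rho = Xs[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0)  # soft-threshold
            r -= Xs[:, j] * b[j]
    return b

def post_lasso_residual(v, X, c=1.1):
    """Residual of v after OLS on the controls that Lasso selects for v."""
    n, p = X.shape
    lam = c * v.std() * np.sqrt(2 * np.log(p) / n)  # assumed penalty rule
    sel = np.flatnonzero(lasso_cd(X, v, lam))
    W = np.column_stack([np.ones(n), X[:, sel]])
    coef, *_ = np.linalg.lstsq(W, v, rcond=None)
    return v - W @ coef

# Simulated data (hypothetical): true treatment effect is 1.0
rng = np.random.default_rng(1)
n, p = 2000, 50
X = rng.standard_normal((n, p))
d = (X[:, 0] + rng.standard_normal(n) > 0).astype(float)
y = 1.0 * d + 2.0 * X[:, 0] + 1.5 * X[:, 1] + rng.standard_normal(n)

# Partial X out of both Y and D, then regress residual on residual
y_res = post_lasso_residual(y, X)
d_res = post_lasso_residual(d, X)
alpha_po = (d_res @ y_res) / (d_res @ d_res)
print("partialling-out estimate of alpha:", alpha_po)
```

With several low-dimensional variables (for example $D$ and $DZ$, or $D_1$ and $D_2$), each one would be residualized in the same way and the final step becomes a multivariate OLS of the residualized outcome on the residualized treatments.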
