I'd like to have a clearer idea of the optimal approach to the post-double selection LASSO (paper, webpage). Take data on an RCT with 2 treatment arm dummies $D_1, D_2$ and a potential driver of heterogeneous treatment effects $Z$.
One possibility is to run the PDS lasso on our outcome variable $Y$ and the pooled treatment dummy $D$, and subsequently use the chosen variables in all other regressions, including those with potentially different specifications, such as heterogeneous effects with respect to $Z$.
On the other hand, running a separate PDS lasso for each specification raises its own questions:
- Imagine I want to test the coefficient on the pooled treatment $D$ against the coefficient on treatment 2, $D_2$. Should I first run the PDS lasso with $D$ as exogenous to get the coefficient on $D$, then another PDS lasso with $D_1, D_2$ as exogenous to get the coefficient on $D_2$, and then run the test? This feels a bit strange, since we would be using potentially different controls in each regression instead of testing the difference within the exact same specification.
- Imagine I want to run a regression with heterogeneous treatment effects, such as
$$ Y_i = \beta^\prime X_i + \beta_Z Z_i + \rho_{D} D_i + \rho_{D\cdot Z} D_i \cdot Z_i + u_i$$
Should $D_i \cdot Z_i$ also be included as an exogenous variable in the first-step variable selection? Moreover, if I don't treat it as exogenous, the standard error of its coefficient will not be valid in the PDS lasso output. Would I then need to re-estimate the model by OLS with the selected variables?
- Is it preferable to add a small-sample correction to the standard errors obtained from the PDS lasso?
I feel that the first option -- a single PDS lasso selection of controls per outcome variable, whose selected controls are then reused in any additional specification we might want to try -- makes more sense, since it creates a comparable framework throughout the analysis. Am I missing something?
Best Answer
Let me first briefly summarize the setting: we have a scalar treatment variable $D_i$, a grouping variable $Z_i$ (the driver of heterogeneity), and controls $X_i$, which can be high-dimensional (i.e. many controls relative to the sample size).
If we ignore treatment effect heterogeneity, our model is simply: $ Y_i = \alpha D_i + X_i'\beta + \epsilon_i $
The model has two parts: a low-dimensional part ($D_i$) and a high-dimensional part comprising all the controls. The aim of the analysis is to estimate the treatment effect $\alpha$ -- we don't really care about the $\beta$ parameter. On the other hand, ignoring $X$ would lead to omitted variable bias.
The Post Double Selection Lasso approach involves two auxiliary Lasso regressions: $Y$ against $X$, and $D$ against $X$. The union of selected controls gives us our full set of controls, which we will use in the final OLS regression. You can obtain asymptotically valid standard errors for the treatment effect. (This is not so easy for the high-dimensional parameters.)
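As an illustration, here is a minimal sketch of the two auxiliary Lasso regressions and the final OLS step in Python. The simulated data and all variable names are my own; note that I use cross-validation to choose the penalty for simplicity, whereas `pdslasso` uses rigorous plug-in penalty rules:

```python
# Minimal post-double-selection (PDS) sketch on simulated data.
# The cross-validated penalty is a simplification; pdslasso uses
# theory-driven ("rigorous") penalty choices instead.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p = 200, 50
X = rng.normal(size=(n, p))
# Treatment depends on X[:, 0], so omitting it would bias the estimate.
D = (X[:, 0] + rng.normal(size=n) > 0).astype(float)
Y = 2.0 * D + 1.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)

def lasso_selected(target, X):
    """Indices of controls with nonzero Lasso coefficients."""
    return set(np.flatnonzero(LassoCV(cv=5).fit(X, target).coef_))

# Auxiliary regressions: Y on X and D on X; keep the union of controls.
selected = sorted(lasso_selected(Y, X) | lasso_selected(D, X))

# Final OLS of Y on D plus the selected controls.
W = np.column_stack([np.ones(n), D, X[:, selected]])
coef, *_ = np.linalg.lstsq(W, Y, rcond=None)
alpha_hat = coef[1]  # estimate of the treatment effect (true value: 2)
```

The union step is what makes the procedure "double": a control that predicts $D$ but only weakly predicts $Y$ (or vice versa) still enters the final regression, which is what protects the inference on $\alpha$ against selection mistakes.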
To your question, which I summarize as: *How can we accommodate a grouping variable $Z$?*
For simplicity, say we have only two groups (male/female) and $Z_i$ is a dummy for female. Our model becomes: $ Y_i = \alpha D_i + \alpha_F (D_i Z_i) + X_i'\beta + \epsilon_i $.
Our low-dimensional part now includes two variables. That's perfectly fine, as long as the low-dimensional part doesn't get "too" large relative to the sample size. The PDS algorithm now has three auxiliary Lasso regressions: $Y\rightarrow X$, $D\rightarrow X$, $(DZ)\rightarrow X$. Again, our final OLS regression includes the union of selected controls.
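Under the same caveats as before (simulated data, made-up variable names, cross-validated penalty rather than the rigorous penalty used by `pdslasso`), the heterogeneous case only adds one auxiliary Lasso and one regressor to the final OLS:

```python
# PDS sketch with an interaction term: three auxiliary Lassos
# (Y -> X, D -> X, D*Z -> X), then OLS of Y on D, D*Z and the
# union of selected controls. Illustrative, not pdslasso itself.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(1)
n, p = 300, 40
X = rng.normal(size=(n, p))
Z = rng.integers(0, 2, size=n).astype(float)  # e.g. female dummy
D = rng.integers(0, 2, size=n).astype(float)  # randomized treatment
DZ = D * Z
Y = 1.0 * D + 0.5 * DZ + 1.0 * X[:, 0] + rng.normal(size=n)

def lasso_selected(target, X):
    """Indices of controls with nonzero Lasso coefficients."""
    return set(np.flatnonzero(LassoCV(cv=5).fit(X, target).coef_))

# Union over the three auxiliary regressions.
selected = sorted(
    lasso_selected(Y, X) | lasso_selected(D, X) | lasso_selected(DZ, X)
)

# Final OLS with both low-dimensional variables included.
W = np.column_stack([np.ones(n), D, DZ, X[:, selected]])
coef, *_ = np.linalg.lstsq(W, Y, rcond=None)
alpha_hat, alpha_F_hat = coef[1], coef[2]  # true values: 1.0 and 0.5
```

Because both $D$ and $DZ$ sit in the low-dimensional part and get their own auxiliary Lasso, the final OLS gives valid standard errors for both $\alpha$ and $\alpha_F$, which answers the question about the interaction's standard error above.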
The `pdslasso` package in Stata allows for multiple treatment/low-dimensional variables, so not much to worry about.

Additional comments:
A small-sample correction of the standard errors is available in `pdslasso` and referred to as "CHS" (due to Chernozhukov, Hansen, Spindler 2015). Check the `pdslasso` help file for more information.

References:

Chernozhukov, V., C. Hansen, and M. Spindler (2015). "Post-Selection and Post-Regularization Inference in Linear Models with Many Controls and Instruments." American Economic Review, 105(5), 486-490.