Solved – PSM on panel data: R-square is low in the (first-stage) logit regression

panel data, propensity-scores

I'm working on a project embedded in a "PSM + DID" framework: first, use propensity score matching (PSM) to select comparable treatment and control groups, and then use difference-in-differences (DID) to estimate the treatment effect. In my project, the treatment is a manufacturing firm purchasing land, and the outcomes are the firm's investment ratio and loan volume in the following years. My data are longitudinal.

The problem is that firms buy land in different years. When I compute the propensity score using a logit regression in the first stage, I face three choices.

The most straightforward one is to match samples within a given year. For example, I only consider firms appearing in 2006 as potential matches for firms that purchased land in that same year. This approach is logically reasonable. In practice, however, purchasing land is a very low-frequency event for manufacturing firms, so in any given year only a very small fraction of firms do it. As a result, even though I control for a comprehensive set of covariates, the R-square of the logit regression is very low (well below 0.01). It is hard to trust such a selection model.

Alternatively, we can pick a finer stratum. Initially, we search for matching pairs within the same year. Can we further restrict the stratum to "same year + same industry category" to increase the explanatory power of the logit regression? I haven't tried this approach.

The third choice: can we define a period longer than one year as the treatment period? Say, we define the firms that purchased land from 2005 to 2007 as the treatment group, and then choose the counterpart control group from the same period. I have tried this approach and it did improve the R-square. However, it is not very convincing logically.

I would also like to consult the existing literature on this issue. Please point me to a reference if you have come across one. Thank you so much.

PS: Another point of confusion: should I do the PSM at year T or year T-1 (where T is the year in which the treatment is implemented)?

Best Answer

The reason to use propensity scores is to create balanced groups on your set of covariates. Your R-square, the plausibility of your selection model, and any other considerations about the propensity score model are irrelevant, except for its ability to achieve balance on your covariates (as long as you only use covariates measured prior to treatment). This is called the "propensity score tautology" and is described in Ho, Imai, King & Stuart (2007, p.219). The R-square is a poor method of assessing the effectiveness of the propensity score at achieving balance and you should ignore it.
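Since balance, not R-square, is the criterion, one common way to check it is the standardized mean difference (SMD) of each covariate between the treated and control groups, before and after matching. The sketch below is only an illustration: the covariate values are made up, the function name is hypothetical, and the 0.1 threshold mentioned in the comment is a widely used rule of thumb rather than something stated in this answer.

```python
from math import sqrt
from statistics import mean, variance


def standardized_mean_difference(treated, control):
    """SMD = (mean_t - mean_c) / sqrt((var_t + var_c) / 2), using sample variances."""
    pooled_sd = sqrt((variance(treated) + variance(control)) / 2)
    return (mean(treated) - mean(control)) / pooled_sd


# Hypothetical pre-treatment covariate (e.g. log assets) for illustration only.
treated_size = [5.0, 6.0, 7.0, 8.0]
all_controls_size = [2.0, 3.0, 4.0, 5.0]      # full control pool, before matching
matched_controls_size = [5.5, 6.0, 7.0, 7.5]  # controls kept after matching

smd_before = standardized_mean_difference(treated_size, all_controls_size)
smd_after = standardized_mean_difference(treated_size, matched_controls_size)

# An absolute SMD below roughly 0.1 is often read as acceptable balance (rule of thumb).
print(f"SMD before matching: {smd_before:.2f}")
print(f"SMD after matching:  {smd_after:.2f}")
```

If the SMDs of all covariates are small after matching, the propensity score model has done its job, whatever its R-square.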

One approach I would recommend is to do a propensity score match separately within each year, then combine the matched datasets across the years for the outcome analysis. In your outcome analysis, include year as a covariate.
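A minimal sketch of matching within each year and then pooling the matched pairs might look like the following. It assumes the propensity scores have already been estimated and stored in a hypothetical `ps` field, and it uses greedy 1:1 nearest-neighbour matching without replacement and a caliper of 0.05; those are illustrative choices, not part of the recommendation above. It also ignores the panel-specific question of whether a firm treated in a later year may serve as a control in an earlier one.

```python
from collections import defaultdict


def match_within_year(firms, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on the propensity score,
    without replacement, done separately within each year.

    firms: list of dicts with keys 'id', 'year', 'treated' (0/1) and 'ps'.
    Returns a pooled list of matched pairs across all years.
    """
    by_year = defaultdict(list)
    for firm in firms:
        by_year[firm["year"]].append(firm)

    pairs = []
    for year in sorted(by_year):
        treated = [f for f in by_year[year] if f["treated"] == 1]
        controls = [f for f in by_year[year] if f["treated"] == 0]
        for t in treated:
            if not controls:
                break
            # Closest remaining control from the same year.
            best = min(controls, key=lambda c: abs(c["ps"] - t["ps"]))
            if abs(best["ps"] - t["ps"]) <= caliper:
                pairs.append(
                    {"year": year, "treated_id": t["id"], "control_id": best["id"]}
                )
                controls.remove(best)  # without replacement
    return pairs


# Hypothetical firm-year records with pre-computed propensity scores.
firms = [
    {"id": "A", "year": 2005, "treated": 1, "ps": 0.30},
    {"id": "B", "year": 2005, "treated": 0, "ps": 0.28},
    {"id": "C", "year": 2005, "treated": 0, "ps": 0.10},
    {"id": "D", "year": 2006, "treated": 1, "ps": 0.40},
    {"id": "E", "year": 2006, "treated": 0, "ps": 0.41},
    {"id": "F", "year": 2006, "treated": 0, "ps": 0.29},
    {"id": "G", "year": 2006, "treated": 1, "ps": 0.90},
]

pairs = match_within_year(firms)
for pair in pairs:
    print(pair)
```

In this toy data, firm A is matched to B rather than to F, even though F's score is closer, because F is only observed in 2006. Firm G stays unmatched because no remaining 2006 control falls within the caliper. The `year` field kept on each pair is what allows year to enter the outcome model as a covariate.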

Regarding your final question: if the treatment actually occurred at year T, then you need to balance on covariates that were measured at year T and before. If any covariate measured at year T could have been affected by treatment status, you must exclude it from the propensity score model and balancing procedures.
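When it is unclear whether a year-T measurement could already reflect the land purchase, one conservative option is to take the covariates from year T-1 instead. This is an assumption added here for illustration, since the answer itself allows year-T covariates as long as they are unaffected by treatment. The sketch below uses a hypothetical panel keyed by `(firm_id, year)` and made-up covariate names.

```python
def lagged_covariates(panel, firm_id, treatment_year, names, lag=1):
    """Return the covariates of firm_id observed `lag` years before
    treatment_year, or None if that firm-year is missing from the panel."""
    row = panel.get((firm_id, treatment_year - lag))
    if row is None:
        return None
    return {name: row[name] for name in names}


# Hypothetical panel: firm A purchases land in 2006.
panel = {
    ("A", 2005): {"log_assets": 5.0, "leverage": 0.40},
    ("A", 2006): {"log_assets": 5.6, "leverage": 0.55},
}

# Covariates for the propensity score model come from 2005 (T-1),
# so they cannot have been affected by the 2006 purchase.
x = lagged_covariates(panel, "A", 2006, ["log_assets", "leverage"])
print(x)
```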
