Bivariate Probit Model – Handling Sample Selection and Bias in Econometrics


Could you please provide an example and an explanation of why one would use the bivariate probit model with sample selection?

In this context, what does sample selection bias refer to?

Best Answer

Unfortunately I lost the tex file for these notes, but they are only two pages, so I added screenshots:

I have a paper where we use this approach to look at what happens to bidders who lose to a sniper in their very first auction on eBay. A sniper is another participant who tries to place a bid in the final seconds of sequential ascending auctions with predetermined ending times. The outcome $y_1$ is binary: leaving the auction platform or not. The sniped dummy $y_2$ is in the outcome equation of $y_1$.

The reason you can't just put $y_2$ in as a regressor is that sniping is more likely to occur in markets where there are few bidders. It is these kinds of markets for which a marketplace like eBay is most attractive to buyers, implying that bidders in these markets may be more likely to return to eBay. Hence, a positive correlation between sniping and auction thinness, together with a positive correlation between auction thinness and the likelihood of returning to eBay, will bias downward any estimated effect that sniping has on bidders ceasing to bid in auctions. That is, there is selection into which auctions get sniped.
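The direction of that bias can be reproduced in a small simulation. This is only a sketch under made-up assumptions: the variable `thin` (unobserved market thinness), the coefficients 0.8 and 0.5, and the sample size are all hypothetical and are not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical unobserved market thinness (higher = fewer bidders).
thin = rng.normal(size=n)
e_snipe = rng.normal(size=n)
e_leave = rng.normal(size=n)

# Assumption: sniping is more likely in thin markets.
sniped = (0.8 * thin + e_snipe > 0).astype(int)

# Potential outcomes. Assumption: being sniped raises the propensity to
# leave (+0.5), while thin markets make bidders more likely to return (-0.8).
leave_if_sniped = (0.5 - 0.8 * thin + e_leave > 0).astype(int)
leave_if_not = (-0.8 * thin + e_leave > 0).astype(int)
leave = np.where(sniped == 1, leave_if_sniped, leave_if_not)

# True average effect uses both potential outcomes (never observable in real data).
true_effect = (leave_if_sniped - leave_if_not).mean()
# Naive comparison uses only what is observed.
naive_effect = leave[sniped == 1].mean() - leave[sniped == 0].mean()

print(f"true average effect of sniping on leaving: {true_effect:.3f}")
print(f"naive sniped-vs-not comparison:            {naive_effect:.3f}")
```

Because sniped bidders are disproportionately in thin markets, the naive comparison understates the effect of sniping on leaving, and with these made-up numbers it can even have the wrong sign.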

We use a recursive bivariate probit strategy to address this concern. There are two ways that bidding occurs on eBay. First, bidders can manually enter their bid into the proxy bidding system. Second, bidders can use sniping software that does this automatically in the last seconds of the auction, without requiring their attention. At nighttime, there are fewer manual bidders active on the site, and consistent with this we observe that more auctions are won by snipers. However, 10pm in New York is only 7pm in San Francisco, while 10pm in San Francisco is 1am in New York. Therefore a 10pm San Francisco bidder is much more likely to be sniped than a 10pm New York bidder. If these bidders are otherwise comparable conditional on observables, then one can use their respective time zones as an instrument for variation in the likelihood of being sniped. This is the basis of our identification strategy.
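For reference, a recursive bivariate probit of this kind is usually written as follows. The notation is my own shorthand rather than the paper's: $x$ stands for observed controls, $z$ for the time-zone instrument, and $\beta$, $\gamma$, $\delta$, $\pi$, $\rho$ are parameters.

$$
\begin{aligned}
y_1 &= \mathbf{1}\{x'\beta + \delta\, y_2 + \varepsilon_1 > 0\} \\
y_2 &= \mathbf{1}\{x'\gamma + \pi z + \varepsilon_2 > 0\} \\
(\varepsilon_1, \varepsilon_2) \mid x, z &\sim \mathcal{N}\!\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right)
\end{aligned}
$$

Here $\delta$ is the effect of being sniped on leaving, and $\rho \neq 0$ captures the selection into which auctions get sniped. The two equations are estimated jointly by maximum likelihood, with $z$ excluded from the $y_1$ equation.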

Reference:

Matt Backus, Tom Blake, Dimitriy V Masterov, and Steven Tadelis, "Is Sniping A Problem For Online Auction Markets?", Proceedings of the 24th International Conference on World Wide Web, 88-96.

Response to Question:

Here's a toy example illustrating the fundamental problem with selection, setting aside the bivariate probit stuff. The 250 circles below correspond to people with different levels of education and their potential wage offers. Suppose anyone who gets a wage offer of \$15 or less decides that he would rather go hiking instead of working, so we don't get to see his wage (orange circles). If you fit a linear model to the remaining data (navy circles), the slope will be 20% smaller than on the full sample, so the benefit of going to school will appear \$0.50 lower than it really is. One way to think about this is that people with large negative epsilons are more likely to be missing in the low education groups, which inflates the observed wages in those groups, tilts the regression line, and makes schooling look less effective.

Why does this matter? Let's focus on the fifth highest orange circle with 11 years of education, who may be right on the margin between dropping out of high school (we don't observe his costs here). His offer with 11 years of education is just under \$13. If he thinks the benefit of another year is \$2, he may leave school because he can just start hiking now and not incur the unnecessary cost. Since economists are interested in policy questions (like what would be the net benefit of college loan forgiveness), ignoring the people who went hiking (or aren't in the labor force) would be a poor choice. Using the wrong estimate could lead to some suboptimal investment in education, both individually and socially.

If your goal is to predict what the wage among workers with X years of education is, using the worker data would be OK. It is when you want to make causal statements about what would have happened to someone had they completed more school that you need to worry about selection.
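The flattening of the slope is easy to reproduce with simulated data. The sketch below does not use the data listed in this answer. It draws fresh data under assumptions of my own (a true return of \$2.50 per year of schooling, normal errors with standard deviation 10, and 5,000 people) and compares the least-squares slope on everyone with the slope on those whose offer exceeds \$15.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Assumed data-generating process (hypothetical, not the answer's data):
educ = rng.integers(8, 17, size=n)              # years of schooling, 8 to 16
wage = 2.5 * educ + rng.normal(0, 10, size=n)   # potential wage offers

# Selection rule from the toy example: offers of $15 or less are never observed.
works = wage > 15

# np.polyfit returns coefficients from highest degree down: [slope, intercept].
slope_full, _ = np.polyfit(educ, wage, 1)
slope_sel, _ = np.polyfit(educ[works], wage[works], 1)

print(f"share observed:            {works.mean():.2f}")
print(f"slope, full sample:        {slope_full:.2f}")
print(f"slope, workers only:       {slope_sel:.2f}")
```

The workers-only slope comes out smaller than the full-sample slope because the low-education groups lose more of their low-wage members, which is the mechanism described above.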

Data (x = years of education, y = potential wage offer in dollars):

x   y
10  36.77875
13  29.92348
12  17.84871
10  21.6781
12  10.68797
8   29.45379
12  12.50187
9   22.7946
11  20.16943
12  34.30902
11  34.29064
13  15.70758
12  11.20882
12  33.84629
10  29.01311
14  22.38047
12  54.72863
11  49.56858
15  24.02602
13  36.42536
12  22.71795
10  19.54785
14  38.4038
14  34.30227
10  19.37613
11  2.086503
14  26.4395
9   14.80535
14  26.08193
11  30.91514
13  32.0592
8   34.08197
12  28.76042
13  38.68304
13  47.95863
11  24.46299
14  30.65527
16  54.57944
10  13.10431
14  25.30962
9   32.48787
11  24.64828
12  25.96807
11  16.65392
12  36.22239
14  25.20041
12  17.36436
12  38.27636
11  24.94589
10  31.49921
13  25.5742
12  25.78094
15  45.46352
11  21.08684
13  12.91339
11  33.41261
14  25.76663
14  49.73616
11  22.67634
16  55.26606
12  33.48164
15  33.87222
11  16.43427
10  21.37041
13  29.18699
9   20.20561
10  44.55228
13  47.68126
11  27.97073
12  36.06765
12  35.84951
11  11.26081
15  36.36755
16  23.63187
13  41.6813
12  30.994
13  31.27638
15  38.53747
11  48.27272
12  25.59191
16  44.45938
14  45.71571
14  33.68782
10  33.39376
12  45.53596
13  27.69209
12  26.27091
11  33.25354
11  16.89751
13  29.82576
11  38.67755
12  37.91254
12  32.57379
15  44.98801
8   13.68349
14  37.57533
13  15.75075
13  26.164
11  22.16672
10  30.29593
13  28.82244
17  43.92926
10  3.793436
14  54.33921
11  30.367
15  33.27439
15  11.65642
11  24.98503
11  35.55489
12  12.33667
9   19.50787
9   29.07384
13  39.28975
10  18.6426
13  32.62035
11  39.59964
12  38.74402
13  29.84206
11  28.70477
15  27.76243
12  35.0229
9   10.48161
13  32.94176
13  32.26461
13  20.64163
12  28.0451
11  30.72115
12  36.0846
16  52.26955
14  49.25191
15  48.60603
11  43.55553
14  45.27725
8   21.87545
12  27.19747
14  26.53179
9   18.49253
11  8.361289
12  30.11271
14  34.79089
11  39.50394
11  28.82289
12  29.17985
14  29.04272
12  39.28788
16  49.37336
14  44.76436
13  26.92885
11  39.19863
10  26.79517
14  30.74782
15  36.4241
13  27.07322
8   6.324262
10  29.13576
12  20.5755
10  5.107344
13  29.96143
12  30.87452
11  39.13571
12  31.89077
9   10.86587
9   32.571
13  36.32444
12  22.86062
13  31.54544
14  28.71825
12  35.49273
12  18.8854
8   32.67909
13  46.02027
10  35.3645
10  38.09469
11  21.16621
13  37.33644
14  43.37357
11  16.6026
13  36.44201
13  31.3522
11  35.2338
11  26.09209
14  52.26656
10  30.28688
12  36.23479
10  20.65876
11  42.26967
12  33.70919
14  34.77951
12  22.55819
18  54.19498
10  30.39797
15  33.097
11  27.93202
15  32.32858
14  37.80832
15  53.04989
13  35.48972
12  36.6966
11  36.46019
16  39.18628
14  39.73043
12  29.34012
11  24.64899
10  29.82037
11  29.40948
15  27.87559
10  34.42297
12  46.14198
9   23.11894
13  32.52744
16  43.73385
12  61.2392
10  26.99522
13  31.45339
14  36.8537
10  26.05906
15  40.28445
10  31.6895
14  33.68509
11  28.2475
12  10.36757
11  39.6512
11  41.26844
9   16.99724
12  29.98223
13  41.83167
10  23.86976
11  17.47382
12  41.34653
11  21.89777
12  14.88964
10  8.632276
11  13.11433
11  1.543627
15  48.46469
14  25.4514
16  55.7856
10  30.88486
16  23.78511
14  46.50201
12  29.67507
14  31.63361
9   44.99561
10  36.27057
13  32.96661
13  14.41626
13  46.88454
15  40.52191
16  39.17714
12  33.70162
