Logistic Regression – How to Cope with Missing Data in Logistic Regression

censoring, logistic, missing data, regression

I'm investigating optimal bidding in auctions, and am using logistic regression to predict the probability of winning an auction given a set of explanatory variables (e.g. the price I bid, number of competing bids etc).

One explanatory variable I want to use is the second highest price that was paid. However, by the design of the auction, I only observe the second highest price paid when I am the highest bidder (i.e. when I win the auction).

This missing data is a major issue: in my dataset I place the winning bid only ~20% of the time, so the second highest price is unknown for the remaining ~80% of auctions. Yet intuitively I don't want to drop this variable, as knowledge of the second highest bid seems extremely valuable in determining my chances of placing the winning bid.

Are there any standard methods for coping with this kind of missing data in logistic regression?

Best Answer

I am afraid you cannot expect to find a "canned" solution to your problem. Most methods for handling missing data assume "missing at random" or even "missing completely at random" (you can google those terms!). Your problem is clearly one of informative missingness: whether the second highest price is observed depends on whether you won, which is the very outcome you are trying to predict. You will therefore need to model the mechanism of missingness, and perhaps model the "second highest bid" as a response in its own right, given some covariates (which might include the winning bid).
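To make the missingness pattern concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the number of rivals, the uniform bid distribution and all variable names are hypothetical and not taken from the question's data. It only shows that, by construction, the variable is missing exactly when the auction is lost, and that the hidden values differ systematically from the observed ones.

```python
import random

rng = random.Random(0)
n_auctions, n_rivals = 10_000, 4  # hypothetical sizes, chosen for illustration

rows = []
for _ in range(n_auctions):
    my_bid = rng.random()                                 # assumed: bids uniform on [0, 1)
    rival_bids = [rng.random() for _ in range(n_rivals)]
    won = my_bid > max(rival_bids)
    second_highest = sorted([my_bid] + rival_bids)[-2]    # true value, always exists
    observed = second_highest if won else None            # revealed only on a win
    rows.append((my_bid, won, second_highest, observed))

win_rate = sum(won for _, won, _, _ in rows) / n_auctions

# Missingness coincides exactly with losing the auction.
missing_iff_lost = all((obs is None) == (not won) for _, won, _, obs in rows)

# Observed values always lie below my bid; hidden values never do.
observed_below_my_bid = all(s < b for b, won, s, _ in rows if won)
missing_at_least_my_bid = all(s >= b for b, won, s, _ in rows if not won)

print(f"win rate: {win_rate:.3f}")
print("missing exactly when lost:", missing_iff_lost)
print("observed second price < my bid:", observed_below_my_bid)
print("hidden second price >= my bid:", missing_at_least_my_bid)
```

Two consequences follow from this sketch. Any feature that encodes whether the second highest price is present predicts the outcome perfectly, so it cannot be used as an ordinary regressor. And because no losing auction ever reveals the value, there are no observed cases from which to impute it for the losing rows.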

From there you can try to build a custom model. You can google for "informative missingness" to get some ideas.
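One possible shape for such a custom model, offered only as a sketch under stated assumptions, is a censored-data likelihood. The notation here is introduced for illustration and is not from the question: let $b_i$ be your bid in auction $i$, $x_i$ the other covariates, and $c_i$ the highest competing bid, which equals the second highest price whenever you win. Assume $c_i$ given $x_i$ has density $f(\cdot \mid x_i; \theta)$ with distribution function $F$, and that $b_i$ carries no further information about $c_i$ once $x_i$ is known. A win reveals $c_i$ exactly, while a loss reveals only that $c_i \ge b_i$:

```latex
L(\theta) \;=\; \prod_{i \,:\, \text{won}} f(c_i \mid x_i; \theta)
          \;\prod_{i \,:\, \text{lost}} \bigl[\, 1 - F(b_i \mid x_i; \theta) \,\bigr],
\qquad
\Pr(\text{win}_i \mid b_i, x_i) \;=\; F(b_i \mid x_i; \theta).
```

Under these assumptions the win probability falls out of the fitted distribution of the competing bid, rather than from a logistic regression that takes the partially missing variable as a regressor. Whether this formulation suits a particular auction design is something to check against how bids are actually revealed.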
