Logistic Regression – Applying to a Biased Dataset

Tags: calibration, classification, labeling, linear model, logistic

I have collected a binary classification dataset in a somewhat biased way:

  • I have thousands of unlabeled samples.
  • A small percentage of these samples belong to the positive class.
  • I know for a fact that my regressors are strongly correlated with the target.
  • Using this fact, I can collect positive samples by picking samples whose regressor values are at the top end of the range.
  • This way I can collect enough positive samples, and randomly pick the rest as negative samples.
  • I can't afford to label many samples, yet I need enough positive samples to get a decent model.

However, I don't know the true percentage of positive samples. When fitting a logistic regression afterwards, I notice that my output is not well calibrated (i.e., I need to tune the decision threshold to get positive classifications), probably because I collected the dataset in such a biased way.

I suspect that if I weighted my labeled samples using a density estimate of the regressors over the whole dataset, this could have some sort of "importance sampling" effect, but I am not sure.

What can I do to improve model calibration?

Best Answer

One way to make this work is to have a well-defined set of sampling probabilities. At the simplest level, you have 'high-yield' observations with values of the regressors where you expect a positive result, and 'low-yield' observations where you don't. If you sample randomly from high-yield observations with probability $p_H$ (which could even be $p_H=1$) and randomly from low-yield observations with probability $p_L\ll p_H$, you'll get a sample that's enriched with positive outcomes. A weighted regression using $1/p_L$ as sampling weights for the low-yield observations and $1/p_H$ as sampling weights for the high-yield observations will reconstruct the true population relationships.

You can have more complicated setups with more than two groups: all you need is that the probabilities really are the sampling probabilities you used, and that the probability is non-zero for every individual in the population. Oversampling higher-yield individuals is a standard design approach in survey sampling.
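As a minimal sketch of that weighting scheme, assuming scikit-learn: the values of `p_H` and `p_L`, the stratum mask, and the `fit_weighted` helper are placeholders, and the probabilities must be the ones you actually used when drawing the labeled sample.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

p_H = 1.0   # assumed sampling probability for 'high-yield' observations
p_L = 0.05  # assumed sampling probability for 'low-yield' observations

def fit_weighted(X, y, is_high_yield):
    """X: regressors of the labeled sample; y: labels;
    is_high_yield: boolean mask marking each row's sampling stratum."""
    # Inverse-probability sampling weights: each labeled row stands in
    # for 1/p rows of its stratum in the full, unlabeled population.
    weights = np.where(is_high_yield, 1.0 / p_H, 1.0 / p_L)
    model = LogisticRegression()
    model.fit(X, y, sample_weight=weights)
    return model
```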

If your sampling depends only on regressors that are actually in the model, and your model is correctly specified, you don't even need the weights, because whether or not an observation is in the sample is independent of the outcome conditional on the regressors. That's quite a strong set of assumptions, though.
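To make that claim concrete, here's a toy simulation under assumed parameters: sampling depends only on $x$ and the logistic model is correctly specified, so the unweighted fit recovers the population coefficients even from a heavily enriched sample.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
# True population model: logit(P(y=1)) = -2.0 + 1.5 * x
y = rng.binomial(1, 1 / (1 + np.exp(-(-2.0 + 1.5 * x))))

# Sample depending only on x: keep all 'high-yield' rows (x > 1),
# and 5% of the rest -- the enriched design described above.
keep = rng.random(n) < np.where(x > 1.0, 1.0, 0.05)

# Large C to approximate an unregularized maximum-likelihood fit.
model = LogisticRegression(C=1e6).fit(x[keep, None], y[keep])
print(model.intercept_, model.coef_)  # close to -2.0 and 1.5
```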
