Solved – how to handle (many) false positives in training dataset for logistic regression classifier

Tags: classification, logistic, machine learning, statistical-learning, train

I want to train a logistic regression classifier. I have a fairly large training set (> 100,000 observations) with around 10 features to train on. Half of my training data is labeled negative, and I know for sure that almost all of these observations are true negatives. But I also know that in my positive set half, or maybe even more than half, are false positives. I just don't know which ones.

How can I handle such a problem? Does anyone have good tips or pointers to the literature?
Thanks in advance

Best Answer

This is a problem called "label noise" in the machine learning literature. There is a nice paper by Bootkrajang and Kaban, called "Learning Kernel Logistic Regression in the Presence of Class Label Noise" (http://www.cs.bham.ac.uk/~axk/Jo_Patrec.pdf) that is probably a good place to start.
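
To make the idea concrete, here is a minimal sketch of a noise-aware logistic regression for the situation in the question. It is not the algorithm from the paper above (which uses kernel logistic regression and estimates the flip probabilities); it is a simplified illustration under two assumptions of mine: the noise is one-sided (true negatives are mislabeled as positive with probability flip_rate, true positives are never mislabeled), and flip_rate is known or guessed in advance. Under those assumptions P(observed = 1 | x) = flip_rate + (1 - flip_rate) * sigmoid(w.x + b), and the model is fit by maximizing that likelihood. All function names, the synthetic data, and the parameter values are hypothetical, and plain Python is used only to keep the example dependency-free.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def observed_positive_prob(p_true, flip_rate):
    """P(observed label = 1 | x) under one-sided noise (an assumption):
    true negatives are flipped to positive with probability flip_rate,
    true positives are never flipped."""
    return flip_rate + (1.0 - flip_rate) * p_true


def neg_log_likelihood(X, y, w, b, flip_rate, eps=1e-12):
    """Mean negative log-likelihood of the observed (noisy) labels."""
    total = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
        q = min(max(observed_positive_prob(p, flip_rate), eps), 1.0 - eps)
        total -= yi * math.log(q) + (1 - yi) * math.log(1.0 - q)
    return total / len(X)


def fit(X, y, flip_rate, lr=0.2, n_iter=200, eps=1e-12):
    """Full-batch gradient descent with flip_rate held fixed."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            q = max(observed_positive_prob(p, flip_rate), eps)
            # Derivative of the per-sample loss w.r.t. the logit;
            # it reduces to the usual (p - y) when flip_rate == 0.
            g = (q - yi) * p / q
            grad_b += g
            for j in range(d):
                grad_w[j] += g * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def make_noisy_data(n, flip_rate, seed=0):
    """Synthetic 1-feature data (hypothetical): the true labels follow a
    logistic model, then true negatives are flipped to positive."""
    rng = random.Random(seed)
    X, y_obs = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        y_true = 1 if rng.random() < sigmoid(2.0 * x - 1.5) else 0
        flipped = y_true == 0 and rng.random() < flip_rate
        X.append([x])
        y_obs.append(1 if flipped else y_true)
    return X, y_obs


if __name__ == "__main__":
    X, y = make_noisy_data(400, flip_rate=0.3, seed=0)
    naive = fit(X, y, flip_rate=0.0)        # ordinary logistic regression
    noise_aware = fit(X, y, flip_rate=0.3)  # accounts for false positives
    print("naive (w, b):      ", naive)
    print("noise-aware (w, b):", noise_aware)
```

Setting flip_rate to 0 recovers ordinary logistic regression, so the two fits in the usage example can be compared directly. In a real problem the flip rate is not known exactly; one option is to treat it as an extra parameter to estimate, which is closer in spirit to the approach in the paper.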
