Solved – how to handle (many) false positives in training dataset for logistic regression classifier

classification, logistic, machine learning, statistical-learning, train

I want to train a logistic regression classifier. I have a fairly large training set (>100,000 observations) and around 10 features I can train on. Half of my training data is labeled negative, and I am confident that almost all of these observations are true negatives. In the positive half, however, I know that half, or maybe even more than half, of the observations are false positives. I just don't know which ones.

How can I handle such a problem? Does anyone have good tips or pointers to the literature?
Thanks in advance

Best Answer

This is a problem called "label noise" in the machine learning literature. There is a nice paper by Bootkrajang and Kaban, called "Learning Kernel Logistic Regression in the Presence of Class Label Noise" (http://www.cs.bham.ac.uk/~axk/Jo_Patrec.pdf) that is probably a good place to start.
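To make the idea concrete, here is a minimal sketch of a noise-aware likelihood for the situation in the question, where only true negatives get mislabeled as positive. It is not the kernel method from the paper, just a plain linear logistic model. It assumes the contamination rate `gamma` (the probability that a true negative is labeled positive) is known or estimated separately; the function name, the simulated data, and the value `gamma = 0.3` are all made up for illustration.

```python
import numpy as np

def fit_one_sided_noise_logreg(X, y_obs, gamma, lr=0.5, n_iter=3000):
    """Gradient ascent on the likelihood of a logistic model whose true
    negatives are labeled positive with (assumed known) probability gamma.

    Observed-label model: P(y_obs = 1 | x) = gamma + (1 - gamma) * sigmoid(x @ beta).
    With gamma = 0 this is ordinary logistic regression.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))            # P(true label = 1 | x)
        q = np.clip(gamma + (1.0 - gamma) * p, 1e-12, 1.0)  # P(observed label = 1 | x)
        # Gradient of the mean log-likelihood; reduces to X.T @ (y - p) / n when gamma = 0.
        grad = X.T @ ((y_obs - q) * p / q) / len(y_obs)
        beta += lr * grad
    return beta

# Simulated data (illustrative assumption): intercept plus two features.
rng = np.random.default_rng(0)
n = 20000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([-1.0, 1.5, -1.0])
y_true = rng.random(n) < 1.0 / (1.0 + np.exp(-X @ beta_true))
flipped = (~y_true) & (rng.random(n) < 0.3)   # true negatives mislabeled as positive
y_obs = (y_true | flipped).astype(float)

beta_naive = fit_one_sided_noise_logreg(X, y_obs, gamma=0.0)  # ignores the noise
beta_aware = fit_one_sided_noise_logreg(X, y_obs, gamma=0.3)  # models the noise
err_naive = np.linalg.norm(beta_naive - beta_true)
err_aware = np.linalg.norm(beta_aware - beta_true)
print("naive error:", err_naive, "noise-aware error:", err_aware)
```

In practice `gamma` is usually unknown. It can be treated as an extra parameter to estimate, although it may be only weakly identified, so a sensitivity check over a range of plausible values is a reasonable starting point.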

Related Solutions

Solved – Binary classifier – dividing dataset into training and evaluation sets

Classifiers usually try to find the best fit for all the data. Under imbalance, where you have many more negative than positive samples, the classifier will pay more attention to the negative class in order to obtain a small overall error. Imbalance can be intrinsic or extrinsic: intrinsic imbalances are a direct result of the nature of the data space (e.g. rare diseases), whereas extrinsic imbalances result from practical limitations (time, space, money, etc.) even though the data space itself is not imbalanced. In addition, it can happen that only one of the training and testing data sets is imbalanced. Personally, I would start with stratified cross-validation, which ensures that the ratio between the positive and negative class is the same in each fold as in the overall data set.
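As a rough illustration of stratification, the sketch below assigns samples to folds one class at a time, so every fold keeps the overall class ratio. The helper `stratified_folds` and the toy labels are hypothetical; libraries such as scikit-learn ship a ready-made `StratifiedKFold` that does this for you.

```python
import numpy as np

def stratified_folds(y, n_folds=5, seed=0):
    """Return a fold index (0..n_folds-1) for every sample, dealing the
    shuffled samples of each class round-robin across the folds."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    fold = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        fold[idx] = np.arange(len(idx)) % n_folds
    return fold

y = np.array([1] * 20 + [0] * 80)   # toy labels: 20% positives
fold = stratified_folds(y, n_folds=5)
for k in range(5):
    test_mask = fold == k
    # Each held-out fold has the same share of positives as the full data set.
    print(k, int(test_mask.sum()), float(y[test_mask].mean()))
```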

There are several methods that address the imbalance itself. A simple one is to increase the weight of samples from the positive class relative to the negative class, which makes the classifier cost-sensitive. An introduction to the available methods can be found in

He, H., & Garcia, E. A. (2009). Learning from Imbalanced Data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263-1284.
Guo, X., Yin, Y., Dong, C., Yang, G., & Zhou, G. (2008). On the Class Imbalance Problem. 2008 Fourth International Conference on Natural Computation (pp. 192-201).
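The class-weighting idea can be sketched as follows. The weights use the common "balanced" heuristic `n_samples / (n_classes * class_count)`, which is one choice among many and is an assumption here, as are the function names and the toy labels. The point is only that a weighted loss penalizes errors on the rare class more heavily.

```python
import numpy as np

def balanced_sample_weights(y):
    """Per-sample weights so that every class carries the same total weight."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    class_weight = len(y) / (len(classes) * counts)
    lookup = dict(zip(classes.tolist(), class_weight.tolist()))
    return np.array([lookup[label] for label in y.tolist()])

def weighted_log_loss(y, p, w):
    """Weighted average of the logistic (cross-entropy) loss."""
    y = np.asarray(y, dtype=float)
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    losses = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    return float(np.sum(w * losses) / np.sum(w))

y = np.array([0] * 90 + [1] * 10)      # toy labels: 10% positives
w = balanced_sample_weights(y)
p_lazy = np.full(len(y), 0.1)          # a model that always predicts the base rate
loss_unweighted = weighted_log_loss(y, p_lazy, np.ones(len(y)))
loss_weighted = weighted_log_loss(y, p_lazy, w)
print(loss_unweighted, loss_weighted)
```

A model that simply predicts the base rate looks acceptable under the unweighted loss but is penalized much more under the weighted one, which is what pushes a cost-sensitive classifier toward the positive class.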

Logistic Regression – Maximizing True Positives Minus False Positives

You don't seem to want logistic regression at all. What you say is "I would like to maximize the difference between true positives and false positives." That is a fine objective function, but it is not logistic regression. Let's see what it is.

First, some notation. The dependent variable is going to be $Y_i$:
\begin{align} Y_i &= \left\{ \begin{array}{l} 1 \qquad \textrm{Purchase $i$ was profitable}\\ 0 \qquad \textrm{Purchase $i$ was un-profitable} \end{array} \right. \end{align}

The independent variables (the stuff you use to try to predict whether you should buy) are going to be $X_i$ (a vector). The parameter you are trying to estimate is going to be $\beta$ (a vector). For observation $i$, you predict buy when $X_i\beta>0$, that is, when the indicator function $\mathbf{1}_{X_i\beta>0}=1$.

A true positive happens on observation $i$ when both $Y_i=1$ and $\mathbf{1}_{X_i\beta>0}=1$. A false positive on observation $i$ happens when $Y_i=0$ and $\mathbf{1}_{X_i\beta>0}=1$. You wish to find the $\beta$ which maximizes true positives minus false positives, or: \begin{equation} \max_\beta \; \sum_{i=1}^N Y_i\cdot\mathbf{1}_{X_i\beta>0} - \sum_{i=1}^N (1-Y_i)\cdot\mathbf{1}_{X_i\beta>0} \end{equation}

This is not an especially familiar objective function for estimating a discrete response model, but bear with me while I do a little algebra on the objective function: \begin{align} &\sum_{i=1}^N Y_i\cdot\mathbf{1}_{X_i\beta>0} - \sum_{i=1}^N (1-Y_i)\cdot\mathbf{1}_{X_i\beta>0}\\ = &\sum_{i=1}^N Y_i\cdot\mathbf{1}_{X_i\beta>0} - \sum_{i=1}^N \mathbf{1}_{X_i\beta>0} + \sum_{i=1}^N Y_i\cdot\mathbf{1}_{X_i\beta>0}\\ = &\sum_{i=1}^N Y_i\cdot\mathbf{1}_{X_i\beta>0} - \sum_{i=1}^N \mathbf{1}_{X_i\beta>0} + \sum_{i=1}^N Y_i\cdot\mathbf{1}_{X_i\beta>0} \\ & \qquad + \sum_{i=1}^N 1 - \sum_{i=1}^N 1 + \sum_{i=1}^N Y_i - \sum_{i=1}^N Y_i\\ = &\sum_{i=1}^N Y_i\cdot\mathbf{1}_{X_i\beta>0} + \sum_{i=1}^N (1-Y_i)(1-\mathbf{1}_{X_i\beta>0}) - \sum_{i=1}^N 1 + \sum_{i=1}^N Y_i \\ \end{align}

OK, now notice that the last two terms in that sum are not functions of $\beta$, so we can ignore them in the maximization. Finally, we have just shown that the problem you want to solve, "maximize the difference between true positives and false positives" is the same as this problem: \begin{equation} \max_\beta \; \sum_{i=1}^N Y_i\cdot\mathbf{1}_{X_i\beta>0} + \sum_{i=1}^N (1-Y_i)(1-\mathbf{1}_{X_i\beta>0}) \end{equation}

Now, that estimator has a name! It is named the maximum score estimator. It is a very intuitive way to estimate the parameter of a discrete response model. The parameter is chosen so as to maximize the number of correct predictions. The first term is the number of true positives, and the second term is the number of true negatives.
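The equivalence is easy to check numerically. By the algebra above, true positives minus false positives equals the number of correct predictions minus the number of negatives ($N - \sum_i Y_i$), and the latter does not depend on $\beta$. The sketch below verifies this on random data; the function names and the simulated data are illustrative assumptions.

```python
import numpy as np

def predict_buy(X, beta):
    """Indicator 1{X_i beta > 0} for every observation."""
    return (X @ beta > 0).astype(int)

def tp_minus_fp(X, y, beta):
    """True positives minus false positives for the rule 1{X beta > 0}."""
    buy = predict_buy(X, beta)
    return int(np.sum(y * buy) - np.sum((1 - y) * buy))

def n_correct(X, y, beta):
    """Maximum score objective: true positives plus true negatives."""
    buy = predict_buy(X, beta)
    return int(np.sum(y * buy) + np.sum((1 - y) * (1 - buy)))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = rng.integers(0, 2, size=200)
n_negatives = int(np.sum(1 - y))       # N - sum(Y), constant in beta

# Compare both objectives for a handful of arbitrary parameter vectors.
pairs = []
for _ in range(5):
    beta = rng.normal(size=3)
    pairs.append((tp_minus_fp(X, y, beta), n_correct(X, y, beta) - n_negatives))
print(pairs)   # the two entries of every pair coincide
```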

This is a pretty good way to estimate a (binary) discrete response model. The estimator is consistent, for example (Manski, 1985, J of Econometrics). There are some oddities to this estimator, though. First, it is not unique in small samples: once you have found one $\beta$ which solves the maximization, any other $\beta$ which makes the exact same predictions in your dataset also solves it, so there are infinitely many $\beta$s close to the one you found. Also, the estimator is not asymptotically normal, and it converges more slowly than typical maximum likelihood estimators, at a cube-root-$N$ rate instead of root-$N$ (Kim and Pollard, 1990, Ann of Stat). Finally, you can't use the standard bootstrap to do inference on it (Abrevaya & Huang, 2005, Econometrica). There are some papers using this estimator, though. A fun one, by Caudill, is about predicting results in the NCAA basketball tournament (International Journal of Forecasting, April 2003, v. 19, iss. 2, pp. 313-17).
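The non-uniqueness is easy to see in code. Rescaling $\beta$ by any positive constant leaves every prediction $\mathbf{1}_{X_i\beta>0}$ unchanged, so the score is identical. This is a small sketch with simulated data and hypothetical names. In practice the scale is usually pinned down by a normalization, for example by fixing the norm of $\beta$ or one coefficient.

```python
import numpy as np

def n_correct(X, y, beta):
    """Number of correct predictions for the rule 1{X beta > 0}."""
    buy = (X @ beta > 0).astype(int)
    return int(np.sum(y * buy) + np.sum((1 - y) * (1 - buy)))

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 2))
y = (X @ np.array([1.0, -1.0]) + rng.normal(size=50) > 0).astype(int)

beta = np.array([1.0, -1.0])
# Positive rescalings of beta give exactly the same predictions and score.
scores = [n_correct(X, y, c * beta) for c in (0.1, 1.0, 7.5)]
print(scores)
```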

An estimator that overcomes most of these problems is Horowitz's smoothed maximum score estimator (Horowitz, 1992, Econometrica and Horowitz, 2002, J of Econometrics). It gives a root-$N$ consistent, asymptotically normal, unique estimator which is amenable to bootstrapping. Horowitz provides example code to implement his estimator on his webpage.
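To give a feel for the smoothing idea, the sketch below replaces the indicator $\mathbf{1}_{X_i\beta>0}$ with a smooth function $K(X_i\beta/h)$ and averages $(2Y_i-1)K(X_i\beta/h)$. It is not Horowitz's code: the logistic CDF used for $K$, the bandwidth values, and the function names are assumptions, and the kernel and bandwidth conditions needed for the stated asymptotics are not reproduced here. As $h \to 0$ the smoothed objective approaches (true positives minus false positives) divided by $N$.

```python
import numpy as np

def smoothed_score(X, y, beta, h):
    """Mean of (2*y - 1) * K(x @ beta / h), with K the logistic CDF
    (an assumed choice), written via tanh for numerical stability."""
    K = 0.5 * (1.0 + np.tanh(X @ beta / (2.0 * h)))
    return float(np.mean((2 * y - 1) * K))

def hard_score(X, y, beta):
    """Unsmoothed version: (TP - FP) / N for the rule 1{X beta > 0}."""
    return float(np.mean((2 * y - 1) * (X @ beta > 0)))

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
y = (X @ np.array([1.0, -0.5, 0.25]) + rng.logistic(size=200) > 0).astype(int)
beta = np.array([1.0, -0.5, 0.25])

for h in (1.0, 0.1, 1e-8):
    # The smoothed objective tends to the hard one as the bandwidth shrinks.
    print(h, smoothed_score(X, y, beta, h), hard_score(X, y, beta))
```

Because the smoothed objective is differentiable in $\beta$, it can be handed to a standard gradient-based optimizer, which is not possible with the step-function objective.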
