Logistic Regression – Maximizing True Positives Minus False Positives

Tags: classification, glmnet, logistic, r, regression

I have a logistic regression model (fit via glmnet in R with elastic net regularization), and I would like to maximize the difference between true positives and false positives. In order to do this, the following procedure came to mind:

  1. Fit standard logistic regression model
  2. Using a prediction threshold of 0.5, identify all positive predictions
  3. Assign weight 1 to positively predicted observations and 0 to all others
  4. Fit weighted logistic regression model

What would be the flaws with this approach? What would be the correct way to proceed with this problem?

The reason for wanting to maximize the difference between the number of true positives and false positives comes from the design of my application. As part of a class project, I am building an autonomous participant in an online marketplace – if my model predicts it can buy something and sell it later at a higher price, it places a bid. I would like to stick to logistic regression and output binary outcomes (win, lose) based on fixed costs and unit price increments (I gain or lose the same amount on every transaction). A false positive hurts me because it means I buy something and am unable to sell it for a higher price. However, a false negative doesn't hurt me (except as opportunity cost): it just means I didn't buy something that would have made money had I bought it. Similarly, a true positive benefits me because I buy and then sell for a higher price, but a true negative doesn't benefit me because I didn't take any action.
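To make that payoff structure explicit (a small restatement of the paragraph above, where the per-transaction gain/loss $c$ is introduced purely for illustration and is not from the original post): since true negatives and false negatives contribute nothing, total profit depends only on the counts of true and false positives,
\begin{equation} \textrm{Profit} = c\cdot\textrm{TP} - c\cdot\textrm{FP} = c\,(\textrm{TP}-\textrm{FP}), \end{equation}
which is why maximizing TP minus FP is the natural objective here.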

I agree that the 0.5 cut-off is completely arbitrary, and when I optimized the model from step 1 on the prediction threshold which yields the highest difference between true/false positives, it turns out to be closer to 0.4. I think this is due to the skewed nature of my data – the ratio between negatives and positives is about 1:3.
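For concreteness, that kind of threshold search can be done with a simple sweep over candidate cut-offs. The sketch below is only an illustration: it assumes a fitted cv.glmnet object `fit` and a held-out validation matrix `x_val` with 0/1 labels `y_val` (all placeholder names, not from the original post).

```r
library(glmnet)

# Placeholder objects: `fit` is a cv.glmnet model fitted on training data;
# `x_val` / `y_val` are a held-out design matrix and 0/1 labels.
p_hat <- as.numeric(predict(fit, newx = x_val, s = "lambda.min", type = "response"))

# Sweep candidate thresholds and keep the one maximizing TP - FP
thresholds <- seq(0.05, 0.95, by = 0.01)
net_gain <- sapply(thresholds, function(t) {
  pred_pos <- p_hat > t
  sum(pred_pos & y_val == 1) - sum(pred_pos & y_val == 0)
})
best_threshold <- thresholds[which.max(net_gain)]
```

Choosing the threshold on a held-out set rather than on the training data avoids tailoring the cut-off to in-sample noise.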

Right now, I am following these steps:

  1. Split data into training/test
  2. Fit model on training, make predictions on the test set and compute the difference between true/false positives
  3. Fit model on the full data set, make predictions on the test set and compute the difference between true/false positives

The difference between true/false positives is smaller in step #3 than in step #2, despite the training set being a subset of the full set. Since I don't care whether the model in #3 has more true negatives and fewer false negatives, is there anything I can do without altering the likelihood function itself?
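For reference, steps 1 and 2 can be written out directly. This is a minimal sketch assuming `x` is the full design matrix and `y` the 0/1 response (placeholder names), with an arbitrary elastic-net mixing value `alpha = 0.5`.

```r
library(glmnet)

# Placeholder objects: `x` is the full design matrix, `y` the 0/1 response.
set.seed(1)
train <- sample(nrow(x), floor(0.7 * nrow(x)))

# Steps 1-2: fit on the training split, then compute TP - FP on the held-out test split
fit_train <- cv.glmnet(x[train, ], y[train], family = "binomial", alpha = 0.5)
p_test    <- as.numeric(predict(fit_train, newx = x[-train, ],
                                s = "lambda.min", type = "response"))
pred_pos  <- p_test > 0.5
tp_minus_fp <- sum(pred_pos & y[-train] == 1) - sum(pred_pos & y[-train] == 0)
```

The sketch stops at step 2; in step 3 the test observations are also part of the fitting data, so its TP - FP figure is not an out-of-sample quantity in the same sense as the one from step 2.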

Best Answer

You don't seem to want logistic regression at all. What you say is "I would like to maximize the difference between true positives and false positives." That is a fine objective function, but it is not logistic regression. Let's see what it is.

First, some notation. The dependent variable is going to be $Y_i$:
\begin{align} Y_i &= \left\{ \begin{array}{l} 1 \qquad \textrm{Purchase $i$ was profitable}\\ 0 \qquad \textrm{Purchase $i$ was unprofitable} \end{array} \right. \end{align}

The independent variables (the stuff you use to try to predict whether you should buy) are going to be $X_i$ (a vector). The parameter you are trying to estimate is going to be $\beta$ (a vector). For observation $i$, you predict buy when $X_i\beta>0$, that is, when the indicator function $\mathbf{1}_{X_i\beta>0}=1$.

A true positive happens on observation $i$ when both $Y_i=1$ and $\mathbf{1}_{X_i\beta>0}=1$. A false positive on observation $i$ happens when $Y_i=0$ and $\mathbf{1}_{X_i\beta>0}=1$. You wish to find the $\beta$ which maximizes true positives minus false positives, or: \begin{equation} \max_\beta \; \sum_{i=1}^N Y_i\cdot\mathbf{1}_{X_i\beta>0} - \sum_{i=1}^N (1-Y_i)\cdot\mathbf{1}_{X_i\beta>0} \end{equation}

This is not an especially familiar objective function for estimating a discrete response model, but bear with me while I do a little algebra on the objective function: \begin{align} &\sum_{i=1}^N Y_i\cdot\mathbf{1}_{X_i\beta>0} - \sum_{i=1}^N (1-Y_i)\cdot\mathbf{1}_{X_i\beta>0}\\ = &\sum_{i=1}^N Y_i\cdot\mathbf{1}_{X_i\beta>0} - \sum_{i=1}^N \mathbf{1}_{X_i\beta>0} + \sum_{i=1}^N Y_i\cdot\mathbf{1}_{X_i\beta>0}\\ = &\sum_{i=1}^N Y_i\cdot\mathbf{1}_{X_i\beta>0} - \sum_{i=1}^N \mathbf{1}_{X_i\beta>0} + \sum_{i=1}^N Y_i\cdot\mathbf{1}_{X_i\beta>0} \\ & \qquad + \sum_{i=1}^N 1 - \sum_{i=1}^N 1 + \sum_{i=1}^N Y_i - \sum_{i=1}^N Y_i\\ = &\sum_{i=1}^N Y_i\cdot\mathbf{1}_{X_i\beta>0} + \sum_{i=1}^N (1-Y_i)(1-\mathbf{1}_{X_i\beta>0}) - \sum_{i=1}^N 1 + \sum_{i=1}^N Y_i \\ \end{align}

OK, now notice that the last two terms in that sum are not functions of $\beta$, so we can ignore them in the maximization. Finally, we have just shown that the problem you want to solve, "maximize the difference between true positives and false positives," is the same as this problem: \begin{equation} \max_\beta \; \sum_{i=1}^N Y_i\cdot\mathbf{1}_{X_i\beta>0} + \sum_{i=1}^N (1-Y_i)(1-\mathbf{1}_{X_i\beta>0}) \end{equation}

Now, that estimator has a name! It is named the maximum score estimator. It is a very intuitive way to estimate the parameter of a discrete response model. The parameter is chosen so as to maximize the number of correct predictions. The first term is the number of true positives, and the second term is the number of true negatives.
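As a rough illustration only (not part of the original answer), the maximum score objective can be written down and maximized by brute force for small problems. The function names below are hypothetical; since the scale of $\beta$ is not identified by this objective, candidate directions are drawn on the unit sphere.

```r
# Maximum score objective: number of correct predictions
# (true positives plus true negatives) for a given beta.
score <- function(beta, X, y) {
  pred <- as.numeric(X %*% beta > 0)
  sum(y * pred) + sum((1 - y) * (1 - pred))
}

# Crude random-direction search, purely illustrative
max_score_fit <- function(X, y, n_draws = 5000) {
  k <- ncol(X)
  best <- list(beta = rep(NA_real_, k), value = -Inf)
  for (i in seq_len(n_draws)) {
    b <- rnorm(k)
    b <- b / sqrt(sum(b^2))          # normalize: only the direction of beta matters
    v <- score(b, X, y)
    if (v > best$value) best <- list(beta = b, value = v)
  }
  best
}
```

The objective is a step function of $\beta$ (piecewise constant), so gradient-based optimizers do not apply directly; that is why a crude search is used in this sketch.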

This is a pretty good way to estimate a (binary) discrete response model. The estimator is consistent, for example (Manski, 1985, Journal of Econometrics). There are some oddities to this estimator, though. First, it is not unique in small samples. Once you have found one $\beta$ which solves the maximization, any other $\beta$ which makes the exact same predictions in your dataset will also solve the maximization---so there are infinitely many $\beta$s close to the one you found. Also, the estimator is not asymptotically normal, and it converges slower than typical maximum likelihood estimators---cube root $N$ instead of root $N$ convergence (Kim and Pollard, 1990, Annals of Statistics). Finally, you can't use bootstrapping to do inference on it (Abrevaya and Huang, 2005, Econometrica). There are some papers using this estimator, though---there is a fun one about predicting results in the NCAA basketball tournament (Caudill, 2003, International Journal of Forecasting 19(2), 313-317).

An estimator that overcomes most of these problems is Horowitz's smoothed maximum score estimator (Horowitz, 1992, Econometrica and Horowitz, 2002, Journal of Econometrics). It gives a consistent, asymptotically normal, unique estimator which converges faster than the cube-root-$N$ rate of the unsmoothed version (approaching root $N$ under smoothness assumptions) and is amenable to bootstrapping. Horowitz provides example code to implement his estimator on his webpage.
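Horowitz's own code is not reproduced here, but the smoothing idea can be sketched in a few lines. This is only an illustration under stated assumptions: a standard normal CDF as the smoothing kernel, a fixed bandwidth `h`, and the usual scale normalization of fixing one coefficient, here to +1 for simplicity.

```r
# Smoothed maximum score objective in the spirit of Horowitz (1992):
# the indicator 1{X beta > 0} is replaced by a smooth CDF with bandwidth h.
smoothed_score <- function(beta, X, y, h = 0.1) {
  s <- pnorm(as.numeric(X %*% beta) / h)
  sum(y * s + (1 - y) * (1 - s))
}

# Maximize with a general-purpose optimizer; the first coefficient is fixed
# to +1 as a scale normalization (an assumption of this sketch).
fit_smoothed_ms <- function(X, y, h = 0.1) {
  obj <- function(b_free) -smoothed_score(c(1, b_free), X, y, h)
  opt <- optim(rep(0, ncol(X) - 1), obj, method = "BFGS")
  c(1, opt$par)
}
```

In practice the bandwidth would shrink with the sample size and both signs of the normalized coefficient would be tried; Horowitz's published code handles these details properly.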
