Solved – Is dummy classifier precision always 0.5, even on unbalanced datasets?

classification, precision-recall, roc

I am trying to build a classifier that predicts whether an item is a "buy" (positive) or "not a buy" (negative). My dataset is ~65% positive examples and ~35% negative examples. It is more important for my model to keep false positives down (at the cost of reducing true positives). Let's say that the acceptable level of false positives is 5%. I am OK with my model missing a lot of "buy" opportunities, but when it recommends a buy, I want to have a high degree of confidence (>95%) that it actually is a buy.

I train my classifier, draw a ROC curve, and select the threshold such that the false positive rate is below 5% (which gives me a true positive rate of 35%).

Now, am I correct in saying that a dummy classifier (i.e. one that ignores the features when making its prediction and only looks at the distribution of positive and negative examples) will never be able to achieve a true positive rate better than 5% if the false positive rate is to be kept below 5%? That is, its precision can never be better than 0.5, regardless of whether the dataset is balanced or unbalanced?

Best Answer

A naive classifier will classify each case as "positive" at random, with a constant probability $q$. For instance, it could classify everything as "positive" ($q=1$), or classify everything as "negative" ($q=0$), or anything in between.

Let's call your true prevalence of positives $p$. In your example, $p=0.65$. If we now classify using the naive classifier as above, we will get, as proportions of all cases:

  • $TP=qp$ true positives
  • $TN=(1-q)(1-p)$ true negatives
  • $FP=q(1-p)$ false positives
  • $FN=(1-q)p$ false negatives

For instance, $TP=qp$ because a proportion $p$ of all cases are actual positives and, since the classifier ignores the features, it labels a proportion $q$ of these as "positive". The others are calculated similarly.

The precision then is

$$\text{Precision}=\frac{TP}{TP+FP}=\frac{qp}{qp+q(1-p)}=\frac{qp}{q}=p,$$

i.e., the true positive prevalence. In your case, any naive classifier will have a precision of $p=0.65$, more than $0.5$.
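As a quick numerical check of this result, here is the arithmetic for your prevalence $p=0.65$, with $q=0.05$ picked purely for illustration (any other $q>0$ gives the same precision):

$$TP = 0.05\times 0.65 = 0.0325,\qquad FP = 0.05\times 0.35 = 0.0175,\qquad \text{Precision}=\frac{0.0325}{0.0325+0.0175}=\frac{0.0325}{0.05}=0.65.$$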

Cancelling $q$ in the precision formula requires $q>0$ (with $q=0$ nothing is classified as positive, and the precision is the undefined ratio $0/0$). Happily enough, the false positive rate under this naive classifier is

$$ FPR = \frac{FP}{N}=\frac{q(1-p)}{1-p}=q, $$

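where $N=1-p$ is the proportion of actual negatives, so choosing any $q\leq 0.05$ will satisfy your criterion of no more than 5% false positive rate. Incidentally, the true positive rate, with $P=p$ the proportion of actual positives, is also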

$$ TPR = \frac{TP}{P}=\frac{qp}{p}=q. $$
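If you would rather check these three results empirically, a small simulation is one way to do it. The sketch below is only an illustration: the function name `simulate_naive_classifier`, the sample size, and the seed are my own choices, and it assumes that labels and predictions are drawn independently, as in the derivation above.

```python
import random


def simulate_naive_classifier(p, q, n=200_000, seed=0):
    """Simulate a classifier that says "positive" with probability q,
    independently of the true label, on data with positive prevalence p.

    Hypothetical helper for illustration; returns empirical precision,
    false positive rate and true positive rate.
    """
    rng = random.Random(seed)
    tp = fp = fn = tn = 0
    for _ in range(n):
        actual = rng.random() < p      # true label: positive with probability p
        predicted = rng.random() < q   # prediction ignores the label entirely
        if predicted and actual:
            tp += 1
        elif predicted and not actual:
            fp += 1
        elif actual:
            fn += 1
        else:
            tn += 1

    def ratio(num, den):
        # Undefined (0/0) cases, e.g. precision when q = 0, become NaN.
        return num / den if den > 0 else float("nan")

    return {
        "precision": ratio(tp, tp + fp),
        "fpr": ratio(fp, fp + tn),
        "tpr": ratio(tp, tp + fn),
    }


if __name__ == "__main__":
    # Prevalence from the question; q = 0.05 is an illustrative choice.
    print(simulate_naive_classifier(p=0.65, q=0.05))
```

With $p=0.65$ and $q=0.05$, the empirical precision should land near $p$, and both the false positive rate and the true positive rate should land near $q$, up to sampling noise.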

You may be interested in this: Why is accuracy not the best measure for assessing classification models?
