Solved – My metric is 0.65*accuracy + 0.35*recall. How do I convert that to a continuous loss function?

information-theory, logistic, loss-functions, precision-recall, regression

I have some data (14000 examples, with 1000 features, very sparse). My goal is to predict 50 binary values for each example, and there are usually multiple positive values. I want to train a logistic regression and a feedforward neural network, but to do so I need a smooth loss function – meaning a function with a continuous derivative. Which function's optima coincide with the optima of this metric? And what if the metric were, e.g., 0.3*precision + 0.5*recall + 0.2*accuracy? Or the F1 score?

My current solution is to use binary cross-entropy, and then select the probability threshold that maximizes the given metric. That feels very "hacky", though.
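For concreteness, the thresholding step can be sketched as follows: given held-out true labels and predicted probabilities (from any probabilistic classifier), sweep candidate thresholds and keep the one that maximizes the weighted metric. This is a minimal illustration in plain Python; the function names and the grid of candidate thresholds are my own choices, not anything prescribed.

```python
def weighted_metric(y_true, y_pred, w_acc=0.65, w_rec=0.35):
    """0.65*accuracy + 0.35*recall for hard 0/1 predictions."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = correct / n
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    return w_acc * accuracy + w_rec * recall

def best_threshold(y_true, probs):
    """Sweep the observed probabilities as candidate thresholds
    and return the one maximizing the weighted metric."""
    best_t, best_score = 0.5, -1.0
    for t in sorted(set(probs)):
        y_pred = [1 if p >= t else 0 for p in probs]
        score = weighted_metric(y_true, y_pred)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score
```

In practice the probabilities would come from `predict_proba` of a fitted model, and for the 50-label problem you would tune one threshold per label on validation data.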

Best Answer

Your procedure isn't hacky; it's the correct thing to do.

As a thought experiment, imagine a situation where an oracle told you the true class probabilities

$$P(y \mid X)$$

If you actually knew this function, you would have complete knowledge about the situation.

Now suppose you need to make a hard classification, i.e., you actually need to take each possible value of $x$ and assign it to either the $1$ class or the $0$ class. Further, you would like to do so in order to minimize some cost function which depends on your choices.

Suppose you use a rule $R$ that does not arise from setting some threshold on the conditional probabilities, i.e. the direction of class assignment somewhere disagrees with the direction of the class probabilities. Then there would be two data points $(x_1, y_1)$ and $(x_2, y_2)$ with

$$ P(y_1 \mid x_1) < P(y_2 \mid x_2)$$

yet

$$ R(x_1) = 1, R(x_2) = 0 $$

Then, interpreting "average" as the behaviour I would observe across repeatedly sampled datasets including $x_1$ and $x_2$, the average accuracy of this rule is:

$$\frac{P(y_1 \mid x_1) + (1 - P(y_2 \mid x_2))}{2} = \frac{P(y_1 \mid x_1) - P(y_2 \mid x_2) + 1}{2}$$

While the average accuracy of the rule that makes the opposite assignment is:

$$ \frac{P(y_2 \mid x_2) - P(y_1 \mid x_1) + 1}{2} $$

Because $ P(y_1 \mid x_1) < P(y_2 \mid x_2) $, the average accuracy of the second rule is always larger. You should be able to make similar arguments for other hard-classification metrics, such as precision and recall.
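The comparison above can be checked numerically. The sketch below computes the expected accuracy over the two points for a rule that disagrees with the probability ordering and for the rule that swaps the assignments; the specific probabilities $0.3$ and $0.8$ are arbitrary values chosen only to satisfy $P(y_1 \mid x_1) < P(y_2 \mid x_2)$.

```python
def avg_accuracy(p1, p2, r1, r2):
    """Expected accuracy over two points, where p_i = P(y_i | x_i)
    and r_i is the hard class the rule assigns to x_i.
    A point assigned 1 is correct with probability p_i;
    a point assigned 0 is correct with probability 1 - p_i."""
    a1 = p1 if r1 == 1 else 1 - p1
    a2 = p2 if r2 == 1 else 1 - p2
    return (a1 + a2) / 2

p1, p2 = 0.3, 0.8          # P(y_1 | x_1) < P(y_2 | x_2)
disagree = avg_accuracy(p1, p2, 1, 0)  # (p1 - p2 + 1) / 2
agree = avg_accuracy(p1, p2, 0, 1)     # (p2 - p1 + 1) / 2
```

With these numbers, the rule that disagrees with the probability ordering scores $0.25$ while the swapped rule scores $0.75$, matching the two closed-form expressions above.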

So the direction of class assignment must agree with the directionality of the conditional class probabilities, otherwise a better rule is available (at least on average).

Note: It is important in the above argument that the evaluation metric be a function of only the true $y$ and classified $y$ alone. If, instead, it is a function of $x$ as well (say customers who own a home are more valuable to us than those without a home, where home ownership is a feature in our prediction scheme) then it may pay to set a different threshold for different subsets of our population. The argument above still tells us that our classification rule should be a monotonic function of the class probabilities within each subpopulation.
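To illustrate the note, the sketch below fits a separate threshold within each subpopulation. The subgroup names, toy labels, and probabilities are all hypothetical, and accuracy stands in for whatever group-dependent cost you actually care about; the point is only that each group can end up with a different cut-off while each group's rule remains monotone in its probabilities.

```python
def pick_threshold(y_true, probs):
    """Choose the threshold maximizing accuracy within one subgroup."""
    best_t, best_acc = 0.5, -1.0
    for t in sorted(set(probs)):
        acc = sum((p >= t) == bool(y) for y, p in zip(y_true, probs)) / len(y_true)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

# hypothetical subgroups (e.g. home owners vs. non-owners),
# each with its own (labels, predicted probabilities)
groups = {
    "owner": ([1, 1, 0], [0.7, 0.6, 0.5]),
    "non_owner": ([1, 0, 0], [0.9, 0.3, 0.2]),
}
thresholds = {g: pick_threshold(y, p) for g, (y, p) in groups.items()}
```

Here the two groups get different thresholds, but within each group the classification is still a monotonic function of the predicted probability, as the argument requires.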
