Solved – hack weighted loss function by creating multiple copies of data

classification, loss-functions, machine-learning, optimization

Suppose we want to build a binary classifier with a weighted loss, i.e., one that penalizes the two types of errors (false positives and false negatives) differently. At the same time, the software we are using does not support a weighted loss.

Can I hack it by manipulating my data?

For example, suppose we are working on a fraud detection problem (let's assume the prior is 50/50 fraud vs. normal here, although most fraud detection problems are extremely imbalanced), where we can afford some false positives (false alerts on normal transactions) but really want to avoid false negatives (missed detections of fraudulent transactions).

Let's say we want the loss ratio to be 1:5 (false positive : false negative); can we simply make 5 copies of every fraud transaction?

Intuitively, such copying changes the prior distribution, and the model becomes more likely to label a transaction as fraud, so false negatives should be reduced. For example, with the 50/50 prior above, making 5 copies of every fraud transaction shifts the empirical prior to 5:1 in favor of fraud.

My guess is that if we are truly minimizing 0-1 loss, this can do the trick, but if we are minimizing a proxy such as logistic or hinge loss (see this post), then this hack will not work well.

Any formal/mathematical explanations?

Best Answer

Yes you can (as long as your weights are integers, or rational to be pedantic, since rational weights can be multiplied by a common denominator to make them integers), though it's obviously not very efficient.

To see this, note that most loss functions can be written as $$\text{loss}(y, p) = \sum_{i=1}^n l(y_i, p_i),$$ where $p_i$ is the predicted value for observation $i$ and $l$ is a suitable per-observation loss.

We can easily transform this to a weighted loss function by introducing weights: $$\text{weighted loss}(y, p) = \sum_{i=1}^n w_i l(y_i, p_i)$$

Now we see that if we duplicate each observation $i$ a total of $w_i$ times and minimize the (unweighted) loss on the enlarged data set, this is equivalent to minimizing the weighted loss with weights $w_i$. Of course, duplicating something $\pi$ times is difficult, so make sure your weights are integers.
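
As a quick empirical check (a minimal sketch, not from the original answer), here is what this looks like with scikit-learn's LogisticRegression, which supports per-observation weights via `sample_weight`. The 1:5 weighting and the synthetic data are purely illustrative; the weighted fit and the fit on duplicated rows minimize the same objective, so their coefficients should agree up to solver tolerance.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Illustrative synthetic data; y == 1 plays the role of the fraud class.
X, y = make_classification(n_samples=200, random_state=0)

# Weight errors on the fraud class 5x more, as in the 1:5 example above.
weights = np.where(y == 1, 5, 1)

# Model A: pass the weights to the loss directly.
clf_weighted = LogisticRegression(max_iter=1000).fit(X, y, sample_weight=weights)

# Model B: duplicate each observation weights[i] times and fit an unweighted model.
X_dup = np.repeat(X, weights, axis=0)
y_dup = np.repeat(y, weights)
clf_duplicated = LogisticRegression(max_iter=1000).fit(X_dup, y_dup)

# Both fits minimize the same objective, so the coefficients should match
# up to solver tolerance.
print(clf_weighted.coef_)
print(clf_duplicated.coef_)
print(np.allclose(clf_weighted.coef_, clf_duplicated.coef_, atol=1e-3))
```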

Note that adding a regularization penalty to the loss function does not affect this reasoning.
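
To spell this out, suppose the software minimizes a sum of per-observation losses plus a penalty $\lambda R(\theta)$ on the parameters $\theta$ (the symbols $\lambda$, $R$, $\theta$ are mine, just to name the penalty). The penalty does not depend on the data, so duplicating observations leaves it unchanged: $$\sum_{i=1}^n w_i\, l(y_i, p_i) + \lambda R(\theta) \;=\; \sum_{i=1}^n \sum_{k=1}^{w_i} l(y_i, p_i) + \lambda R(\theta),$$ and the two regularized objectives have the same minimizer. (If the software instead averages the loss over observations, duplication changes the balance between loss and penalty, so it is worth checking which convention your software uses.)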