Solved – Evaluation metric for rare event probability regression

logistic, machine-learning, model-evaluation, rare-events

I'm working on a model to predict the probability of (very) rare events.
Out of 900,000 rows, only around 1,000 are "positive" events, and the correlation between the variables and the outcome is weak (i.e. the probability of a positive outcome is always < 0.5%).

I'm using logistic regression to predict the probability of such events, but I'm not sure which metric would best evaluate its performance. I'm afraid that some metrics (e.g. RMSE) would not work well for very rare events, where the probability to be estimated can be on the order of 1/10,000 (so even always predicting 0 would give a good score).

Best Answer

  • You could define a custom loss function $L(y,\hat y)$ that quantifies the trade-off between false positives and false negatives, and then use the expected loss on held-out data for evaluation. If we denote by $y_i\in\{0,1\}$ the correct label for row $i$ of the test data and by $\hat p_i$ the corresponding probability predicted by your classifier, the expected loss is \begin{equation} \frac 1 n \sum_{i=1}^n \hat p_i L(y_i,1) + (1-\hat p_i)L(y_i,0), \tag{1}\label{eq:exp_loss} \end{equation} where $n$ is the size of the test set.

    In the case of binary classification, the loss boils down to four numbers: $L(0,0)$, $L(0,1)$, $L(1,0)$, and $L(1,1)$. Typically one would set $L(0,0)=L(1,1)=0$. The important part is setting $L(0,1)$, the penalty for false positives, and $L(1,0)$, the penalty for false negatives. These are highly problem-specific, and you will have to judge for yourself how to set them (only their ratio matters). For example, if you are detecting cancer, then a false negative might be 100 times as bad as a false positive, so $L(0,1)=1$ and $L(1,0)=100$.
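As a minimal sketch, the expected loss in equation (1) can be computed in a few lines of numpy. The function name and the particular loss values (false negatives 100x worse than false positives, as in the cancer example) are illustrative assumptions, not part of the original answer:

```python
import numpy as np

def expected_loss(y, p, L01, L10, L00=0.0, L11=0.0):
    """Average of p*L(y,1) + (1-p)*L(y,0) over the test set, eq. (1).

    y   : true labels in {0, 1}
    p   : predicted probabilities of the positive class
    L01 : penalty for a false positive, L(0,1)
    L10 : penalty for a false negative, L(1,0)
    """
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    cost_if_pred_pos = np.where(y == 1, L11, L01)  # L(y_i, 1)
    cost_if_pred_neg = np.where(y == 1, L10, L00)  # L(y_i, 0)
    return np.mean(p * cost_if_pred_pos + (1 - p) * cost_if_pred_neg)

# Illustrative loss ratio: false negatives 100 times as bad as false positives.
loss = expected_loss(y=[1, 0, 0], p=[0.9, 0.1, 0.0], L01=1.0, L10=100.0)
```

Because only the ratio $L(1,0)/L(0,1)$ matters, rescaling both penalties by the same constant rescales the expected loss but leaves any model comparison unchanged.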

  • Alternatively, the F1 score might be suitable given your class imbalance. Since you have probabilistic predictions, you could use the "soft" versions $$ \mathrm{precision} = \frac{\sum_{i=1}^n y_i\hat p_i}{\sum_{i=1}^n \hat p_i}, $$ $$ \mathrm{recall} = \frac{\sum_{i=1}^n y_i\hat p_i}{\sum_{i=1}^n y_i}, $$ $$ F_1 = 2\cdot \frac{ \mathrm{recall} \cdot \mathrm{precision}}{\mathrm{recall} + \mathrm{precision}}. $$ The above gives equal importance to precision and recall; if that is not what you want, consider the $F_\beta$ score instead.
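The soft precision, recall, and F1 formulas above translate directly into numpy. The function name is an assumption for illustration; note the sketch does not guard against the degenerate case where all predicted probabilities (or all labels) are zero:

```python
import numpy as np

def soft_f1(y, p):
    """F1 score computed from probabilistic predictions, as defined above.

    y : true labels in {0, 1}
    p : predicted probabilities of the positive class
    """
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    tp = np.sum(y * p)               # "soft" true positives
    precision = tp / np.sum(p)       # soft TP / all predicted positives
    recall = tp / np.sum(y)          # soft TP / all actual positives
    return 2 * precision * recall / (precision + recall)
```

With hard 0/1 predictions this reduces to the ordinary F1 score, so the soft version can be seen as a smooth generalization that avoids choosing a classification threshold.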