Solved – Evaluation metric for rare event probability regression

logistic, machine-learning, model-evaluation, rare-events

I'm working on a model to predict the probability of (very) rare events.
Out of 900,000 rows, only around 1,000 are "positive" events, and the correlation between the variables and the outcome is weak (i.e. the probability of a positive outcome is always < 0.5%).

I'm using logistic regression to predict the probability of such events, but I'm not sure which metric would best evaluate its performance. I'm afraid that some metrics (e.g. RMSE) would not work well for very rare events, where the probability to be estimated can be on the order of 1/10,000 (so even always predicting 0 would give a good score).

Best Answer

  • You could define a custom loss function $L(y,\hat y)$ that quantifies the trade-off between false positives and false negatives, and then use the expected loss on held-out data for evaluation. If we denote by $y_i\in\{0,1\}$ the correct label for row $i$ of the test data and by $\hat p_i$ the corresponding probability predicted by your classifier, the expected loss is \begin{equation} \frac 1 n \sum_{i=1}^n \hat p_i L(y_i,1) + (1-\hat p_i)L(y_i,0), \tag{1}\label{eq:exp_loss} \end{equation} where $n$ is the size of the test set.

    In the case of binary classification, the loss boils down to four numbers: $L(0,0)$, $L(0,1)$, $L(1,0)$, and $L(1,1)$. Typically one would set $L(0,0)=L(1,1)=0$. The important part is setting $L(0,1)$, the penalty for false positives, and $L(1,0)$, the penalty for false negatives. These are highly problem-specific, and you will have to judge for yourself how to set them (only their ratio matters). For example, if you are detecting cancer, then a false negative might be 100 times as bad as a false positive, so $L(0,1)=1$ and $L(1,0)=100$.
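As a minimal sketch, the expected loss in equation (1) can be computed in a few lines of numpy. The function name and the particular loss values (false negatives 100x worse than false positives, as in the cancer example) are illustrative assumptions, not part of the original answer:

```python
import numpy as np

def expected_loss(y, p, L01, L10, L00=0.0, L11=0.0):
    """Average of p*L(y,1) + (1-p)*L(y,0) over the test set, eq. (1).

    y   : true labels in {0, 1}
    p   : predicted probabilities of the positive class
    L01 : penalty for a false positive, L(0,1)
    L10 : penalty for a false negative, L(1,0)
    """
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    cost_if_pred_pos = np.where(y == 1, L11, L01)  # L(y_i, 1)
    cost_if_pred_neg = np.where(y == 1, L10, L00)  # L(y_i, 0)
    return np.mean(p * cost_if_pred_pos + (1 - p) * cost_if_pred_neg)

# Illustrative loss ratio: false negatives 100 times as bad as false positives.
loss = expected_loss(y=[1, 0, 0], p=[0.9, 0.1, 0.0], L01=1.0, L10=100.0)
```

Because only the ratio $L(1,0)/L(0,1)$ matters, rescaling both penalties by the same constant rescales the expected loss but leaves any model comparison unchanged.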

  • Alternatively, the F1 score might be suitable given your class imbalance. Since you have probabilistic predictions, you could use the "soft" versions $$ \mathrm{precision} = \frac{\sum_{i=1}^n y_i\hat p_i}{\sum_{i=1}^n \hat p_i}, $$ $$ \mathrm{recall} = \frac{\sum_{i=1}^n y_i\hat p_i}{\sum_{i=1}^n y_i}, $$ $$ F_1 = 2\cdot \frac{ \mathrm{recall} \cdot \mathrm{precision}}{\mathrm{recall} + \mathrm{precision}}. $$ The above gives equal importance to precision and recall; if that is not what you want, consider the $F_\beta$ score instead.
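The soft precision, recall, and F1 formulas above translate directly into numpy. The function name is an assumption for illustration; note the sketch does not guard against the degenerate case where all predicted probabilities (or all labels) are zero:

```python
import numpy as np

def soft_f1(y, p):
    """F1 score computed from probabilistic predictions, as defined above.

    y : true labels in {0, 1}
    p : predicted probabilities of the positive class
    """
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)
    tp = np.sum(y * p)               # "soft" true positives
    precision = tp / np.sum(p)       # soft TP / all predicted positives
    recall = tp / np.sum(y)          # soft TP / all actual positives
    return 2 * precision * recall / (precision + recall)
```

With hard 0/1 predictions this reduces to the ordinary F1 score, so the soft version can be seen as a smooth generalization that avoids choosing a classification threshold.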