Solved – Confidence intervals for the Log Loss metric for model comparison

classification, confidence interval, cross-validation, model selection, predictive-models

Quite a few Kaggle competitions have used or are using the Logarithmic Loss metric as the quality measure of a submission.

I'm wondering if there are other ways besides N-fold cross-validation to calculate confidence intervals for this metric. If model X has a log loss of 0.123456 on the test set and model Y has a log loss of 0.123457, I'm sure you'll agree that model X is not significantly better than model Y, unless we're talking about a gazillion data points.

Why something other than N-fold cross-validation? Simple answer: performance. For a certain application I need to know whether model X is significantly better than model Y (when looking at Log Loss). In other words, I need to know whether the Log Loss for model X falls outside the 95% confidence interval for the Log Loss of model Y.

I need to do this comparison many, many times with different models and datasets that are coming in every day. Performance is crucial, so doing 10-fold cross-validation 1,000 times to get a rough estimate of the confidence intervals is not going to cut it, I'm afraid. The datasets for which I have to calculate the log loss usually range from roughly 50 positives and 10,000 negatives to 20,000 positives and 1 million negatives.

What would you advise?

Best Answer

A reasonable way to estimate a confidence interval for the log_loss metric is to assume that model X is a perfect model, meaning that X outputs the true class probabilities for every sample in the test set.

Consider the two-class case. The log_loss metric is then an average of $N$ independent random variables, where the $j$-th one takes the value $-\log(p_j)$ with probability $p_j$ and the value $-\log(1-p_j)$ with probability $(1-p_j)$. Here $N$ is the size of the test set and $p_j$ is the predicted probability of the positive class for sample $j$.
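Under this assumption the mean and variance of each per-sample loss term follow directly (my notation, with $\ell_j$ the loss contribution of sample $j$):

$$
\mathbb{E}[\ell_j] = -\bigl(p_j \log p_j + (1-p_j)\log(1-p_j)\bigr),
\qquad
\operatorname{Var}(\ell_j) = p_j(1-p_j)\left(\log\frac{p_j}{1-p_j}\right)^2,
$$

and by independence $\operatorname{Var}(\text{log\_loss}) = \frac{1}{N^2}\sum_j \operatorname{Var}(\ell_j)$.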

It is theoretically possible to compute the resulting distribution of log_loss analytically as a sum of random variables with known distributions, but this is quite challenging. It is easier to get a numerical approximation with a Monte Carlo procedure.

Here is a sketch of the algorithm:

  1. Compute the predictions $p_j$ of model X for each sample in the test set.

  2. Generate a fake class label for each sample: draw $u$ uniformly from $(0,1)$ and set the label to 1 if $u < p_j$, otherwise 0 (so each label is 1 with probability $p_j$).

  3. Compute log_loss of the predictions $p_j$ against the fake class labels.

  4. Repeat steps 2 and 3 many times.

  5. Compute the standard deviation of the log_loss values over the repetitions.

  6. Obtain the confidence interval by multiplying the standard deviation by the constant corresponding to the desired confidence level (e.g. 1.96 for 95%).
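A minimal Python sketch of these steps, assuming a binary problem and a 1-D NumPy array `p` of predicted positive-class probabilities (the function name `log_loss_ci` and its defaults are illustrative, not from the original answer):

```python
import numpy as np
from sklearn.metrics import log_loss

def log_loss_ci(p, n_sim=1000, z=1.96, seed=0):
    """Monte Carlo half-width of a log_loss confidence interval, treating the
    predictions p as the true class probabilities ('perfect model' assumption)."""
    rng = np.random.default_rng(seed)
    losses = np.empty(n_sim)
    for i in range(n_sim):
        # Step 2: fake labels, each equal to 1 with probability p_j
        y_fake = (rng.random(len(p)) < p).astype(int)
        # Step 3: log_loss of the fixed predictions against the fake labels
        losses[i] = log_loss(y_fake, p, labels=[0, 1])
    # Steps 5-6: std over repetitions, scaled to the desired confidence level
    return z * losses.std(ddof=1)
```

The comparison in the question then amounts to checking whether model X's log_loss falls outside model Y's interval, i.e. outside log_loss_Y ± `log_loss_ci(p_Y)`.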

I have evaluated this approach on a proprietary click-through dataset by comparing the estimated standard deviation to the standard deviation obtained by splitting the test set into multiple independent subsets of equal size and computing the log_loss standard deviation across them.
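A sketch of that cross-check (my own code, not the author's; `y_true` and `p` are NumPy arrays of true labels and predicted probabilities, and the $\sqrt{n}$ rescaling follows from the $a/\sqrt{N}$ behaviour reported in the next paragraph):

```python
import numpy as np
from sklearn.metrics import log_loss

def empirical_std(y_true, p, n_subsets=20, seed=0):
    """Empirical cross-check: log_loss spread across equal-sized random subsets
    of the test set, rescaled to the full test-set size."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y_true))
    losses = [
        log_loss(y_true[chunk], p[chunk], labels=[0, 1])
        for chunk in np.array_split(idx, n_subsets)
    ]
    # Each subset has roughly N / n_subsets samples; under the a / sqrt(N)
    # scaling, dividing by sqrt(n_subsets) maps the subset-level spread
    # back to the spread expected on the full test set.
    return np.std(losses, ddof=1) / np.sqrt(n_subsets)
```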

I have found that the standard deviation of the log_loss depends on the test-set size $N$ in the expected way: $$ \text{std} = \frac{a}{\sqrt{N}}, $$ where $a \approx 0.5$ and depends mildly on the average click-through rate.

From this, a very rough estimate (maybe up to a factor of 2) of the 95% confidence interval for log_loss evaluated on 10 million samples is $\pm 0.0003$.
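The arithmetic behind that number (assuming $a \approx 0.5$ and the 1.96 multiplier for a 95% interval):

$$
\text{std} \approx \frac{0.5}{\sqrt{10^7}} \approx 1.6 \times 10^{-4},
\qquad
1.96 \times 1.6 \times 10^{-4} \approx 3 \times 10^{-4}.
$$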

For the recent Kaggle CTR contest the final standings look like this:

  1. 0.3791384
  2. 0.3803652
  3. 0.3806351
  4. 0.3810307

Assuming 10 million records in the test set, I believe that the first-place score is significantly better than the second, but the difference between the second and third places may not be significant.