Let's suppose I have two models that both indicate the presence of some phenomenon:

- Model A: Only binary results, i.e., the phenomenon is present or not,
- Model B: Outputs class probabilities.

Of course, I could impose a decision rule on model B to derive binary results, too, but I am looking for methods to compare the *performance* (whatever that means; precision perhaps?) of both models, i.e., which model's decisions incur a lower cost.

There are approx. 20 data points, with expert-crafted ground truth for each. The experts rated the presence of the phenomenon on a scale of 0-10 (where 0 indicates complete absence and 10 a strong manifestation).

As for model A, it was agreed beforehand that a ground truth <= 5 means the phenomenon is **not** present.

As for model B, the probability of the most likely class is scaled by a factor of 10 so that its deviation from the ground truth can be measured.

The features used for crafting the ground truth are distinct from the features used in the models. Model A uses thresholds and indicator functions to indicate the phenomenon; model B is a regression model that outputs class probabilities. Would it be fair to compare both models with the Brier score, by pretending that model A outputs the probabilities 0/1? Looking at the definition of the Brier score, this seems appropriate to me.

**Question 1**: What can I actually compare between the two models? Precision? Accuracy? Something else?

**Question 2**: How do I perform these comparisons between the two model types? Ideally, I could derive statements such as "one model *outperforms* (whatever that could mean) the other."

## Best Answer

The ground-truth data is labeled on a scale from 0 to 10. For convenience and without loss of information, you can rescale the experts' labels to the [0,1] range (divide them by 10). From now on let's assume both the predictions and the labels are on the same scale.

Since the labels themselves are proportions, not yes/no labels, the measure $$ \frac{1}{n} \sum_{i=1}^n (\text{prediction}_i - \text{truth}_i)^2 $$ is more correctly called **mean squared error**, not Brier score. You can use it to compare the performance of the two models and choose the one with the smaller MSE.

Since the experts themselves didn't make a hard decision about the presence or absence of the phenomenon, only about the strength of its presence, binary classification metrics such as accuracy don't seem appropriate. And you can't compute them unless you go back to the experts and ask them to binarize their labels.
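As a minimal sketch of the comparison, assuming the expert ratings have been rescaled to [0,1] (all numbers below are hypothetical, not the question's actual data):

```python
# Sketch: compare models A and B by mean squared error (MSE).
# All values are made-up illustrative numbers.

truth = [t / 10 for t in [2, 7, 9, 1, 5, 8]]    # expert ratings rescaled to [0, 1]
pred_a = [0, 1, 1, 0, 0, 1]                     # model A: hard 0/1 decisions
pred_b = [0.1, 0.8, 0.95, 0.2, 0.4, 0.7]        # model B: class probabilities

def mse(pred, truth):
    """Mean squared error between predictions and labels."""
    return sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth)

print(mse(pred_a, truth))
print(mse(pred_b, truth))
```

The model with the smaller MSE is the better-calibrated one under this criterion; with only ~20 points, though, the difference may not be stable.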

**Update**: Model A can achieve a smaller MSE by outputting fractional labels instead of 0s and 1s. This is a bit contrived. (But so is the use of a binary classifier to describe a complex phenomenon that even the experts rate on a scale.)

Suppose there are expert-assigned labels for some training data. (Don't use the 20 instances set aside for comparing models A and B, or you will give A an unrealistic advantage.) Apply model A to these training examples to get the TPs, FNs, FPs, and TNs (true positives, false negatives, etc.). For the comparison with model B, you can adjust model A to output the probability p0 = {average label of TNs & FNs} instead of 0 and the probability p1 = {average label of TPs & FPs} instead of 1; these values minimize its MSE on the training examples.
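The adjustment above can be sketched as follows (the training labels and model-A decisions are hypothetical):

```python
# Sketch: replace model A's hard 0/1 outputs with the MSE-minimizing
# constants p0 and p1 estimated from training labels (hypothetical data).

train_truth = [0.1, 0.3, 0.9, 0.8, 0.2, 0.6]   # expert labels, rescaled to [0, 1]
train_pred  = [0,   0,   1,   1,   0,   1]     # model A's binary decisions

# p0: average label where A said "absent" (TNs & FNs)
# p1: average label where A said "present" (TPs & FPs)
neg = [t for t, p in zip(train_truth, train_pred) if p == 0]
pos = [t for t, p in zip(train_truth, train_pred) if p == 1]
p0 = sum(neg) / len(neg)
p1 = sum(pos) / len(pos)

def model_a_soft(decision):
    """Model A's adjusted output: p0 instead of 0, p1 instead of 1."""
    return p1 if decision == 1 else p0

print(p0, p1)
```

These constants are just the per-group means, which is exactly what minimizes squared error within each group of model A's outputs on the training set.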