Machine Learning – Combining Classifiers by Flipping a Coin: A Fun and Practical Guide

Tags: classification, data visualization, machine learning, probability, roc

I am taking a machine learning course, and the lecture slides contain a claim that I find contradicts the recommended book.

The problem is the following: there are three classifiers:

- classifier A, which performs better in the lower range of the thresholds,
- classifier B, which performs better in the higher range of the thresholds,
- classifier C, which we get by flipping a p-coin and selecting one of the two classifiers.

What will be the performance of classifier C, as viewed on a ROC curve?

The lecture slides state that just by flipping this coin, we get the magical "convex hull" of the ROC curves of classifiers A and B.

I don't understand this point. Just by simply flipping a coin, how can we gain information?

The lecture slide

[Image: lecture slides]

What the book says

The recommended book (Data Mining… by Ian H. Witten, Eibe Frank and Mark A. Hall) on the other hand states that:

To see this, choose a particular probability cutoff for method A that
gives true and false positive rates of tA and fA, respectively, and
another cutoff for method B that gives tB and fB. If you use these two
schemes at random with probabilities p and q, where p + q = 1, then
you will get true and false positive rates of p·tA + q·tB and
p·fA + q·fB. This represents a point lying on the straight line
joining the points (tA, fA) and (tB, fB), and by varying p and q you
can trace out the whole line between these two points.

As I understand it, the book says that to actually gain information and reach the convex hull, we need to do something more advanced than simply flipping a p-coin.

AFAIK, the correct way (as suggested by the book) is the following:

we should find an optimal threshold Oa for classifier A
we should find an optimal threshold Ob for classifier B
define C as follows:
- If t < Oa, use classifier A with t
- If t > Ob, use classifier B with t
- If Oa < t < Ob, pick at random between classifier A with Oa and classifier B with Ob, with a probability that varies linearly with the position of t between Oa and Ob.

Is this correct? If yes, there are a few key differences compared to what the slides suggest.

It's not a simple coin flip, but a more advanced algorithm that needs manually defined points and picks a classifier based on the region we fall into.
It never uses classifier A and B with threshold values between Oa and Ob.

Can you explain this problem to me, and the correct way to understand it if my understanding is wrong?

What would happen if we simply flipped a p-coin, as the slides suggest? I would think that we'd get a ROC curve that lies between A and B, but is never "better" than the better of the two at a given point.

As far as I can see, I really don't understand how the slides could be correct. The probabilistic calculation on the left hand side doesn't make sense to me.

Update:
Found the article written by the original author who invented the convex hull method:
http://www.bmva.org/bmvc/1998/pdf/p082.pdf

Best Answer

(Edited)

The lecture slides are right.

Method A has an "optimal point" that gives certain true and false positive rates (TPA and FPA in the graph). This point corresponds to a threshold or, more generally[*], to an optimal decision boundary for A. The same goes for B. (But the thresholds and boundaries of the two methods are unrelated.)

Classifier A performs well under the preference "minimize false positives" (conservative strategy), and classifier B performs well when we want to "maximize true positives" (eager strategy).

~~The answer to your first question is basically yes, except that the probability of the coin is (in some sense) arbitrary. The final classifier would be:~~

- If $x$ belongs to the "optimal acceptance region for A" (conservative), use classifier A (i.e., accept it).
- If $x$ belongs to the "optimal rejection region for B" (eager), use classifier B (i.e., reject it).
- Elsewhere, flip a coin with probability $p$ and use classifier A or B.

(Corrected: actually, the lecture slides are completely right; we can just flip the coin in every case. See the diagrams.)

You can use any fixed $p$ in the range (0,1). The choice depends on whether you want to be more or less conservative, i.e., whether you want to be nearer to one of the two points or in the middle.
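As a minimal sketch of how $p$ positions the mixed classifier, the snippet below solves for the coin probability that hits a chosen false positive rate between the two operating points. The operating points, the target rate, and the helper name `coin_for_fpr` are hypothetical illustrations, not values read from the slides.

```r
# Hypothetical operating points (FPR, TPR) of the two fixed classifiers.
A <- c(fpr = 0.10, tpr = 0.60)   # conservative
B <- c(fpr = 0.50, tpr = 0.95)   # eager

# Probability p of using A such that the mixture has the requested FPR.
# Follows from fpr_C = p * fpr_A + (1 - p) * fpr_B.
# Assumes the two operating points have different FPRs.
coin_for_fpr <- function (target_fpr, A, B) {
  stopifnot (target_fpr >= min (A["fpr"], B["fpr"]),
             target_fpr <= max (A["fpr"], B["fpr"]))
  unname ((target_fpr - B["fpr"]) / (A["fpr"] - B["fpr"]))
}

p <- coin_for_fpr (0.30, A, B)
mixed <- p * A + (1 - p) * B   # operating point of the randomized classifier
print (p)
print (mixed)
```

A target halfway between the two false positive rates gives a fair coin, and the true positive rate then also lands halfway between those of A and B.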

[*] You should be general here: if you think in terms of a single scalar threshold, all this makes little sense. A one-dimensional feature with a threshold-based classifier does not give you enough degrees of freedom to have different classifiers such as A and B that perform along different curves when the free parameter (decision boundary = threshold) varies. In other words, A and B are called "methods" or "systems", not "classifiers", because each of them is a whole family of classifiers, parametrized by some (scalar) parameter that determines a decision boundary, and that boundary is not just a scalar.

I added some diagrams to make this clearer:


Suppose a two-dimensional feature. The diagram displays some samples: the green points are the "good" ones, the red points the "bad" ones. Suppose that method A has a tunable parameter $t$ (threshold, offset, bias); higher values of $t$ make the classifier more eager to accept ('Yes'). The orange lines correspond to the decision boundaries of this method for different values of $t$. This method (actually a family of classifiers) performs particularly well for $t_A=2$, in the sense that it has very few false positives for a moderate number of true positives. By contrast, method B (blue), which has its own tunable parameter $t$ (unrelated to that of A), performs particularly well ($t_B=4$) in the region of high acceptance: the solid blue line attains a high true positive rate.

In this scenario, then, one can say that the solid orange line is the "optimal A classifier" (within its family), and the same for B. But one cannot tell whether the orange line is better than the blue line: one performs better when we assign a high cost to false positives, the other when false negatives are much more costly.
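To make the dependence on costs concrete, here is a small sketch that compares the expected cost per sample of two operating points under two opposite cost settings. All numbers (operating points, class prior, costs) and the helper `expected_cost` are assumptions chosen for illustration.

```r
# Hypothetical operating points and class prior; all numbers are illustrative.
A <- c(fpr = 0.10, tpr = 0.60)   # conservative
B <- c(fpr = 0.50, tpr = 0.95)   # eager
prior_pos <- 0.5

# Expected cost per sample: false positives occur on negatives,
# false negatives occur on positives.
expected_cost <- function (pt, cost_fp, cost_fn, prior_pos) {
  unname (cost_fp * (1 - prior_pos) * pt["fpr"] +
          cost_fn * prior_pos * (1 - pt["tpr"]))
}

# False positives ten times as costly as false negatives.
fp_heavy <- c(A = expected_cost (A, 10, 1, prior_pos),
              B = expected_cost (B, 10, 1, prior_pos))
# False negatives ten times as costly as false positives.
fn_heavy <- c(A = expected_cost (A, 1, 10, prior_pos),
              B = expected_cost (B, 1, 10, prior_pos))

print (rbind (fp_heavy, fn_heavy))
```

With these made-up numbers, A is cheaper when false positives dominate the cost and B is cheaper when false negatives do, so neither point is better in absolute terms.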


Now, it might happen that these two classifiers are too extreme for our needs, and we'd like both types of errors to have similar weights. Instead of using classifier A (orange dot) or B (blue dot), we'd prefer to attain a performance that lies in between them. As the course says, one can attain that result by just flipping a coin and choosing one of the classifiers at random.

Just by simply flipping a coin, how can we gain information?

We don't gain information. Our new randomized classifier is not simply "better" than A or B; its performance is a sort of average of A and B with respect to the costs assigned to each type of error. That may or may not be beneficial to us, depending on our costs.
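The "average" can be written out explicitly. Assuming the coin is flipped independently of the sample and $p$ is the probability of picking A, the law of total probability gives the rates of the randomized classifier C (the symbols $TP_C$ and $FP_C$ are introduced here for illustration; $TP_A$, $FP_A$, $TP_B$, $FP_B$ are the rates of the two fixed classifiers):

$$TP_C = p\,TP_A + (1-p)\,TP_B, \qquad FP_C = p\,FP_A + (1-p)\,FP_B$$

As $p$ goes from 0 to 1, C moves along the straight segment from B to A in ROC space. It never lies above that segment, which is why no information is gained.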

AFAIK, the correct way (as suggested by the book) is the following ... Is this correct?

Not really. The correct way is simply: flip a coin with probability $p$, choose a classifier (the optimal A or the optimal B) and classify using that classifier.
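A quick simulation can illustrate this recipe. It is only a sketch under stated assumptions: the data are synthetic, the two fixed classifiers `clf_A` and `clf_B` are hypothetical stand-ins for "the optimal A" and "the optimal B" (each thresholds a different feature), and $p$ is taken as the probability of using A.

```r
set.seed (1)

# Synthetic two-feature data: positives are shifted on both features.
n <- 20000
y <- rbinom (n, 1, 0.5)
x1 <- rnorm (n, mean = ifelse (y == 1, 1.5, 0))
x2 <- rnorm (n, mean = ifelse (y == 1, 1.5, 0))

# Two fixed classifiers; the thresholds are illustrative assumptions.
clf_A <- function (x1, x2) x1 > 1.5   # conservative: few false positives
clf_B <- function (x1, x2) x2 > 0.0   # eager: many true positives

# Empirical false and true positive rates of a vector of predictions.
rates <- function (pred, y) {
  c(fpr = mean (pred[y == 0]), tpr = mean (pred[y == 1]))
}

# Classifier C: for each sample, flip a p-coin and use A (heads) or B (tails).
p <- 0.3
use_A <- runif (n) < p
pred_C <- ifelse (use_A, clf_A (x1, x2), clf_B (x1, x2))

r_A <- rates (clf_A (x1, x2), y)
r_B <- rates (clf_B (x1, x2), y)
r_C <- rates (pred_C, y)
r_line <- p * r_A + (1 - p) * r_B   # point on the segment between A and B

print (rbind (A = r_A, B = r_B, C = r_C, line = r_line))
```

Up to sampling noise, the rates of C should coincide with the point on the segment between A and B given by the same $p$.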

Related Solutions

Solved – ROC curve for discrete classifiers like SVM: Why do we still call it a “curve”? Isn’t it just a “point”?

Yes, there are situations where the usual receiver operating characteristic (ROC) curve cannot be obtained and only one point exists.
SVMs can be set up so that they output class membership probabilities. These would be the usual values for which a threshold is varied to produce a ROC curve.
Is that what you are looking for?
Steps in the ROC curve usually come from a small number of test cases rather than from any discrete variation in the covariate (in particular, you end up with the same points if you choose your discrete thresholds so that only one sample changes its assignment with each new point).
Continuously varying other (hyper)parameters of the model of course produces sets of specificity/sensitivity pairs that give other curves in the FPR;TPR coordinate system.
The interpretation of a curve of course depends on which variation generated the curve.
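The origin of the steps can be seen by building an empirical ROC by hand. This is a minimal sketch in base R with made-up scores and labels (no ties assumed); it is not the code used for the plots in this answer.

```r
# Made-up scores and true labels for six test cases (no tied scores).
scores <- c(0.95, 0.80, 0.70, 0.55, 0.40, 0.20)
labels <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

# Sort by decreasing score: lowering the threshold accepts one more case each time.
ord <- order (scores, decreasing = TRUE)
lab <- labels[ord]

# Rates after accepting the k highest-scoring cases, k = 0, ..., n.
tpr <- c(0, cumsum (lab) / sum (lab))
fpr <- c(0, cumsum (!lab) / sum (!lab))

roc <- data.frame (fpr = fpr, tpr = tpr)
print (roc)
plot (roc$fpr, roc$tpr, type = "b", xlim = 0:1, ylim = 0:1)
```

Each test case contributes exactly one step, vertical for a positive and horizontal for a negative, so a small test set gives a visibly jagged curve.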

Here's a usual ROC (i.e. requesting probabilities as output) for the "versicolor" class of the iris data set:

FPR;TPR (γ = 1, C = 1, varying probability threshold):

The same type of coordinate system, but TPR and FPR as function of the tuning parameters γ and C:

FPR;TPR (varying γ, C = 1, probability threshold = 0.5):
FPR;TPR (γ = 1, varying C, probability threshold = 0.5):

These plots do have a meaning, but the meaning is decidedly different from that of the usual ROC!

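Here's the R code I used:

library (e1071)  # svm ()
library (ROCR)   # prediction (), performance ()

# Fit an SVM on iris and return the ROC points (FPR, TPR, threshold)
# for the "versicolor" class at the given cost and gamma.
svmperf <- function (cost = 1, gamma = 1) {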
    model <- svm (Species ~ ., data = iris, probability=TRUE, 
                  cost = cost, gamma = gamma)
    pred <- predict (model, iris, probability=TRUE, decision.values=TRUE)
    prob.versicolor <- attr (pred, "probabilities")[, "versicolor"]

    roc.pred <- prediction (prob.versicolor, iris$Species == "versicolor")
    perf <- performance (roc.pred, "tpr", "fpr")

    data.frame (fpr = perf@x.values [[1]], tpr = perf@y.values [[1]], 
                threshold = perf@alpha.values [[1]], 
                cost = cost, gamma = gamma)
}

# Sweep the cost parameter over 2^-10 ... 2^10 at gamma = 1.
df <- data.frame ()
for (cost in -10:10)
  df <- rbind (df, svmperf (cost = 2^cost))
head (df)
plot (df$fpr, df$tpr)

# For each cost value, keep only the ROC point at probability threshold 0.5.
cost.df <- split (df, df$cost)

cost.df <- sapply (cost.df, function (x) {
    i <- approx (x$threshold, seq (nrow (x)), 0.5, method="constant")$y 
    x [i,]
})

cost.df <- as.data.frame (t (cost.df))
plot (cost.df$fpr, cost.df$tpr, type = "l", xlim = 0:1, ylim = 0:1)
points (cost.df$fpr, cost.df$tpr, pch = 20, 
        col = rev(rainbow(nrow (cost.df),start=0, end=4/6)))

# Sweep the gamma parameter over 2^-10 ... 2^10 at cost = 1.
df <- data.frame ()
for (gamma in -10:10)
  df <- rbind (df, svmperf (gamma = 2^gamma))
head (df)
plot (df$fpr, df$tpr)

gamma.df <- split (df, df$gamma)

gamma.df <- sapply (gamma.df, function (x) {
     i <- approx (x$threshold, seq (nrow (x)), 0.5, method="constant")$y
     x [i,]
})

gamma.df <- as.data.frame (t (gamma.df))
plot (gamma.df$fpr, gamma.df$tpr, type = "l", xlim = 0:1, ylim = 0:1, lty = 2)
points (gamma.df$fpr, gamma.df$tpr, pch = 20, 
        col = rev(rainbow(nrow (gamma.df),start=0, end=4/6)))

# The usual ROC: fixed cost = 1 and gamma = 1, varying probability threshold.
roc.df <- subset (df, cost == 1 & gamma == 1)
plot (roc.df$fpr, roc.df$tpr, type = "l", xlim = 0:1, ylim = 0:1)
points (roc.df$fpr, roc.df$tpr, pch = 20, 
        col = rev(rainbow(nrow (roc.df),start=0, end=4/6)))

Solved – Evaluation of classifiers: learning curves vs ROC curves

A learning curve is only a diagnostic tool: it tells you how fast your model learns and whether your whole analysis is stuck in a quirky regime of too-small sets or a too-small ensemble (if applicable). The only part of this plot that is interesting for model assessment is its end, i.e. the final performance, but that does not need a plot to be reported.
Selecting a model based on a learning curve as you sketched in your question is a rather poor idea, because you are likely to select the model that is best at overfitting on a sample that is too small.

About ROCs... A ROC curve is a method to assess binary models that produce a confidence score that an object belongs to one class, and possibly also to find the best thresholds for converting them into actual classifiers.
What you describe is rather an idea to plot your classifiers' performance as a scatterplot of TPR/FPR in ROC space and use a closest-to-top-left-corner criterion to select the one that is best balanced between generating false alarms and misses. This particular aim can be achieved more elegantly by simply selecting the model with the best F-score (harmonic mean of precision and recall).
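The two selection criteria can be compared side by side. The sketch below uses hypothetical confusion counts for three classifiers on the same test set; the model names and counts are made up for illustration. The criteria are not equivalent in general (the F-score depends on class balance through precision), so they need not always pick the same model.

```r
# Hypothetical confusion counts for three classifiers on one test set
# (100 positives, 200 negatives).
models <- data.frame (
  name = c("m1", "m2", "m3"),
  tp = c(80, 60, 95),
  fn = c(20, 40, 5),
  fp = c(30, 5, 60),
  tn = c(170, 195, 140)
)

models$tpr <- models$tp / (models$tp + models$fn)         # recall
models$fpr <- models$fp / (models$fp + models$tn)
models$precision <- models$tp / (models$tp + models$fp)

# Euclidean distance to the ideal corner (FPR = 0, TPR = 1); smaller is better.
models$dist <- sqrt (models$fpr^2 + (1 - models$tpr)^2)
# F1: harmonic mean of precision and recall; larger is better.
models$f1 <- 2 * models$precision * models$tpr /
             (models$precision + models$tpr)

best_by_dist <- as.character (models$name[which.min (models$dist)])
best_by_f1 <- as.character (models$name[which.max (models$f1)])
print (models[, c("name", "fpr", "tpr", "dist", "f1")])
```

For these particular counts both criteria select the same model, `m1`.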
