Solved – R – Plotting a ROC curve for a Naive Bayes classifier using ROCR. Not sure if I’m plotting it correctly

classification, machine learning, naive bayes, r, roc

I have a Naive Bayes classifier that I'm using to try to predict whether a game will be won or lost based on historical data. The model has 25 variables in total, all of which are categorical factors. The class node is the game's "Status", which is binary with two outcomes: won and lost.

I'm using the bnlearn package to build the classifier, and I'm plotting the ROC curves with the ROCR package. My problem essentially comes from not understanding what I should be plotting. I know that ROC curves plot the TPR against the FPR. But should I be plotting the predicted probabilities against the actual game outcomes? Or should I be plotting the predicted labels against the actual game outcomes?

Here is a sample of my code, which will hopefully help you understand what I mean.

library(bnlearn)

#BUILD THE NAIVE BAYES MODEL
naive = naive.bayes(trainingdata, "Status")
fitted = bn.fit(naive, trainingdata, method = "bayes", iss = 1)

##RUN PREDICTION AND GET THE PREDICTED PROBABILITIES
pred = predict(fitted, testdata, prob = TRUE)
results_prob = data.frame(t(attributes(pred)$prob))

Now, after doing this I have the prediction for each row in the testdata, along with the probabilities associated with each prediction. It looks something like this:

x <- cbind(pred, results_prob)
head(x)

  pred         Lost          Won
1 Lost 0.9926455284 7.354472e-03
2  Won 0.3744111013 6.255889e-01
3 Lost 0.9978362577 2.163742e-03
4  Won 0.0001894814 9.998105e-01
5 Lost 0.9999974266 2.573381e-06
6 Lost 0.6750732745 3.249267e-01

Right, so now I'm going to try to plot a ROC curve. I already have the actual outcomes of the games saved in realResults, which looks like this:

head(data.frame(realResults))

     realResults
1        Lost
2        Lost
3        Lost
4         Won
5        Lost
6        Lost

So here's what happens when I try to plot a ROC curve. There are three ways I can try it, one of which works, but I'm not sure whether it's the correct way, which is why I've come here in the hope that someone can explain whether it is or isn't correct.

If I try to plot it with the prediction labels against the actual outcomes, like

library(ROCR)
pr <- prediction(pred, realResults)
prf <- performance(pr, "tpr", "fpr")
plot(prf)

I will get an error saying Error in prediction(pred, realResults) :
Format of predictions is invalid.

If I try it by plotting the probabilities against the actual outcomes, like

pr <- prediction(results_prob, realResults)

I will get an error saying Error in prediction(results_prob, realResults) :
Number of cross-validation runs must be equal for predictions and labels.

Finally, a ROC curve does get plotted if I use only the probabilities of a win:

pr <- prediction(results_prob$Won, realResults)
prf <- performance(pr, "tpr", "fpr")
plot(prf)

This gets an AUC of 0.94.
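For intuition about what this third call computes, the (TPR, FPR) points that ROCR derives from a numeric score vector can be reproduced by hand in base R. A minimal sketch, using made-up probabilities and labels rather than the real model output:

```r
# Sweep a threshold over P(Won) and record (TPR, FPR) at each cut-off;
# this is the curve ROCR traces when given a numeric score vector.
# The scores and labels below are made up for illustration.
probs  <- c(0.007, 0.626, 0.002, 0.999, 0.300, 0.325)
actual <- c("Lost", "Won", "Lost", "Won", "Lost", "Lost")

roc_points <- function(scores, labels, positive = "Won") {
  cuts <- sort(unique(scores), decreasing = TRUE)
  t(sapply(cuts, function(cut) {
    hit <- scores >= cut   # predicted positive at this threshold
    c(tpr = sum(hit & labels == positive) / sum(labels == positive),
      fpr = sum(hit & labels != positive) / sum(labels != positive))
  }))
}

pts <- roc_points(probs, actual)
pts
```

Each row is one point on the curve, from the strictest threshold (top) to the loosest (bottom, which always gives TPR = FPR = 1). Note this needs the continuous scores: a vector of hard labels only yields a single threshold, i.e. a single point.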

So my question is: have I done this correctly? I can't understand why the first method of plotting a ROC doesn't work, since the predictions and the results have the same format: they both have two labels, won and lost. I don't understand the error I'm getting for that one.

Can anyone offer some insight into this? The model's accuracy is about 88%, so there's no real problem there. It's just with plotting the ROCs. I want to ensure I'm going about it correctly. Apologies for the long post! Thanks

EDIT: I've been reading up on ROC curves, but I still haven't been able to work out whether I'm going about this correctly.

Best Answer

The problem may lie here:

pr <- prediction(pred, realResults)

You can transform "pred" and "realResults" into 0-1 vectors with:

predvec <- ifelse(pred=="Lost", 1, 0)
realvec <- ifelse(realResults=="Lost", 1, 0)

and then calling:

pr <- prediction(predvec, realvec)

should solve the problem.
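To illustrate the conversion, here it is on a toy pair of factor vectors (the values are stand-ins for the real pred and realResults, which bnlearn returns as factors):

```r
# Toy stand-ins for pred and realResults (factors, as in the question)
pred        <- factor(c("Lost", "Won", "Lost", "Won", "Lost", "Lost"))
realResults <- factor(c("Lost", "Lost", "Lost", "Won", "Lost", "Lost"))

# Map the labels onto numeric 0/1 vectors
predvec <- ifelse(pred == "Lost", 1, 0)
realvec <- ifelse(realResults == "Lost", 1, 0)

predvec   # numeric now, which is the format prediction() expects
```

One caveat: a 0/1 label vector contains only one possible threshold, so the resulting "curve" is a single operating point joined to the corners; the probability-based call in the question (results_prob$Won) is what produces a full curve.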

Bonus part:

You can plot the ROC curve with more information (here prf is the "tpr"/"fpr" performance object from above):

plot(prf, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))

And a simple way to get the AUC (pr being the prediction object):

as.numeric(performance(pr, "auc")@y.values)
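That AUC value can also be sanity-checked without ROCR via the rank (Wilcoxon) formulation: the AUC equals the probability that a randomly chosen positive case scores above a randomly chosen negative one, with ties counting one half. A base-R sketch, again with made-up scores rather than the real model output:

```r
# AUC as P(score of a random Won > score of a random Lost),
# counting ties as 1/2. Scores and labels are made up.
auc_rank <- function(scores, labels, positive = "Won") {
  pos <- scores[labels == positive]
  neg <- scores[labels != positive]
  mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
}

probs  <- c(0.007, 0.626, 0.002, 0.999, 0.300, 0.325)
actual <- c("Lost", "Won", "Lost", "Won", "Lost", "Lost")
auc_rank(probs, actual)   # 1 here: every Won outscores every Lost
```

On real data this should agree with performance(pr, "auc"), since both compute the area under the empirical ROC curve.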