Solved – Why are the ROC curves not smooth

classification, roc, scikit-learn

The following are some performance results that I got from the currently trained model on both the training and validation data sets. There are 3 classes with imbalanced training samples. I use sklearn.metrics to compute the metrics with average='weighted'.

results on training and validation datasets

And the following are the ROC curves (the first is from the training data set and the second is from the validation data set).

ROC curve on training data set
ROC curve on validation data set

Class 0 (denoted as C0) is the background class, while Class 1 (C1) and Class 2 (C2) are the positive classes. I want to increase the accuracy on both C1 and C2. The ROC curves do not look smooth. Is this a valid model? What can I learn from these results? How can I improve them, especially with respect to the class imbalance problem? Any comments are appreciated. Thanks!

UPDATED:
The source code is as follows:

code

Best Answer

I know the question is two years old and the technical answer was given in the comments, but a more elaborate answer might help others still struggling with the concepts.

The OP's ROC curves are wrong because they were built from the model's predicted class labels instead of the predicted probabilities.

What does this mean?

When a model is trained, it learns the relationship between the input variables and the output variable. From the observations it is shown, it learns how probable it is that an observation belongs to a certain class. When the model is then presented with the test data, it estimates, for each unseen observation, the probability that it belongs to each class.

How does the model decide whether an observation belongs to a class? Suppose that during testing the model receives an observation for which it estimates a probability of 51% of belonging to Class X. How does it decide whether to label it as Class X or not? The researcher sets a threshold, telling the model that all observations with a probability below 50% must be classified as Y and all those above as X. Sometimes the researcher wants a stricter rule, because they are more interested in correctly predicting a particular class, such as X, than in predicting every class equally well.

So your trained model has estimated a probability for each of your observations, but the threshold ultimately decides which class each observation is assigned to.
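To make this concrete, here is a minimal sketch (with made-up probabilities, not the OP's data) showing how the same predicted probabilities give different labels under different thresholds:

import numpy as np

# Hypothetical predicted probabilities for the positive class
proba_positive = np.array([0.15, 0.51, 0.73, 0.40, 0.95])

# Default-style rule: label as positive when the probability is at least 0.5
labels_default = (proba_positive >= 0.5).astype(int)   # [0, 1, 1, 0, 1]

# Stricter rule: only label as positive when the model is very confident
labels_strict = (proba_positive >= 0.8).astype(int)    # [0, 0, 0, 0, 1]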

Why does this matter?

The ROC curve plots one point, the (false positive rate, true positive rate) pair of your model, for each threshold level. This lets the researcher see the trade-off between the FPR and TPR across all threshold levels.

So when you pass the predicted labels instead of the predicted probabilities to your ROC function, you effectively get only one point, because those labels were produced with one specific threshold; that point is the TPR and FPR of your model at that single threshold level.
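You can see this directly with sklearn.metrics.roc_curve; the example below uses a small made-up binary problem, not the OP's data:

from sklearn.metrics import roc_curve

# Hypothetical true labels, predicted probabilities, and predicted labels
y_true  = [0, 0, 1, 1, 0, 1]
y_proba = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]   # predict_proba output for the positive class
y_pred  = [0, 0, 0, 1, 0, 1]                # predict output (fixed 0.5 threshold)

# Passing probabilities: many thresholds, hence many (FPR, TPR) points
fpr_p, tpr_p, thresholds_p = roc_curve(y_true, y_proba)

# Passing hard labels: only two distinct score values (0 and 1), so the
# "curve" collapses to the corners plus a single interior point
fpr_l, tpr_l, thresholds_l = roc_curve(y_true, y_pred)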

What you need to do is use the probabilities instead and let the threshold vary.

Run your model as such:

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier()
knn_model = knn.fit(X_train, y_train)

# Predicted class labels: use these for the confusion matrix
knn_y_model = knn_model.predict(X=X_test)

# Predicted probabilities: use these for the ROC and precision-recall curves
knn_y_proba = knn_model.predict_proba(X=X_test)

When creating your confusion matrix you will use the predicted labels from your model:

import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
from mlxtend.plotting import plot_confusion_matrix

fig, ax = plot_confusion_matrix(conf_mat=confusion_matrix(y_test, knn_y_model),
                                show_absolute=True, show_normed=True, colorbar=True)
plt.title("Confusion matrix - KNN")
plt.ylabel('True label')
plt.xlabel('Predicted label')

When creating your ROC curve you will use the predicted probabilities:

import scikitplot as skplt
plot = skplt.metrics.plot_roc(y_test, knn_y_proba)
plt.title("ROC Curves - K-Nearest Neighbors")