To get the accuracy of the prediction you can do:
print(accuracy_score(expected, y_1))
If you want several metrics at once, such as precision, recall and the F1-score, you can get a classification report:
print(classification_report(expected, y_1))
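For context, here is a minimal, self-contained sketch; the iris data and DecisionTreeClassifier are just stand-ins for your own data and model, and expected/y_1 mirror the variable names above:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# stand-in data and model; swap in your own
X, y = load_iris(return_X_y=True)
X_train, X_test, expected_train, expected = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, expected_train)
y_1 = clf.predict(X_test)

print(accuracy_score(expected, y_1))
print(classification_report(expected, y_1))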
A confusion matrix tells you, for each true label, how many samples were assigned to each predicted label. This shows you whether your classifier confuses particular categories.
The functions that compute these metrics are independent of the classification model you are using, so you can easily test an SVM, for example.
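A minimal sketch reusing the split from above; SVC here is just one drop-in alternative:

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# same data, different model; the metric calls do not change
svm = SVC().fit(X_train, expected_train)
y_svm = svm.predict(X_test)

print(accuracy_score(expected, y_svm))
print(classification_report(expected, y_svm))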
You should use predict(), since this gives the labels of the classified samples; predict_proba() gives the probability of a sample belonging to each category.
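A quick sketch of the difference, using the clf fitted in the example above (the printed values are illustrative):

# predict() gives one label per sample
print(clf.predict(X_test[:3]))        # e.g. [2 1 0]

# predict_proba() gives one probability per class per sample; rows sum to 1
print(clf.predict_proba(X_test[:3]))  # e.g. [[0. 0. 1.] [0. 1. 0.] [1. 0. 0.]]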
I recommend reading the scikit-learn documentation pages for these functions.
The confusion matrix is a way of tabulating misclassifications, i.e., counting how many predictions ended up in each bin, broken down by the true class.
While sklearn.metrics.confusion_matrix provides a numeric matrix, I find it more useful to generate a 'report' using the following:
import pandas as pd

# toy true and predicted labels for three classes
y_true = pd.Series([2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2])
y_pred = pd.Series([0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2])

# cross-tabulate true vs. predicted labels, with row and column totals
pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)
which results in:
Predicted  0  1  2  All
True
0          3  0  0    3
1          0  1  2    3
2          2  1  3    6
All        5  2  5   12
This allows us to see that:
- The diagonal elements show the number of correct classifications for each class: 3, 1 and 3 for the classes 0, 1 and 2.
- The off-diagonal elements provide the misclassifications: for example, 2 samples of class 2 were misclassified as 0, none of class 0 were misclassified as 2, etc.
- The total number of classifications for each class in both y_true and y_pred, from the "All" subtotals.
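As a quick sanity check, the diagonal of the un-margined crosstab divided by the total recovers the accuracy; a small sketch:

import numpy as np

ct = pd.crosstab(y_true, y_pred)              # no margins this time
print(np.trace(ct.values) / ct.values.sum())  # 7/12, matching accuracy_score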
This method also works for text labels, and for a dataset with a large number of samples it can be extended to give percentage reports.
import numpy as np
import pandas as pd

# create some data (np.random.random_integers is deprecated; randint's upper bound is exclusive)
lookup = {0: 'biscuit', 1: 'candy', 2: 'chocolate', 3: 'praline', 4: 'cake', 5: 'shortbread'}
y_true = pd.Series([lookup[_] for _ in np.random.randint(0, 6, size=100)])
y_pred = pd.Series([lookup[_] for _ in np.random.randint(0, 6, size=100)])

# apply works column-wise by default, so each Predicted column is scaled to sum to 100%
pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted']).apply(lambda c: 100.0 * c / c.sum())
The output then is:
Predicted    biscuit  cake      candy  chocolate    praline  shortbread
True
biscuit    23.529412    10  23.076923  13.333333  15.384615    9.090909
cake       17.647059    20   0.000000  26.666667  15.384615   18.181818
candy      11.764706    20  23.076923  13.333333  23.076923   31.818182
chocolate  11.764706     5  15.384615   6.666667  15.384615   13.636364
praline    17.647059    10  30.769231  20.000000   0.000000   13.636364
shortbread 17.647059    35   7.692308  20.000000  30.769231   13.636364
where the numbers now represent the percentage of cases (rather than the raw counts); since apply operates column-wise by default, each Predicted column sums to 100%.
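If you would rather have each row sum to 100% (i.e., how each true class was distributed over the predictions), pass axis=1 to apply; a small variation on the sketch above:

# normalize each row instead of each column
pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted']).apply(lambda r: 100.0 * r / r.sum(), axis=1)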
Note, though, that the output of sklearn.metrics.confusion_matrix can be visualized directly:
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

conf = confusion_matrix(y_true, y_pred)
plt.imshow(conf, cmap='binary', interpolation='none')
plt.show()
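To make the plot easier to read you can label the axes with the class names; a sketch, assuming the text labels from the example above (confusion_matrix orders string labels alphabetically, so sorted() matches its rows and columns):

labels = sorted(lookup.values())  # alphabetical, matching confusion_matrix's ordering
plt.imshow(conf, cmap='binary', interpolation='none')
plt.xticks(range(len(labels)), labels, rotation=90)
plt.yticks(range(len(labels)), labels)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.colorbar()
plt.show()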
You could use pseudo R-squared measures, such as Nagelkerke's R-squared; for an overview see:
http://www.ats.ucla.edu/stat/mult_pkg/faq/general/Psuedo_RSquareds.htm
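As a sketch of how you might compute a couple of these in Python (the data here is random and purely illustrative; llf, llnull and prsquared are attributes of a fitted statsmodels Logit result):

import numpy as np
import statsmodels.api as sm

# random data, purely for illustration
X = sm.add_constant(np.random.rand(100, 2))
y = (np.random.rand(100) > 0.5).astype(int)
res = sm.Logit(y, X).fit(disp=0)

n = len(y)
cox_snell = 1 - np.exp(2 * (res.llnull - res.llf) / n)   # Cox & Snell R-squared
nagelkerke = cox_snell / (1 - np.exp(2 * res.llnull / n))  # rescaled to max out at 1

print(res.prsquared)  # McFadden's pseudo R-squared, built into statsmodels
print(nagelkerke)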