To get the accuracy of the prediction you can do:
print(accuracy_score(expected, y_1))
If you want several metrics at once, such as precision, recall and the F1-score, you can get a classification report:
print(classification_report(expected, y_1))
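For context, here is a minimal, self-contained sketch; the iris data and DecisionTreeClassifier are just stand-ins for your own data and model, and expected/y_1 mirror the variable names above:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report

# stand-in data and model; swap in your own
X, y = load_iris(return_X_y=True)
X_train, X_test, expected_train, expected = train_test_split(X, y, random_state=0)

clf = DecisionTreeClassifier().fit(X_train, expected_train)
y_1 = clf.predict(X_test)

print(accuracy_score(expected, y_1))
print(classification_report(expected, y_1))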
A confusion matrix tells you, for each true label, how many samples were assigned to each predicted label. This shows you whether your classifier confuses particular categories.
The functions that compute these metrics are independent of the classification model you are using, so you can easily test an SVM, for example.
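A minimal sketch reusing the split from above; SVC here is just one drop-in alternative:

from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report

# same data, different model; the metric calls do not change
svm = SVC().fit(X_train, expected_train)
y_svm = svm.predict(X_test)

print(accuracy_score(expected, y_svm))
print(classification_report(expected, y_svm))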
You should use predict(), since this gives the labels of the classified samples; predict_proba() gives the probability of a sample belonging to each category.
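A quick sketch of the difference, using the clf fitted in the example above (the printed values are illustrative):

# predict() gives one label per sample
print(clf.predict(X_test[:3]))        # e.g. [2 1 0]

# predict_proba() gives one probability per class per sample; rows sum to 1
print(clf.predict_proba(X_test[:3]))  # e.g. [[0. 0. 1.] [0. 1. 0.] [1. 0. 0.]]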
I recommend reading the scikit-learn documentation pages for these functions.
The confusion matrix is a way of tabulating misclassifications, i.e., counting how many predictions ended up in each bin, broken down by the true class.
While sklearn.metrics.confusion_matrix provides a numeric matrix, I find it more useful to generate a 'report' using the following:
import pandas as pd

# toy true and predicted labels for three classes
y_true = pd.Series([2, 0, 2, 2, 0, 1, 1, 2, 2, 0, 1, 2])
y_pred = pd.Series([0, 0, 2, 1, 0, 2, 1, 0, 2, 0, 2, 2])

# cross-tabulate true vs. predicted labels, with row and column totals
pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted'], margins=True)
which results in:
Predicted  0  1  2  All
True
0          3  0  0    3
1          0  1  2    3
2          2  1  3    6
All        5  2  5   12
This allows us to see that:
- The diagonal elements show the number of correct classifications for each class: 3, 1 and 3 for the classes 0, 1 and 2.
- The off-diagonal elements provide the misclassifications: for example, 2 samples of class 2 were misclassified as 0, none of class 0 were misclassified as 2, etc.
- The total number of classifications for each class in both y_true and y_pred, from the "All" subtotals.
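As a quick sanity check, the diagonal of the un-margined crosstab divided by the total recovers the accuracy; a small sketch:

import numpy as np

ct = pd.crosstab(y_true, y_pred)              # no margins this time
print(np.trace(ct.values) / ct.values.sum())  # 7/12, matching accuracy_score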
This method also works for text labels, and for a dataset with a large number of samples it can be extended to give percentage reports.
import numpy as np
import pandas as pd

# create some data (np.random.random_integers is deprecated; randint's upper bound is exclusive)
lookup = {0: 'biscuit', 1: 'candy', 2: 'chocolate', 3: 'praline', 4: 'cake', 5: 'shortbread'}
y_true = pd.Series([lookup[_] for _ in np.random.randint(0, 6, size=100)])
y_pred = pd.Series([lookup[_] for _ in np.random.randint(0, 6, size=100)])

# apply works column-wise by default, so each Predicted column is scaled to sum to 100%
pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted']).apply(lambda c: 100.0 * c / c.sum())
The output then is:
Predicted    biscuit  cake      candy  chocolate    praline  shortbread
True
biscuit    23.529412    10  23.076923  13.333333  15.384615    9.090909
cake       17.647059    20   0.000000  26.666667  15.384615   18.181818
candy      11.764706    20  23.076923  13.333333  23.076923   31.818182
chocolate  11.764706     5  15.384615   6.666667  15.384615   13.636364
praline    17.647059    10  30.769231  20.000000   0.000000   13.636364
shortbread 17.647059    35   7.692308  20.000000  30.769231   13.636364
where the numbers now represent the percentage of cases (rather than the raw counts); since apply operates column-wise by default, each Predicted column sums to 100%.
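If you would rather have each row sum to 100% (i.e., how each true class was distributed over the predictions), pass axis=1 to apply; a small variation on the sketch above:

# normalize each row instead of each column
pd.crosstab(y_true, y_pred, rownames=['True'], colnames=['Predicted']).apply(lambda r: 100.0 * r / r.sum(), axis=1)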
Note, though, that the output of sklearn.metrics.confusion_matrix can be visualized directly:
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix

conf = confusion_matrix(y_true, y_pred)
plt.imshow(conf, cmap='binary', interpolation='none')
plt.show()
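To make the plot easier to read you can label the axes with the class names; a sketch, assuming the text labels from the example above (confusion_matrix orders string labels alphabetically, so sorted() matches its rows and columns):

labels = sorted(lookup.values())  # alphabetical, matching confusion_matrix's ordering
plt.imshow(conf, cmap='binary', interpolation='none')
plt.xticks(range(len(labels)), labels, rotation=90)
plt.yticks(range(len(labels)), labels)
plt.xlabel('Predicted')
plt.ylabel('True')
plt.colorbar()
plt.show()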
You could use pseudo R-squared measures, such as Nagelkerke's R-squared; for an overview see:
http://www.ats.ucla.edu/stat/mult_pkg/faq/general/Psuedo_RSquareds.htm
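As a sketch of how you might compute a couple of these in Python (the data here is random and purely illustrative; llf, llnull and prsquared are attributes of a fitted statsmodels Logit result):

import numpy as np
import statsmodels.api as sm

# random data, purely for illustration
X = sm.add_constant(np.random.rand(100, 2))
y = (np.random.rand(100) > 0.5).astype(int)
res = sm.Logit(y, X).fit(disp=0)

n = len(y)
cox_snell = 1 - np.exp(2 * (res.llnull - res.llf) / n)   # Cox & Snell R-squared
nagelkerke = cox_snell / (1 - np.exp(2 * res.llnull / n))  # rescaled to max out at 1

print(res.prsquared)  # McFadden's pseudo R-squared, built into statsmodels
print(nagelkerke)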