Solved – classification problem in sklearn


I have some experience with building classification models in R but it's my first time with Python's sklearn.

So the problem is: when fitting a logistic regression model in R, I have immediate access to the predicted class probabilities (model$fitted.values), so I can set my own threshold (other than 0.5) in order to maximize some measure.

But in sklearn, after fitting, I can't find a way to access the probabilities. Is it possible? There is a method predict_proba(), but, as the name suggests, it is meant for prediction. So, in order to get the fitted probabilities, should I 'artificially' do the following?

import sklearn.linear_model

model = sklearn.linear_model.LogisticRegression()
model.fit(train_X, train_y)
# class probabilities for the same data the model was fitted on
probs = model.predict_proba(train_X)

Does this make sense, or is there a different way to obtain them?

Best Answer

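Yes, predict_proba is the right method for getting the estimated probabilities. Calling it on the training data returns the fitted probabilities for those samples, which is what model$fitted.values gives you in R. To make this a bit clearer, consider the example given here (http://scikit-learn.org/stable/auto_examples/linear_model/plot_iris_logistic.html).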

Below I copy just the relevant lines, so that anyone can paste them and reproduce what follows:

import numpy as np
from sklearn import linear_model, datasets
iris = datasets.load_iris()
X = iris.data[:, :2]  # we only take the first two features.
Y = iris.target
h = .02  # step size in the mesh
logreg = linear_model.LogisticRegression(C=1e5)
logreg.fit(X, Y)

The iris dataset has three classes, stored in Y as 0, 1, 2 (corresponding to 'setosa', 'versicolor', 'virginica'). When you call

logreg.predict_proba(X)

you get a 2D array of shape (n_samples, n_classes). Each row corresponds to one sample x, and each entry is P(class|x), with the columns ordered as in logreg.classes_:

array([[  9.05823905e-01,   6.81672013e-02,   2.60088939e-02],
       [  7.64631786e-01,   2.16376590e-01,   1.89916235e-02],
       [  8.46908157e-01,   1.42190177e-01,   1.09016662e-02],
       [  8.15654921e-01,   1.75608861e-01,   8.73621791e-03],
       ...
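
To make the layout of that array concrete, here is a minimal sketch that refits the same model and inspects the result. The max_iter=1000 argument is my own addition (an assumption, not part of the linked example) to give the solver more iterations; the names probs and row_sums are just illustrative.

```python
import numpy as np
from sklearn import datasets, linear_model

iris = datasets.load_iris()
X = iris.data[:, :2]  # first two features, as in the example above
Y = iris.target

# max_iter is an assumption added here; the linked example does not set it
logreg = linear_model.LogisticRegression(C=1e5, max_iter=1000)
logreg.fit(X, Y)

probs = logreg.predict_proba(X)

print(probs.shape)      # one row per sample, one column per class
print(logreg.classes_)  # class label that each column refers to

# each row is a probability distribution over the classes
row_sums = probs.sum(axis=1)
print(np.allclose(row_sums, 1.0))
```

The column order always follows logreg.classes_, so it is safer to look up a class's column there than to assume a position.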

The alternatives are predict_log_proba(), which returns log P(class|x), and predict(), which returns the class with the highest probability for each sample.
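
Coming back to the original question about choosing a threshold other than 0.5: below is a rough sketch of how that could look with predict_proba. Everything specific here is an assumption made for illustration: the binary task (virginica vs. the rest), the use of F1 as the measure to maximize, the threshold grid, and the variable names. Swap in your own data and metric.

```python
import numpy as np
from sklearn import datasets, linear_model
from sklearn.metrics import f1_score

iris = datasets.load_iris()
X = iris.data[:, :2]
# hypothetical binary task for illustration: virginica (1) vs. the rest (0)
y = (iris.target == 2).astype(int)

model = linear_model.LogisticRegression(max_iter=1000)
model.fit(X, y)

# find the column that holds P(y = 1 | x) instead of assuming it is column 1
pos_col = list(model.classes_).index(1)
p_pos = model.predict_proba(X)[:, pos_col]

# evaluate the chosen measure (F1 here) over a grid of candidate thresholds
thresholds = np.linspace(0.05, 0.95, 19)
scores = [
    f1_score(y, (p_pos >= t).astype(int), zero_division=0)
    for t in thresholds
]

best_threshold = thresholds[int(np.argmax(scores))]
custom_pred = (p_pos >= best_threshold).astype(int)

print(best_threshold, max(scores))
```

Note that picking the threshold on the same data used for fitting gives an optimistic estimate; in practice you would probably tune it on a held-out set or with cross-validation.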