Solved – Summing predicted probabilities from logistic regression using ‘one vs. rest’

logistic, multi-class, probability, python, scikit-learn

I have a multiclass classification problem that I have solved using a 'one vs. rest' approach via binary logistic regression classifiers from Python's scikit-learn package. In my problem, there are 3 classes for which I am trying to predict probabilities. I have trained 3 logistic regression classifiers (i.e. 1 for each of the classes), but how should one combine the probabilities from the independent classifiers?
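For reference, one way to set up this kind of 'one vs. rest' model in scikit-learn is the OneVsRestClassifier wrapper around LogisticRegression; the sketch below uses placeholder data, not the actual data from this problem:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# Placeholder data: 3 classes, 4 input features (mirroring the problem above).
X, y = make_classification(n_samples=300, n_features=4, n_informative=4,
                           n_redundant=0, n_classes=3, random_state=0)

# One binary logistic regression classifier per class ("one vs. rest").
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# Each row of predict_proba holds one sample's class probabilities and sums to 1,
# even though the underlying binary classifiers are trained independently.
print(ovr.predict_proba(X[:3]))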

Based on my reading of the literature, there is no reason for the probabilities from independent binary logistic regression classifiers to sum to 1 in a multiclass problem. Yet when I use my 3 trained classifiers, their summed probabilities always add up to exactly 1.

After further investigation, I extracted the coefficients and obtained the following parameters (4 weights for the 4 input features, plus a bias term, for each of my 3 logistic regression classifiers):

weights = array([[ 0.11948853, -3.2523997 ,  1.81306023, -0.29211884],
       [ 1.16800984,  2.32887278, -1.72453382, -1.17726167],
       [-2.39495012,  1.6836802 , -0.83319772,  1.86957419]])

intercepts = array([ 2.79030646, -3.89132394, -4.53314668])

I computed the probabilities myself by plugging an example set of input features (which correspond to different inputs at different times) into a sigmoid, and arrived at the following predicted probabilities from my classifiers:

[image: manually computed predicted probabilities from the 3 classifiers]
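As a sketch of that manual calculation, using the coefficients above and a hypothetical feature vector x (not one of the actual inputs):

import numpy as np

weights = np.array([[ 0.11948853, -3.2523997 ,  1.81306023, -0.29211884],
                    [ 1.16800984,  2.32887278, -1.72453382, -1.17726167],
                    [-2.39495012,  1.6836802 , -0.83319772,  1.86957419]])
intercepts = np.array([ 2.79030646, -3.89132394, -4.53314668])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 0.25, 2.0])   # hypothetical input features

# One independent probability per binary classifier;
# nothing forces these three values to sum to 1.
p = sigmoid(weights @ x + intercepts)
print(p, p.sum())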

Clearly, the probabilities do not add up to 1. Yet, when I instead use the predict_proba routine from scikit-learn, I get the following output:

[image: predicted probabilities returned by predict_proba]

The probabilities now do add up to 1, as noted above. Some normalization has evidently been applied, but does anyone know what normalization procedure scikit-learn uses here? I've tried several approaches to normalize the probabilities myself (e.g. dividing the individual probabilities by the squared sum, applying a softmax to either the probabilities or the linear predictors), and although the results are close, they do not exactly match the output of predict_proba. Any ideas on what exactly scikit-learn is doing here to combine the probabilities?

Best Answer

Scikit-learn normalizes the output probabilities; see the OneVsRestClassifier.predict_proba source code at line 317. Note that it's just a linear normalization (each row is divided by its sum), not a softmax or anything similar.

def predict_proba(self, X):
    """Probability estimates.
    The returned estimates for all classes are ordered by label of classes.
    Note that in the multilabel case, each sample can have any number of
    labels. This returns the marginal probability that the given sample has
    the label in question. For example, it is entirely consistent that two
    labels both have a 90% probability of applying to a given sample.
    In the single label multiclass case, the rows of the returned matrix
    sum to 1.
    Parameters
    ----------
    X : array-like, shape = [n_samples, n_features]
    Returns
    -------
    T : (sparse) array-like, shape = [n_samples, n_classes]
        Returns the probability of the sample for each class in the model,
        where classes are ordered as they are in `self.classes_`.
    """
    check_is_fitted(self, 'estimators_')
    # Y[i, j] gives the probability that sample i has the label j.
    # In the multi-label case, these are not disjoint.
    Y = np.array([e.predict_proba(X)[:, 1] for e in self.estimators_]).T

    if len(self.estimators_) == 1:
        # Only one estimator, but we still want to return probabilities
        # for two classes.
        Y = np.concatenate(((1 - Y), Y), axis=1)

    if not self.multilabel_:
        # Then, probabilities should be normalized to 1.
        Y /= np.sum(Y, axis=1)[:, np.newaxis]
    return Y
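In other words, each per-classifier sigmoid probability is simply divided by the sum across the classifiers. Continuing the hypothetical example from the question (same weights, intercepts and feature vector x), this linear normalization should reproduce what predict_proba returns:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

weights = np.array([[ 0.11948853, -3.2523997 ,  1.81306023, -0.29211884],
                    [ 1.16800984,  2.32887278, -1.72453382, -1.17726167],
                    [-2.39495012,  1.6836802 , -0.83319772,  1.86957419]])
intercepts = np.array([ 2.79030646, -3.89132394, -4.53314668])

x = np.array([0.5, -1.0, 0.25, 2.0])   # hypothetical input, as in the sketch above

# Raw per-classifier probabilities (do not sum to 1 on their own).
p = sigmoid(weights @ x + intercepts)

# The same normalization as in the snippet above: divide by the row sum.
p_norm = p / p.sum()
print(p_norm, p_norm.sum())   # the normalized probabilities sum to 1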