Solved – Logistic Regression: Prior correction at test time

logistic, machine-learning, python, scikit-learn

I am using sklearn.linear_model.LogisticRegression for a binary classification problem. My classes are unbalanced: the positive class comprises about 20% of the training set. When fitting the model I use:

# "balanced" (called "auto" in older sklearn versions) weights examples
# inversely to class frequency
logreg = LogisticRegression(class_weight="balanced")
logreg.fit(X_trn, y_trn)

which tells sklearn to give greater weight to the infrequent positive class during training. But now I want to undo this re-balancing at test time, so that predicted probabilities reflect the true class priors. My first intuition is to adjust the logreg.intercept_ attribute of the fitted model. Would this be the correct approach?

Best Answer

For any joint distribution over a binary variable $C$ and a continuous variable $x$:
\begin{align}
p(C_1|x) &= \frac{p(x|C_1)p(C_1)}{p(x)}\\
&= \frac{p(x|C_1)p(C_1)}{p(x|C_1)p(C_1) + p(x|C_2)p(C_2)}\\
&= \frac{1}{1 + \frac{p(x|C_2)p(C_2)}{p(x|C_1)p(C_1)}}\\
&= \frac{1}{1 + \exp\left(\ln\frac{p(x|C_2)p(C_2)}{p(x|C_1)p(C_1)}\right)}\\
&= \frac{1}{1 + \exp\left(-\ln\frac{p(x|C_1)p(C_1)}{p(x|C_2)p(C_2)}\right)}\\
&= \frac{1}{1 + \exp\left(-(w^Tx + b)\right)},
\end{align}
where $C_1$ denotes the event $C=1$ and $C_2$ the event $C=0$. Notice that this is exactly the hypothesis assumed in binary logistic regression. From the above, we have
\begin{equation}
w^Tx + b = \ln\frac{p(x|C_1)p(C_1)}{p(x|C_2)p(C_2)} = \ln\frac{p(x|C_1)}{p(x|C_2)} + \ln\frac{p(C_1)}{p(C_2)}.
\end{equation}
If, during training, we balance the dataset or weight the examples inversely to their class prior probabilities, then effectively $p(C_1) = p(C_2)$, and the above becomes
\begin{equation}
w^Tx + b = \ln\frac{p(x|C_1)}{p(x|C_2)}.
\end{equation}
The parameters $w$ and $b$ are therefore estimated under the assumption that the class priors are equal. We can re-introduce the prior log odds:
\begin{align}
w^Tx + b + \ln\frac{p(C_1)}{p(C_2)} &= \ln\frac{p(x|C_1)}{p(x|C_2)} + \ln\frac{p(C_1)}{p(C_2)}\\
w^Tx + b' &= \ln\frac{p(x|C_1)}{p(x|C_2)} + \ln\frac{p(C_1)}{p(C_2)},
\end{align}
where $b' = b + \ln\frac{p(C_1)}{p(C_2)}$. So yes: a simple adjustment to the bias term re-introduces the unbalanced priors in the test/application setting. A similar argument holds for multi-class logistic regression.
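To make the adjustment concrete, here is a minimal scikit-learn sketch. The synthetic dataset from make_classification stands in for the asker's X_trn and y_trn (those stand-ins are my assumption, not part of the original post), and the priors are estimated from the training labels:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the question's data: positive class is ~20%.
X, y = make_classification(n_samples=5000, weights=[0.8, 0.2], random_state=0)
X_trn, X_tst, y_trn, y_tst = train_test_split(X, y, random_state=0)

# class_weight="balanced" weights examples inversely to class frequency,
# so the fitted intercept b corresponds to equal priors p(C_1) = p(C_2).
logreg = LogisticRegression(class_weight="balanced")
logreg.fit(X_trn, y_trn)

# Empirical class priors from the training labels.
p1 = y_trn.mean()   # p(C_1), the positive-class prior (~0.2 here)
p0 = 1.0 - p1       # p(C_2)

# b' = b + ln(p(C_1)/p(C_2)): shift the bias to re-introduce the priors.
logreg.intercept_ = logreg.intercept_ + np.log(p1 / p0)

# predict_proba now reflects the unbalanced prior at test time.
print(logreg.predict_proba(X_tst[:5])[:, 1])

Note that intercept_ is a length-1 array, so the scalar shift broadcasts cleanly; after the shift, predict_proba returns probabilities consistent with the restored priors rather than with the balanced training objective.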