Solved – What induces degeneracy in the solution space of logistic regression, with respect to the energy landscape of the logistic loss, and why

classification, logistic, machine-learning, optimization

I was trying to understand why logistic regression (with a predictor of the form $h_{w}(x) = \text{sigmoid}(y\, w^\top x)$) can have many (equivalent) solutions $w$ with respect to the training set. What interests me most are the conditions for degeneracy in the energy landscape: what properties does the problem need (the data, the model complexity, the shape of the true distribution, etc.) for there to be more than one minimizer (or more than one limiting minimizer)? Thus, what matters is the set of parameters that achieves the minimum loss (not zero classification error). There may be many such parameters, and the conditions under which this happens are what I am interested in.

Recall that the loss function (for the two-class case; otherwise use softmax with the cross-entropy loss) is:

$$ J(w; X,Y) = \frac{1}{N} \sum^N_{n=1} \ln\!\left( 1 + e^{-y^{(n)} w^\top x^{(n)}} \right)$$

In this problem, I am wondering what induces having many solutions (or not).
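To make this concrete, here is a minimal sketch (my own, with labels assumed to be $y^{(n)} \in \{-1, +1\}$ and an illustrative helper name) that evaluates this average loss and shows that, on a tiny separable example, scaling $w$ up keeps shrinking it:

```python
import numpy as np

def logistic_loss(w, X, y):
    """Average logistic loss (1/N) sum_n log(1 + exp(-y_n w^T x_n)) for labels y in {-1, +1}."""
    margins = y * (X @ w)                        # y^(n) * w^T x^(n) for every example
    return np.mean(np.logaddexp(0.0, -margins))  # numerically stable log(1 + e^{-margin})

# Tiny illustration: two 1-D points on either side of the origin.
X = np.array([[1.0], [-1.0]])
y = np.array([1.0, -1.0])
print(logistic_loss(np.array([1.0]), X, y))    # ~0.313
print(logistic_loss(np.array([10.0]), X, y))   # much smaller: scaling w up keeps shrinking the loss
```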

My hypotheses are as follows:

  1. The number of data points matters. Only when we have $N_{train} = \infty$ can we have a unique solution come out of logistic regression. This is obvious because then we essentially know the true hyperplane. If we lack data, my intuition tells me we need some other sort of regularizer to choose a unique solution. The funny thing is that, because of the log, the error keeps going down forever, so perhaps a better way to say "unique" solution is to say unique in the limit.
  2. Even Bishop's book mentions that even if $N > D$ (the over-determined case), it is possible that there are infinitely many different solutions (but the problem remains convex!?).
  3. When the model is "too complex" there is an infinite set of solutions. For example, if the data lies in $\mathbb{R}^{D^*}$ but the model has more features, say it lies in $\mathbb{R}^{D}$ where $D > D^*$, then things are problematic. This is because the coefficients on the extra features can be changed so that their contributions sum to zero without actually changing anything.
  4. In logistic regression we only care about the sign of $w^\top x$ for classification, so we can always scale our solution $\hat w$ arbitrarily without changing the decision boundary. This seems to always be true no matter what… so maybe there is always an infinite number of solutions in logistic regression? (See the sketch after this list.)
  5. Is there ever a unique minimizer in logistic regression? The fact that the loss is always decreasing makes me think not… (unless a limit exists?)
  6. Assume separability. Even in this case, there could be an infinite number of solutions unless we satisfy my first point, $N_{train} = \infty$.
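To illustrate hypotheses 4 and 6, here is a minimal sketch (the data are purely illustrative): scaling $w$ never changes the predicted signs, but it does change the loss, and only on separable data does scaling up always help.

```python
import numpy as np

def logistic_loss(w, X, y):
    """Average logistic loss for labels y in {-1, +1}."""
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y_separable = np.sign(X[:, 0])        # perfectly separable by w* = (1, 0)
y_noisy = y_separable.copy()
y_noisy[0] *= -1                      # flip one label: no longer separable

w = np.array([1.0, 0.0])
for t in [1, 10, 100, 1000]:
    # t * w has the same decision boundary (same signs of w^T x) for every t > 0 ...
    print(t,
          logistic_loss(t * w, X, y_separable),  # ... but here the loss keeps decreasing toward 0
          logistic_loss(t * w, X, y_noisy))      # ... while here it eventually increases again
```

So scale invariance of the decision boundary does not by itself make every scaled copy a minimizer; that only happens in the separable case, and then only in the limit.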

So my question is: when (and why) does logistic regression have an infinite set of solutions, and what conditions are needed for that?


The reason the linked questions are not helpful (nor Wikipedia, which as a matter of fact does list conditions under which the model may not reach convergence, which is probably related to my question) is that I don't just want to know what induces degeneracy, but also why.

For example, if the data is completely separable, how does that induce a degenerate solution? In linear regression it's clear that degeneracy is induced by a null space, so any solution can be expressed as

$$ \hat w = w_{particular} + \alpha\, w_{nullspace}$$

which makes it easy to understand where the flatness of the landscape comes from. For logistic regression this isn't clear. How does, say, the rank of the data matrix directly affect the solution of the logistic regression model?
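As a sanity check of that linear-regression picture, here is a minimal sketch (with an illustratively constructed rank-deficient design): adding any multiple of a nullspace vector to a least-squares solution leaves the fit, and hence the loss, unchanged.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
X = np.hstack([X, X[:, [0]] + X[:, [1]]])      # 4th column = col0 + col1, so rank(X) = 3
y = rng.normal(size=20)

w_particular, *_ = np.linalg.lstsq(X, y, rcond=None)
w_nullspace = np.array([1.0, 1.0, 0.0, -1.0])  # X @ w_nullspace == 0 by construction

for alpha in [0.0, 5.0, -3.0]:
    w_hat = w_particular + alpha * w_nullspace
    print(alpha, np.sum((X @ w_hat - y) ** 2))  # identical residual for every alpha
```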


As I was reading Bishop section 4.3.2 I found the following statement:

Note that the problem will arise even if the number of data points is large compared with the number of parameters in the model, so long as the training data is linearly separable.

This suggests that the number of data points by itself doesn't play the important role. A good answer should explain this too.

Best Answer

I think that your question hinges on two concepts.

1. Convexity

It can be simpler to consider the logistic regression model as the composition of two steps. Suppose that $y\in\{0,1\}$ are the outcomes we'd like to model as a linear function of some features $X$.

\begin{align} z &= X\beta \\ p &= \text{logit}^{-1}(z) \\ y &\sim \text{Bernoulli}(p) \end{align}

From the definition of the nullspace, if the nullspace of $X$ is nontrivial, then one can choose some pair $\beta^\prime \neq \beta$ for which $X\beta = X\beta^\prime$.

Since the product is the same, $z = X\beta = X\beta^\prime$ is the same for either choice of coefficients.

Since $z$ is the same, either choice of coefficients has the same success probability $p$.

Since the success probability is the same, the likelihood (or cross-entropy loss, if you prefer) is the same for either choice of coefficients.

This implies that the optimization problem is not uniquely optimized when the nullspace of $X$ is nontrivial.
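A minimal numerical sketch of this argument (the redundant column and the specific coefficients are only illustrative): two coefficient vectors that differ by a nullspace direction produce the same $z$, hence the same $p$ and the same loss.

```python
import numpy as np

def bce_loss(beta, X, y):
    """Binary cross-entropy for labels y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))     # p = logit^{-1}(z) with z = X beta
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
X = np.hstack([X, X.sum(axis=1, keepdims=True)])       # redundant 3rd column -> nontrivial nullspace
y = (rng.uniform(size=100) < 0.5).astype(float)

beta = np.array([0.3, -0.7, 0.5])
beta_prime = beta + 2.0 * np.array([1.0, 1.0, -1.0])   # differs by a nullspace direction

print(np.allclose(X @ beta, X @ beta_prime))             # True: same z
print(bce_loss(beta, X, y), bce_loss(beta_prime, X, y))  # identical losses
```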

2. Separability

Consider the binary cross-entropy loss. \begin{align} \mathcal{L}(\theta) &= -\frac{1}{n}\sum_{i=1}^n \left[y_i \log(p_i) + (1-y_i) \log(1-p_i)\right] \end{align}

For a single observation with $y_j = 1$, the loss will get smaller as $p_j$ gets closer to $1$, which means $z_j$ will have to get larger, implying that $\beta$ is moving in some direction to improve the model fit. For $y_k = 0$, the loss will get smaller as $p_k$ gets closer to $0$, so $z_k$ will have to get smaller. (The logistic function has horizontal asymptotes at $0$ and $1$, so $p$ never exactly reaches either one, and the loss never reaches $0$, for $\|\beta\| < \infty$.)

However, when separability is present, there is a feature (or linear combination of features) whose coefficient you can grow to do both of these things at the same time.

These are some examples.

  1. The simplest form is that a feature takes values $\pm 1$, and all positive values correspond to the positive class and all negative values to the negative class. Making its corresponding $\beta_i$ larger will always move $z$ in the "right" direction for all examples.

  2. We can generalize the simple binary example to the case of a real-valued feature, if $x > 0$ corresponds to $y = 1$ and $x < 0$ to $y = 0$. It should be clear now that the same pathological behavior is present, since making the coefficient larger will always move $z$ in the "right" direction for both classes.

  3. Finally, we can generalize to linear combinations of features. Suppose you're fitting a model with an intercept and a real-valued feature. If $x > c$ has $y=1$ and $x < c$ has $y=0$, choose the intercept to be $-c$ times the slope coefficient, so that $z = \beta_1(x - c)$. In this case, we're back in the same situation as (2).

When separability is not present, this is not possible. Suppose that instead of $y$ always "matching" the feature, the label and feature "match" in all but one instance. In that case, making $\beta$ larger will not always improve the loss, and the optimal $\beta$ is finite.
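Here is a minimal sketch of this contrast (illustrative data along the lines of example 3, fit with scipy's BFGS; the helper names are mine): with separable data the optimizer keeps shrinking the loss toward $0$ by growing $\|\beta\|$, while flipping a single label yields a finite minimizer with a strictly positive loss.

```python
import numpy as np
from scipy.optimize import minimize

def bce_loss(beta, X, y):
    """Binary cross-entropy for labels y in {0, 1}, in a numerically stable form."""
    z = X @ beta
    return np.mean(np.logaddexp(0.0, -z) + (1 - y) * z)

rng = np.random.default_rng(3)
x = rng.normal(size=200)
X = np.column_stack([np.ones_like(x), x])    # intercept + one real-valued feature
c = 0.5
y_sep = (x > c).astype(float)                # separated at x = c, as in example 3
y_flip = y_sep.copy()
y_flip[0] = 1.0 - y_flip[0]                  # flip one label: no longer separable

for y in (y_sep, y_flip):
    res = minimize(bce_loss, x0=np.zeros(2), args=(X, y), method="BFGS")
    print(np.linalg.norm(res.x), bce_loss(res.x, X, y))
# Separable case: the optimizer stops only when the numerical gradient vanishes, with a
# large coefficient norm and a loss near 0 (in exact arithmetic the norm would grow
# forever; scipy may warn about precision loss, which is that divergence showing up).
# Non-separable case: a genuinely finite minimizer with a strictly positive loss.
```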

Naturally, placing a constraint on $\| \beta \|$ implies that it cannot grow without bound. Ridge, lasso and elastic net are common choices for penalizing $\beta$ and preventing it from growing arbitrarily large.
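For instance, here is a minimal sketch using scikit-learn, whose `LogisticRegression` applies an L2 (ridge) penalty by default with strength $1/C$: even on perfectly separable data the fitted coefficient stays finite, and it grows only as the penalty is weakened (the data here are illustrative).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 1))
y = (X[:, 0] > 0).astype(int)              # perfectly separable

for C in [0.1, 1.0, 100.0, 1e6]:
    clf = LogisticRegression(C=C, max_iter=1000).fit(X, y)
    print(C, np.abs(clf.coef_).max())      # coefficient grows as the penalty weakens, yet stays finite
```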

3. Convexity & Separability

It's possible that one, both, or neither of these effects is present in any particular problem. Moreover, when the loss is strictly convex in $\beta$ (for example, when $X$ has full column rank) and the data are not separable, logistic regression has a unique, finite minimizer.
