Solved – Softmax maximum likelihood problem: arbitrary constant

logistic, maximum-likelihood, multinomial-distribution

I'm doing multiclass classification with a softmax function. The probability of sample $j$ belonging to class $k$ is given by the softmax:

$p_k(\mathbf{x}_j;\mathbf{w}_1,\mathbf{w}_2,\dots,\mathbf{w}_K)=\dfrac{\exp[f(\mathbf{x}_j,\mathbf{w}_k)]}{\sum_{l=1}^{K}\exp[f(\mathbf{x}_j,\mathbf{w}_l)]}$
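For concreteness, here is a minimal NumPy sketch of this probability computation; the helper name `softmax_probs` and the $(N, K)$ array of scores are illustrative assumptions, not fixed by the model:

```python
import numpy as np

def softmax_probs(scores):
    """Turn per-class scores f(x_j, w_k) into class probabilities.

    scores: (N, K) array, one row per sample, one column per class.
    Subtracting the row-wise maximum is the usual numerical-stability
    trick; it is allowed precisely because of the shift invariance
    discussed below.
    """
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)
```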

Typically I use a linear model for $f$, so that

$f(\mathbf{x},\mathbf{w}) = w_0 + w_1x_1 + \dots + w_Dx_D$ (with $D$ the number of features),

but there are other options for the choice of $f$, such as polynomials, feed-forward neural networks, or radial basis functions.

The optimal values of the parameters $\mathbf{w}$ are found by maximum likelihood estimation. This amounts to minimizing the negative log-likelihood with respect to $\mathbf{w}$:

$L = -\sum\limits_{j=1}^{N}\sum\limits_{k=1}^{K} \delta(c_j - k)\log p_k(\mathbf{x}_j;\mathbf{w}_1,\mathbf{w}_2,\dots,\mathbf{w}_K),$

where the first sum is over training samples and the second sum is over classes; $\delta$ is the Kronecker delta (equal to one when $c_j = k$), $c_j$ is the class label of sample $j$, and $\mathbf{x}_j$ is the vector of features describing sample $j$.
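As a sketch, the same cost in NumPy, assuming integer class labels `c` in `0..K-1`; the function name and shapes are illustrative:

```python
import numpy as np

def negative_log_likelihood(scores, c):
    """scores: (N, K) array of f(x_j, w_k); c: (N,) integer class labels.

    The Kronecker delta in the double sum simply selects the probability
    of each sample's true class, so the sum reduces to indexing.
    """
    shifted = scores - scores.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    return -np.log(probs[np.arange(len(c)), c]).sum()
```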

The problem is that the model $f$ may contain an arbitrary additive constant,

$f \rightarrow f' = f + \mathrm{const},$

which cancels in the fraction for the probability. This makes the solution of the optimization problem non-unique.
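A quick numerical check of this cancellation, with arbitrary illustrative scores:

```python
import numpy as np

scores = np.array([1.0, -0.5, 2.0])                 # f(x_j, w_k) for K = 3 classes
shifted = scores + 7.3                              # same constant added to every class
p1 = np.exp(scores)  / np.exp(scores).sum()
p2 = np.exp(shifted) / np.exp(shifted).sum()
print(np.allclose(p1, p2))                          # True: the constant cancels
```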

I've implemented gradient descent optimization for the linear model and found that, depending on the initial estimate, I end up with completely different values for the model parameters, although of course the estimated probabilities are identical.

Is there any way to modify the cost function or the model so that the solution of the optimization problem becomes unique?

Best Answer

One simple way to see what you've described is to compare logistic regression with the equivalent two-class multinomial regression:

In the case of logistic regression, your $f(x, w)$ has one output; let's call it $\hat{y}$.

$$ p(Y=\text{true}) = \frac{1}{1 + e^{-\hat{y}}} $$
$$ p(Y=\text{false}) = 1 - p(Y=\text{true}) = \frac{e^{-\hat{y}}}{1 + e^{-\hat{y}}} $$
$$ \log(\text{odds}) = \log{\frac{p(Y=\text{true})}{p(Y=\text{false})}} = \log{e^{\hat{y}}} = \hat{y} $$

In the case of two-class multinomial regression, your $f(x, w)$ now has two outputs; let's call them $\hat{y}_1$ and $\hat{y}_2$.

$$ p(Y=\text{true}) = \frac{e^{\hat{y}_1}}{e^{\hat{y}_1} + e^{\hat{y}_2}} $$
$$ p(Y=\text{false}) = \frac{e^{\hat{y}_2}}{e^{\hat{y}_1} + e^{\hat{y}_2}} $$
$$ \log(\text{odds}) = \log{\frac{p(Y=\text{true})}{p(Y=\text{false})}} = \log{e^{\hat{y}_1 - \hat{y}_2}} = \hat{y}_1 - \hat{y}_2 $$

In this case, the log-odds of $Y$ are represented by the difference of the two outputs, $\hat{y}_1 - \hat{y}_2$. The system is under-constrained, since you get the same probabilities by adding the same constant to (or subtracting it from) both $\hat{y}_1$ and $\hat{y}_2$.
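A small numerical check with illustrative values: shifting both outputs by any constant $c$ leaves the probability unchanged and equal to a sigmoid of the difference.

```python
import numpy as np

y1, y2, c = 0.8, -1.3, 5.0                          # illustrative outputs and shift
p_two_output = np.exp(y1 + c) / (np.exp(y1 + c) + np.exp(y2 + c))
p_logistic   = 1.0 / (1.0 + np.exp(-(y1 - y2)))     # sigmoid of the difference
print(np.isclose(p_two_output, p_logistic))         # True for any c
```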

It's possible to fix, for example, $\hat{y}_1 = 0$, so that all other classes are constrained relative to it. But in practice the problem goes away once you add a prior on your $f$ under which a single parameterization (i.e. the one nearest $w = 0$) is the most likely a priori.
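As a sketch of that second option, an L2 penalty added to the cost of the linear model plays the role of such a zero-mean Gaussian prior. The function below assumes weights stacked in a $(K, D+1)$ array `W` and features `X` with a leading column of ones for the bias; the names and the value of `lam` are illustrative:

```python
import numpy as np

def regularized_nll(W, X, c, lam=1e-2):
    """Negative log-likelihood of a linear softmax model plus an L2 penalty.

    X: (N, D+1) features with a leading column of ones for the bias.
    W: (K, D+1) weights, one row per class.
    The penalty breaks the shift invariance: among all parameterizations
    giving identical probabilities, the one with the smallest norm now has
    a strictly lower cost, so the minimizer is unique.
    """
    scores = X @ W.T                                        # f(x_j, w_k), shape (N, K)
    shifted = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    nll = -np.log(probs[np.arange(len(c)), c]).sum()
    return nll + lam * np.sum(W ** 2)
```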
