[Math] Derivation of softmax function

probabilityprobability distributions

I'm reading Bishop's book on Pattern Recognition and machine learning and I wanted to reproduce a calculation for the softmax function, also known as normalized exponential. Basically, the calculation requires to get the multinomial distribution into its form as a member of the exponential family:

$$p(x|\eta) = h(x)g(\eta)\exp{\{\eta^{T}u(x)\}}$$

Starting from $\exp{\{\sum_{k=1}^{M}x_{k}\ln{\mu_{k}}\}}$ and after a few steps, we recognize that $\eta_{k}$ is given by:

$$\ln{\left[{\frac{\mu_{k}}{1-\sum_{j}^{M-1}{\mu_{j}}} }\right]} = \eta_{k}$$

then it says:

which we can solve for $\mu_{k}$ by first summing both sides over $k$ and then rearranging and back-substituting to give:

$$\mu_{k}=\frac{\exp{\{\eta_{k}\}}}{1+\sum_{j}\exp{\{\eta_{j}\}}}$$

But that's not what I get. Instead, I obtained (assuming $\sum_{k}\mu_{k}=1$)

$$\mu_{k}=\frac{\exp{\{\eta_{k}\}}}{\sum_{j}\exp{\{\eta_{j}\}}}$$

Wikipedia seems to agree with my answer but I'd like to get a confirmation or correct the derivation I did.

Best Answer

The question was posted a long time ago but it could be useful for anyone else who is working through Bishop's book to note that both forms of the softmax function are equivalent since \begin{equation}1+\sum_{j=1}^{M-1}\exp{\{\eta_{j}\}}={\sum_{j=1}^M\exp{\{\eta_{j}\}}}\end{equation}

Best Answer

Related Solutions

[Math] Change of variable in a probability density function

[Math] the motivation for using cross-entropy to compare two probability vectors

Related Question