Softmax Unit Derivation – How It Works and Its Implications

neural-networks · probability · softmax

I'm trying to understand why the softmax function is defined as such:

$\sigma(z)_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}$

I understand how this normalizes the inputs and maps them to the range (0, 1), but the difference between output probabilities varies exponentially with the inputs rather than linearly. Is there a reason why we want this behaviour?
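To make the question concrete, here is a minimal sketch (assuming NumPy) comparing softmax's exponential weighting against a plain linear normalization of the same scores; the values chosen are arbitrary:

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability; the result is unchanged.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([1.0, 2.0, 3.0])
print(softmax(z))   # exponential weighting: the largest score dominates
print(z / z.sum())  # linear normalization of the same scores, for comparison
```

With softmax, the score of 3 receives roughly two thirds of the probability mass, whereas linear normalization would give it only half.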

Also, this equation seems rather arbitrary, and I feel that a large family of equations could satisfy our requirements. I have not seen any derivations online, so I'm assuming it is merely a definition. Why not choose any other definition that satisfies the same requirements?

Best Answer

The categorical distribution is the minimum assumptive distribution over the support of "a finite set of mutually exclusive outcomes" given the sufficient statistic of "which outcome happened". In other words, using any other distribution would be an additional assumption. Without any prior knowledge, you must assume a categorical distribution for this support and sufficient statistic. It is an exponential family. (All minimum assumptive distributions for a given support and sufficient statistic are exponential families.)

The correct way to combine two beliefs based on independent information is the pointwise product of their densities, making sure not to double-count prior information that's in both beliefs. For an exponential family, this combination is addition of natural parameters.
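This equivalence can be checked numerically. Below is a minimal sketch (assuming NumPy) where two beliefs over the same outcomes, parametrized by logits, are combined both ways; the logit values are made up for illustration:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Two independent beliefs over the same K outcomes, as natural parameters (logits).
logits_a = np.array([0.2, 1.0, -0.5])
logits_b = np.array([1.5, -0.3, 0.1])

# Combining by pointwise product of densities, then renormalizing ...
combined_density = softmax(logits_a) * softmax(logits_b)
combined_density /= combined_density.sum()

# ... gives the same distribution as adding natural parameters.
combined_natural = softmax(logits_a + logits_b)

print(np.allclose(combined_density, combined_natural))  # True
```

This works because $\sigma(a)_j \, \sigma(b)_j \propto e^{a_j} e^{b_j} = e^{a_j + b_j}$, and the normalizers cancel on renormalization.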

The expectation parameters are the expected values of $x_k$ where $x_k$ are the number of times you observed outcome $k$. This is the right parametrization for converting a set of observations to a maximum likelihood distribution. You simply average in this space. This is what you want when you are modeling observations.
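A tiny sketch of "averaging in expectation space" (assuming NumPy, with made-up observations): each observation is a one-hot vector $x$, and the maximum-likelihood categorical distribution is simply their mean.

```python
import numpy as np

# Observed outcomes for K = 3 categories, encoded as one-hot vectors x.
observations = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
])

# Maximum-likelihood estimate: average in the expectation parametrization.
p_mle = observations.mean(axis=0)
print(p_mle)  # [0.25, 0.5, 0.25], i.e. the empirical frequencies
```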

The multinomial logistic function is the conversion from natural parameters to expectation parameters of the categorical distribution. You can derive this conversion as the gradient of the log-normalizer with respect to natural parameters.
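This last step can also be verified numerically: the log-normalizer of the categorical distribution is the log-sum-exp of the natural parameters, and its gradient is the softmax. A minimal sketch (assuming NumPy, using a central finite difference in place of a symbolic gradient):

```python
import numpy as np

def log_normalizer(z):
    # Log-partition function of the categorical distribution: log sum_k e^{z_k},
    # computed with the usual max-shift for stability.
    m = np.max(z)
    return m + np.log(np.exp(z - m).sum())

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([0.5, -1.0, 2.0])  # arbitrary natural parameters

# Numerical gradient of the log-normalizer with respect to z ...
eps = 1e-6
grad = np.array([
    (log_normalizer(z + eps * np.eye(3)[i]) -
     log_normalizer(z - eps * np.eye(3)[i])) / (2 * eps)
    for i in range(3)
])

# ... matches the softmax, i.e. the expectation parameters.
print(np.allclose(grad, softmax(z), atol=1e-6))  # True
```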

In summary, the multinomial logistic function falls out of three assumptions: a support, a sufficient statistic, and a model whose belief is a combination of independent pieces of information.