Solved – an intuitive interpretation for the softmax transformation

Tags: intuition, softmax

A recent question on this site asked about the intuition of softmax regression. This has inspired me to ask a corresponding question about the intuitive meaning of the softmax transformation itself. The general scaled form of the softmax function $\mathbf{S}: \bar{\mathbb{R}}^{n-1} \times \mathbb{R}_+ \rightarrow \Delta^n$ is given by:

$$\mathbf{S}(\mathbf{z}, \lambda) \equiv \Bigg( \frac{1}{1 + \sum_{k=1}^{n-1} \exp(\lambda z_k)}, \frac{\exp(\lambda z_1)}{1 + \sum_{k=1}^{n-1} \exp(\lambda z_k)}, \ \cdots \ , \frac{\exp(\lambda z_{n-1})}{1 + \sum_{k=1}^{n-1} \exp(\lambda z_k)} \Bigg).$$

Is there any intuitive interpretation that describes the mapping from the extended-real vector $\mathbf{z} \in \bar{\mathbb{R}}^{n-1}$ to the corresponding probability vector $\mathbf{p} \in \Delta^n$ in simple terms? That is, is there some explanation that describes the mapping in an intuitive way by means of geometry, analogy, etc.?
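
For concreteness, here is a small numerical sketch of the mapping above (Python/NumPy; the particular $\mathbf{z}$ and $\lambda$ values are arbitrary illustrations, not part of the question):

```python
import numpy as np

def scaled_softmax(z, lam=1.0):
    """Scaled softmax S(z, lambda): maps z in R^(n-1) to a probability
    vector in the n-simplex, with an implicit base category whose
    unnormalized weight is fixed at 1."""
    w = np.concatenate(([1.0], np.exp(lam * np.asarray(z, dtype=float))))
    return w / w.sum()

# n = 4 categories, so z has 3 coordinates; larger lambda sharpens the distribution.
z = np.array([1.0, 2.0, -0.5])
for lam in (0.5, 1.0, 5.0):
    p = scaled_softmax(z, lam)
    print(lam, p.round(3), p.sum().round(6))
```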

Best Answer

Intuition is a funky concept. For an ex-physicist like myself, seeing softmax for the first time was "Ok, this is the Boltzmann distribution." For a statistician it would be "Oh, isn't this mlogit?"

Physicist's intuition

Softmax is literally the canonical ensemble: $$ p_i=\frac{1}{Q} e^{-\varepsilon_i/(kT)}=\frac{e^{-\varepsilon_i/(kT)}}{\sum_{j=1}^{n} e^{-\varepsilon_j/(kT)}}$$ The denominator is called the canonical partition function; it is basically a normalizing constant that makes sure the probabilities add up to 100%. But it has a physical meaning too: the system can only be in one of its $n$ states, and that's why the probabilities must add up. This stuff is straight up from statistical mechanics.
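
A minimal numeric sketch of this (arbitrary units and made-up energies, just to see the normalization at work):

```python
import numpy as np

k = 1.0                                      # Boltzmann constant, arbitrary units
T = 1.0                                      # temperature
energies = np.array([0.1, 0.5, 1.2, 2.0])    # illustrative state energies

weights = np.exp(-energies / (k * T))
Q = weights.sum()                            # canonical partition function (normalizer)
p = weights / Q

print(p.round(3))                            # lowest-energy state gets the highest probability
print(p.sum())                               # probabilities over all states add up to 1
```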

The probability of state $i$ is defined by its energy $\varepsilon_i$ relative to the energies of all other states. You see, in physics systems always try to minimize the energy, so the probability of the state with the lowest energy must be the highest. However, if the temperature of the system $T$ is high, then the difference in probabilities between the lowest-energy state and the other states will vanish: $$\lim_{T\to\infty}\frac{p_{\min}}{p_i}=\lim_{T\to\infty}e^{(\varepsilon_i-\varepsilon_{\min})/(kT)}=1$$
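
Sweeping the same toy energies over temperature shows this limit numerically: the advantage of the lowest-energy state washes out as $T$ grows (again arbitrary units, purely illustrative):

```python
import numpy as np

energies = np.array([0.1, 0.5, 1.2, 2.0])   # same illustrative energies; state 0 has the minimum
k = 1.0

for T in (0.1, 1.0, 10.0, 1000.0):
    p = np.exp(-energies / (k * T))
    p /= p.sum()
    # ratio of the lowest-energy state's probability to each state's probability
    print(T, (p[0] / p).round(3))            # all ratios approach 1 as T grows
```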

So, in the OP's equation the energy is $\varepsilon=-z$ and the temperature is $T\sim 1/\lambda$. The OP also singles out a base state and fixes its unnormalized weight at 1 instead of an exponential. This doesn't change anything for the intuition; it only measures all energies relative to the chosen base state. This is VERY intuitive to a physicist.
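
A quick numerical check of this correspondence (illustrative numbers; the base state is pinned at $\varepsilon=0$, all other states get $\varepsilon=-z$ and $kT=1/\lambda$):

```python
import numpy as np

z = np.array([1.0, 2.0, -0.5])   # illustrative z in the question's notation
lam = 2.0

# OP's base-state form: weight 1 for the base category, exp(lambda * z_k) otherwise
w = np.concatenate(([1.0], np.exp(lam * z)))
p_softmax = w / w.sum()

# Boltzmann form with energies eps = -z (base state gets eps = 0) and kT = 1/lambda
eps = np.concatenate(([0.0], -z))
kT = 1.0 / lam
p_boltzmann = np.exp(-eps / kT) / np.exp(-eps / kT).sum()

print(np.allclose(p_softmax, p_boltzmann))   # True: same distribution

# Shifting every energy by a constant (re-choosing the reference energy)
# leaves the probabilities unchanged, as the shift cancels in the normalization.
p_shifted = np.exp(-(eps + 3.7) / kT) / np.exp(-(eps + 3.7) / kT).sum()
print(np.allclose(p_boltzmann, p_shifted))   # True
```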

Statistician's intuition

A statistician will immediately recognize multinomial logit (mlogit) regression. For those who only know binary logistic regression, here's how mlogit works.

Estimate $n-1$ binary logits of the $n-1$ states vs a chosen base state on censored data sets. That is, you build a data set containing only the base state, say $1$, and one of the other states $i\in[2,n]$. This way you get $n-1$ logits, one for each $i$, and they are conditional: $$\ln\frac{\Pr[i\mid i\cup 1]}{\Pr[1\mid i\cup 1]}\sim X_i$$ This equation is more recognizable as $$\ln\frac{p}{1-p}\sim X_i,$$ which is how it is usually presented in the binary case, where there are only two categories to choose from, just like in our censored subset of the full data set with $n$ categories.

Using Bayes theorem we know that: $$\Pr[i\mid i\cup 1]=\frac{\Pr[i]}{\Pr[i]+\Pr[1]}$$ so the conditional log-odds above are just the unconditional log-odds $\ln(\Pr[i]/\Pr[1])$. We can therefore trivially combine the $n-1$ binary regressions into a single one to get the unconditional probabilities: $$\Pr[i]=\frac{e^{X_i\beta_i}}{1+\sum_{j=2}^{n} e^{X_j\beta_j}}$$ This gets us the OP's equation.
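
Here is a numerical check of that recombination (toy coefficients, not an actual fit; the names `betas` and `x` are purely illustrative):

```python
import numpy as np

# Suppose the multinomial-logit model holds with base category 1 and
# per-category coefficients beta_i, so Pr[i] is proportional to exp(x @ beta_i)
# while Pr[1] is proportional to 1.
rng = np.random.default_rng(0)
n_cat = 4                                 # categories 1..4, category 1 is the base
betas = rng.normal(size=(n_cat - 1, 3))   # beta_2, beta_3, beta_4
x = rng.normal(size=3)                    # one observation's covariates

scores = betas @ x                        # X_i beta_i for i = 2..n
p = np.concatenate(([1.0], np.exp(scores)))
p /= p.sum()                              # unconditional Pr[1], ..., Pr[n]

# Conditional (pairwise) log-odds of state i vs the base, as each censored
# binary logit would target: ln(Pr[i | i or 1] / Pr[1 | i or 1])
cond_logodds = np.log(p[1:] / (p[1:] + p[0])) - np.log(p[0] / (p[1:] + p[0]))
print(np.allclose(cond_logodds, scores))  # True: pairwise log-odds equal X_i beta_i

# Recombining the pairwise log-odds recovers the unconditional probabilities
p_rebuilt = np.concatenate(([1.0], np.exp(cond_logodds)))
p_rebuilt /= p_rebuilt.sum()
print(np.allclose(p, p_rebuilt))          # True
```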