Solved – Expectation of the softmax transform for Gaussian multivariate variables

Tags: approximation, expected value, logistic, softmax

Prelims

In the article "Sequential updating of conditional probabilities on directed graphical structures", Spiegelhalter and Lauritzen give an approximation to the expectation of a logistic-transformed Gaussian random variable $\theta \sim N(\mu, \sigma^2)$. It uses the Gaussian CDF $\Phi$ in the approximation

$$ \exp(\theta)/(1 + \exp(\theta)) \approx \Phi(\theta \epsilon) $$

for an appropriately chosen $\epsilon$ (in their case they chose $\epsilon = 0.607$). Hence

$$ \mathbb{E} \left [ \exp(\theta)/(1 + \exp(\theta))\right ] \approx \int_{- \infty}^{\infty} \Phi(\theta \epsilon) \phi(\theta | \mu, \sigma^2) d \theta$$

where $\phi$ is a Gaussian pdf function. The integral can be written as

$$ \int_{-\infty}^{\infty} \Pr(U < 0 \mid \theta)\, \phi(\theta \mid \mu, \sigma^2)\, d\theta $$

where $U \mid \theta \sim N(-\theta, \epsilon^{-2})$, so that $\Pr(U < 0 \mid \theta) = \Phi(\theta \epsilon)$, and the integral is then simply the marginal probability $\Pr(U < 0)$. Note that as $\theta \sim N(\mu, \sigma^2)$, marginally $U \sim N(-\mu, \sigma^2 + \epsilon^{-2})$. Hence

$$ \mathbb{E} \left [ \exp(\theta)/(1 + \exp(\theta))\right ] \approx \Pr(U < 0) = \Phi(\frac{\mu}{\sqrt{\sigma^2 + \epsilon^{-2}}})$$

We can then use the initial approximation in the reverse direction to get

$$ \mathbb{E} \left [ \exp(\theta)/(1 + \exp(\theta))\right ] \approx \exp(c \mu)/(1 + \exp(c \mu)) $$

where $c = (1 + \epsilon^2 \sigma^2)^{-1/2}$.
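
A minimal numerical sketch of this scalar approximation (assuming NumPy and SciPy are available; the values of $\mu$ and $\sigma^2$ below are arbitrary examples), comparing the probit form, the logistic closed form, and a Monte Carlo reference:

```python
import numpy as np
from scipy.stats import norm
from scipy.special import expit  # logistic sigmoid

eps = 0.607            # constant used by Spiegelhalter & Lauritzen
mu, sigma2 = 1.2, 2.0  # arbitrary example values

# Monte Carlo reference for E[sigma(theta)], theta ~ N(mu, sigma2)
rng = np.random.default_rng(0)
theta = rng.normal(mu, np.sqrt(sigma2), size=1_000_000)
mc = expit(theta).mean()

# Probit form: Phi(mu / sqrt(sigma2 + eps^-2))
probit = norm.cdf(mu / np.sqrt(sigma2 + eps**-2))

# Logistic form: sigma(c * mu) with c = (1 + eps^2 * sigma2)^(-1/2)
c = (1 + eps**2 * sigma2) ** -0.5
logistic = expit(c * mu)

print(mc, probit, logistic)  # all three should be close
```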

Question

My question is: are there any approximations to the expectation of a softmax transformation of multivariate Gaussian variables? In particular, let

$$ \boldsymbol{Z} \sim MVN(\boldsymbol{\mu}, \Sigma) \in \mathbb{R}^{n} $$

Define the $k$ activations, one for each discrete outcome, as

$$ f_i(\boldsymbol{Z}, \boldsymbol{w}_i) = \boldsymbol{w}_i^T \boldsymbol{Z} $$

Finally define our softmax transformed activations as
$$ P_i(\boldsymbol{Z}) = \frac{\exp(f_i(\boldsymbol{Z}, \boldsymbol{w}_i))}{\sum_{j=1}^k \exp(f_j(\boldsymbol{Z}, \boldsymbol{w}_j))} $$

What I want is an approximation to the expectation
$$ \mathbb{E}[P_i(\boldsymbol{Z})] $$
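
For reference, the quantity to be approximated can always be estimated by Monte Carlo; a minimal sketch (assuming NumPy and SciPy; the dimensions, $\boldsymbol{\mu}$, $\Sigma$ and weights below are arbitrary placeholders):

```python
import numpy as np
from scipy.special import softmax

rng = np.random.default_rng(0)

# Arbitrary example problem: n-dimensional Gaussian, k outcomes
n, k = 3, 4
mu = rng.normal(size=n)
A = rng.normal(size=(n, n))
Sigma = A @ A.T                    # a valid covariance matrix
W = rng.normal(size=(k, n))        # rows are the weight vectors w_i

# Monte Carlo estimate of E[P_i(Z)] as a reference
Z = rng.multivariate_normal(mu, Sigma, size=500_000)  # shape (samples, n)
F = Z @ W.T                                           # activations f_i = w_i^T Z
P_mc = softmax(F, axis=1).mean(axis=0)
print(P_mc)
```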

Note that in the case $k=2$, we have

$$ P_1(\boldsymbol{Z}) = \frac{\exp(f_1(\boldsymbol{Z}, \boldsymbol{w}_1))}{ \exp(f_1(\boldsymbol{Z}, \boldsymbol{w}_1)) + \exp(f_2(\boldsymbol{Z}, \boldsymbol{w}_2))} $$

Therefore

$$ P_1(\boldsymbol{Z}) = \frac{\exp(f_1(\boldsymbol{Z}, \boldsymbol{w}_1) - f_2(\boldsymbol{Z}, \boldsymbol{w}_2))}{ \exp(f_1(\boldsymbol{Z}, \boldsymbol{w}_1) - f_2(\boldsymbol{Z}, \boldsymbol{w}_2)) + 1} $$

and as $f_1(\boldsymbol{Z}, \boldsymbol{w}_1) - f_2(\boldsymbol{Z}, \boldsymbol{w}_2) = (\boldsymbol{w}_1 - \boldsymbol{w}_2)^T \boldsymbol{Z}$ is a linear combination of jointly Gaussian random variables, it is itself Gaussian distributed. Hence we can use the initial approximation.
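
A sketch of this $k=2$ reduction (assuming NumPy and SciPy; `expected_sigmoid`, `expected_P1_binary` and the example numbers are illustrative, not from any library):

```python
import numpy as np
from scipy.special import expit

def expected_sigmoid(mu, var, eps=0.607):
    """Scalar approximation E[sigma(theta)] ~ sigma(c * mu) for theta ~ N(mu, var)."""
    c = (1.0 + eps**2 * var) ** -0.5
    return expit(c * mu)

def expected_P1_binary(mu, Sigma, w1, w2):
    """k = 2 case: E[P_1(Z)] via the Gaussian difference d^T Z, d = w1 - w2."""
    d = w1 - w2
    mu_d = d @ mu              # mean of f_1 - f_2
    var_d = d @ Sigma @ d      # variance of f_1 - f_2
    return expected_sigmoid(mu_d, var_d)

# Example with arbitrary numbers: two outcomes in R^3
mu = np.array([0.5, -0.2, 1.0])
Sigma = np.array([[1.0, 0.3, 0.0],
                  [0.3, 2.0, 0.5],
                  [0.0, 0.5, 1.5]])
w1 = np.array([1.0, 0.0, 0.5])
w2 = np.array([0.2, -1.0, 0.0])
print(expected_P1_binary(mu, Sigma, w1, w2))
```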

Can we generalise for $k > 2$?

Best Answer

I am sorry to resurrect a fairly old question, but I was facing a very similar problem recently and stumbled upon a paper that might offer some help. The article is "Semi-analytical approximations to statistical moments of sigmoid and softmax mappings of normal variables", available at https://arxiv.org/pdf/1703.00091.pdf

Approximation to the expectation of the softmax

To compute the expectation of a softmax mapping $\pi \left( \mathbf{\mathsf{x}} \right)$ of multivariate normal variables $\mathbf{\mathsf{x}} \sim \mathcal{N}_D \left( \boldsymbol{\mu}, \boldsymbol{\Sigma} \right)$, the author provides the following approximation:

$$ \mathbb{E} \left[ \pi^k (\mathbf{\mathsf{x}}) \right] \simeq \frac{1}{2 - D + \sum_{k' \neq k} \frac{1}{\mathbb{E} \left[ \sigma \left( x^k - x^{k'} \right) \right]}} $$

where $x^k$ denotes the $k$-th component of the $D$-dimensional vector $\mathbf{\mathsf{x}}$ and $\sigma \left( x \right)$ denotes the one-dimensional sigmoid (logistic) function. To evaluate this formula one needs the expectation $\mathbb{E} \left[ \sigma (x) \right]$ of a sigmoid of a Gaussian, for which you could use your own approximation above (a very similar approximation is again provided in the aforementioned article).
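
A sketch of how this could be wired up for your setting (assuming NumPy and SciPy; `expected_sigmoid` and `expected_softmax` are illustrative helper names, and the scalar step uses your logistic approximation with $\epsilon = 0.607$ rather than the paper's variant). The activations $\boldsymbol{f} = W\boldsymbol{Z}$ are themselves multivariate normal with mean $W\boldsymbol{\mu}$ and covariance $W \Sigma W^T$, so the formula can be applied to them directly:

```python
import numpy as np
from scipy.special import expit

def expected_sigmoid(mu, var, eps=0.607):
    """Approximation E[sigma(theta)] ~ sigma(mu / sqrt(1 + eps^2 var)), theta ~ N(mu, var)."""
    return expit(mu / np.sqrt(1.0 + eps**2 * var))

def expected_softmax(mu_f, Sigma_f, eps=0.607):
    """Approximate E[softmax(f)_i] for f ~ N(mu_f, Sigma_f) via the sigmoid
    re-writing: E[pi^k] ~ 1 / (2 - D + sum_{k'!=k} 1 / E[sigma(f_k - f_k')])."""
    D = len(mu_f)
    out = np.empty(D)
    for i in range(D):
        s = 0.0
        for j in range(D):
            if j == i:
                continue
            mu_d = mu_f[i] - mu_f[j]                                  # mean of f_i - f_j
            var_d = Sigma_f[i, i] + Sigma_f[j, j] - 2.0 * Sigma_f[i, j]  # variance of f_i - f_j
            s += 1.0 / expected_sigmoid(mu_d, var_d, eps)
        out[i] = 1.0 / (2.0 - D + s)
    return out

# For the question's setup with activations f = W Z, Z ~ N(mu, Sigma):
#   mu_f    = W @ mu
#   Sigma_f = W @ Sigma @ W.T
# and expected_softmax(mu_f, Sigma_f) approximates E[P_i(Z)] for each i.
```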

This formula is based on a re-writing of the softmax in terms of sigmoids. It starts from the $D=2$ case you mentioned, where the expression is "exact" (as exact as the underlying sigmoid approximation allows), and postulates its validity for $D>2$. The author validates the proposal numerically.
