Solved – Derivation of Restricted Boltzmann Machine Conditional Probability

deep-learning, machine-learning, mathematical-statistics, restricted-boltzmann-machine

I was reading Goodfellow, Bengio, and Courville's Deep Learning, and Chapter 20 gives a derivation of $P(h_j=1|\boldsymbol{v})$, the conditional probability that the $j^{th}$ hidden unit is on given a data vector $\boldsymbol{v}$:

Note that $Z'$ is a normalizing constant, and $\tilde{P}(E)$ is the unnormalized probability of event $E$.

The derivation of 20.7 through 20.11 is clear to me, but for equation 20.13, we assume the factor inside our product, $\exp\{c_j h_j + \boldsymbol{v}^\top \boldsymbol{W}_{:,j} h_j\}$, is exactly equal to our (unnormalized) per-unit probability $\tilde{P}(h_j|\boldsymbol{v})$.

(This is described in the paragraph between 20.11 and 20.12, and we evaluate this $\tilde{P}(h_j|\boldsymbol{v})$ for $h_j = 0$ and $h_j = 1$ to get equation 20.13.)
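For concreteness, since the book's equations aren't reproduced above, my reading of that evaluation step is

$$ \tilde{P}(h_j = 1|\boldsymbol{v}) = \exp\{c_j + \boldsymbol{v}^\top \boldsymbol{W}_{:,j}\}, \qquad \tilde{P}(h_j = 0|\boldsymbol{v}) = \exp\{0\} = 1, $$

and normalizing over these two values is what produces equation 20.13.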


My question is, why can we make this assumption?

The argument the book seems to make is that because $P(\boldsymbol{h}|\boldsymbol{v})$ takes the form $\frac{1}{Z'} \prod_{j=1}^{n_h} f(h_j)$, we can safely assume $f(h_j)$ must be the unnormalized per-unit probability $\tilde{P}(h_j|\boldsymbol{v})$.

I understand that $P(\boldsymbol{h}|\boldsymbol{v})$ must equal $\frac{1}{Z'} \prod_{j=1}^{n_h} \tilde{P}(h_j|\boldsymbol{v})$, but if we are given the equation $P(\boldsymbol{h}|\boldsymbol{v}) = \frac{1}{Z'} \prod_{j=1}^{n_h} f(h_j)$, couldn't there be several functions $f$ that fit this criterion? $f$ wouldn't necessarily have to be $\tilde{P}(h_j|\boldsymbol{v})$, would it?

Best Answer

Equations 20.7–20.11 first give us
$$ p(\boldsymbol{h}|\boldsymbol{v}) \propto \prod_j \exp\{c_j h_j + \boldsymbol{v}^\top \boldsymbol{W}_{:,j} h_j\}. $$
Summing out the other hidden units (write $\boldsymbol{h}_{-i}$ for all components of $\boldsymbol{h}$ except $h_i$), we get for any $i$
\begin{align*} p(h_i | \boldsymbol{v}) &\propto \sum_{\boldsymbol{h}_{-i}} \prod_j \exp\{c_j h_j + \boldsymbol{v}^\top \boldsymbol{W}_{:,j} h_j\} \\ &= \sum_{\boldsymbol{h}_{-i}} \left[ \exp\{c_i h_i + \boldsymbol{v}^\top \boldsymbol{W}_{:,i} h_i\} \prod_{j \neq i} \exp\{c_j h_j + \boldsymbol{v}^\top \boldsymbol{W}_{:,j} h_j\} \right] \\ &= \exp\{c_i h_i + \boldsymbol{v}^\top \boldsymbol{W}_{:,i} h_i\} \sum_{\boldsymbol{h}_{-i}} \prod_{j \neq i} \exp\{c_j h_j + \boldsymbol{v}^\top \boldsymbol{W}_{:,j} h_j\} \\ &\propto \exp\{c_i h_i + \boldsymbol{v}^\top \boldsymbol{W}_{:,i} h_i\}, \end{align*}
since the remaining sum is a finite constant that does not depend on $h_i$ (each $h_j$ only takes the values $0$ and $1$).
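If you want a quick numerical sanity check of that proportionality, here is a minimal sketch (a toy RBM with made-up sizes, NumPy, and helper names of my own; this is not code from the book) that compares brute-force marginalization of $p(\boldsymbol{h}|\boldsymbol{v})$ with the closed-form sigmoid that falls out of the normalization below:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

n_v, n_h = 4, 3                      # toy sizes (my own choice)
W = rng.normal(size=(n_v, n_h))      # weights
c = rng.normal(size=n_h)             # hidden biases
v = rng.integers(0, 2, size=n_v)     # an arbitrary binary visible vector

def brute_force_p_hj_given_v(j):
    """P(h_j = 1 | v) by enumerating all h in {0,1}^n_h and marginalizing."""
    num, den = 0.0, 0.0
    for h in product([0, 1], repeat=n_h):
        h = np.array(h)
        w = np.exp(c @ h + v @ W @ h)   # unnormalized p(h | v)
        den += w
        if h[j] == 1:
            num += w
    return num / den

# Closed form from the derivation: P(h_j = 1 | v) = sigmoid(c_j + v^T W_{:,j})
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
closed_form = sigmoid(c + v @ W)

brute = np.array([brute_force_p_hj_given_v(j) for j in range(n_h)])
print(np.allclose(brute, closed_form))   # should print True
```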

You ask whether $\exp\{c_i h_i + \boldsymbol{v}^\top \boldsymbol{W}_{:,i} h_i\}$ constitutes an unnormalized probability distribution. The answer is yes, and distributions specified this way have a special name: the book you cite calls them "energy-based models" (are these the same as exponential family models?). It is easy to see that

$$ p(h_i|\boldsymbol{v}) = \frac{ \exp\{c_i h_i + \boldsymbol{v}^\top \boldsymbol{W}_{:,i} h_i\} }{ \sum_{h \in \{0,1\}} \exp\{c_i h + \boldsymbol{v}^\top \boldsymbol{W}_{:,i} h\} } $$ because $0 < \sum_{h \in \{0,1\}} \exp\{c_i h + \boldsymbol{v}^\top \boldsymbol{W}_{:,i} h\} < \infty$.
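Writing out the two-term sum in the denominator makes the final step explicit:

$$ p(h_i = 1|\boldsymbol{v}) = \frac{\exp\{c_i + \boldsymbol{v}^\top \boldsymbol{W}_{:,i}\}}{1 + \exp\{c_i + \boldsymbol{v}^\top \boldsymbol{W}_{:,i}\}} = \sigma(c_i + \boldsymbol{v}^\top \boldsymbol{W}_{:,i}), $$

which is (up to notation) the sigmoid expression your question refers to as equation 20.13.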