Training a Boltzmann Machine (Non-Restricted)

calculus, machine-learning, proof-explanation, statistical-inference, statistics

I'm reading through Neural Networks and Deep Learning by Charu C. Aggarwal, specifically chapter 6, which covers Restricted Boltzmann Machines; section 6.3 covers (unrestricted) Boltzmann Machines.

I'll try to summarize the content of the section (because I'd like an answer formulated in similar notation, though that's not essential) and then ask my specific question.

A Boltzmann Machine (BM) is essentially a probabilistic model expressed as an undirected graph. If $i$ and $j$ are two nodes of the graph, then $w_{ij} = w_{ji}$ is the weight associated with the edge $(i,j)$; it is also assumed that $w_{ii} = 0$, so there are no self-loops (no edge starts and ends at the same node).

The states of a BM are either visible or hidden, i.e. if $q = m + d$ is the total number of states, then $d$ of them are visible and $m$ are hidden. The state can be represented as a pair of vectors $(\vec{v},\vec{h})$, where $\vec{v}$ collects the visible states and $\vec{h}$ the hidden states. Together the pair gives the full state vector $(\vec{v},\vec{h}) = \vec{s} = (s_1,\ldots,s_q)$.

An important concept in a BM is the energy, defined as

$$
E = - \sum_i b_i s_i - \sum_{i,j: i < j} w_{ij} s_i s_j, \;\;\;(1)
$$

and such energy leads to the definition of energy gap

$$
\Delta E_i = E_{s_i = 0} - E_{s_i = 1} = b_i + \sum_{j : j \neq i} w_{ij} s_j,
$$

where the second equality follows from a simple calculation: substituting $s_i = 0$ and $s_i = 1$ into $(1)$ and subtracting cancels every term that does not involve $s_i$. The probability of a specific state is defined as

$$
P(\vec{s}) = \frac{e^{-E(\vec{s})}}{Z}, \;\;\;\; (2)
$$

where $E(\vec{s})$ is given by $(1)$ and
$$
Z = \sum_{\vec{s}} e^{-E(\vec{s})}.
$$
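To make $(1)$ and $(2)$ concrete, here is a minimal brute-force NumPy sketch (my own illustration, not code from the book; the function names and toy sizes are assumptions) that enumerates every state of a tiny BM and computes $E(\vec{s})$, $Z$, and $P(\vec{s})$. It is only feasible for very small $q$:

```python
import itertools
import numpy as np

def energy(s, W, b):
    """Energy (1): E(s) = -sum_i b_i s_i - sum_{i<j} w_ij s_i s_j."""
    # W is symmetric with zero diagonal, so 0.5 * s^T W s = sum_{i<j} w_ij s_i s_j.
    return -b @ s - 0.5 * s @ W @ s

def brute_force_distribution(W, b):
    """Enumerate all 2^q binary states; return them with their probabilities (2)."""
    q = len(b)
    states = np.array(list(itertools.product([0, 1], repeat=q)), dtype=float)
    energies = np.array([energy(s, W, b) for s in states])
    Z = np.sum(np.exp(-energies))           # partition function
    return states, np.exp(-energies) / Z    # P(s) = exp(-E(s)) / Z

# Toy example with q = 4 units: random symmetric weights, zero diagonal.
rng = np.random.default_rng(0)
A = rng.normal(size=(4, 4))
W = np.triu(A, k=1) + np.triu(A, k=1).T
b = rng.normal(size=4)
states, probs = brute_force_distribution(W, b)
print(probs.sum())   # should print 1.0 (up to floating-point error)
```

Since $W$ is symmetric with zero diagonal, $\frac{1}{2}\vec{s}^{\,T} W \vec{s}$ equals $\sum_{i<j} w_{ij} s_i s_j$, which is why the $\tfrac{1}{2}$ appears in the code.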

Also observe that, thanks to Bayes' rule, we get:

$$
P(s_i = 1 | s_1,\ldots,s_{i-1},s_{i+1},\ldots s_{q}) = \frac{1}{1 + e^{-\Delta E_i}}
$$

and this conditional probability is exactly what Gibbs sampling (a Markov chain Monte Carlo method) uses to generate samples from the model.
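For concreteness, here is a minimal NumPy sketch of Gibbs sampling for a BM (my own illustration, not code from the book; the function name and default number of sweeps are assumptions):

```python
import numpy as np

def gibbs_sample(W, b, n_sweeps=1000, rng=None):
    """Sample a state of a Boltzmann Machine by Gibbs sampling.

    W: symmetric weight matrix with zero diagonal, b: bias vector.
    States are binary, s_i in {0, 1}.
    """
    rng = np.random.default_rng() if rng is None else rng
    q = len(b)
    s = rng.integers(0, 2, size=q).astype(float)   # random initial state
    for _ in range(n_sweeps):
        for i in range(q):
            # Energy gap: Delta E_i = b_i + sum_{j != i} w_ij s_j
            # (w_ii = 0, so the full dot product is equivalent)
            delta_E = b[i] + W[i] @ s
            p_on = 1.0 / (1.0 + np.exp(-delta_E))  # P(s_i = 1 | all other states)
            s[i] = float(rng.random() < p_on)
    return s
```

Each unit is resampled in turn from $P(s_i = 1 \mid \text{rest}) = 1/(1+e^{-\Delta E_i})$; after enough sweeps the visited states are approximately distributed according to $(2)$.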

Now the bit I don't understand

To perform training we want to maximize the log-likelihood

$$
\log P(\vec{s}) = -E(\vec{s}) - \log Z,
$$

which is fine, but the following equation seems to appear out of nowhere:

$$
\frac{\partial \log P(\vec{s})}{ \partial w_{ij}} = \left\langle s_i,s_j \right\rangle_{data} - \left\langle s_i, s_j \right\rangle_{model} \;\;\;\; (3)
$$

Quoting the book now

Here $\left\langle s_i,s_j \right\rangle_{data}$ represents the averaged value of $s_i s_j$ obtained by running the generative process of Section 6.3.1, when the visible states are clamped to attribute values in a training point. The averaging is done over a mini-batch of training points. Similarly $\left\langle s_i,s_j \right\rangle_{model}$ represents the averaged value $s_i s_j$ at thermal equilibrium without fixing visible states to training points and simply running the generative process of Section 6.3.1.

I've tried to derive the equation myself, but I don't arrive at the same expression. Can anyone show me how the equation is derived, or point me to a good reference?

The book points to this paper, which I assume is the original one, but that paper makes use of the KL divergence, and that measure is not mentioned anywhere in the book.

Can anyone show me how to derive the learning algorithm then?

Best Answer

Let $V$ and $H$ denote the sets of visible and hidden indices. We use $s_V := \{ s_i\}_{i \in V}$ and $s_H := \{s_i \}_{i \in H}$ to denote the entire collections of visible and hidden variables.

Suppose $s_V^\star := \{s_i^\star\}_{i\in V}$ are the values taken by the visible variables $s_V$ in our data. Our objective is to tune the $w_{ij}$ and $b_i$ parameters so as to optimise the log-likelihood of observing this data: \begin{align*} \log P(s^\star_{V}) &= \log \left(\sum_{s_{H}} P(s^\star_{V}, s_{H}) \right) \\ &= \log \left( \frac{\sum_{s_{H}} e^{-E(s^\star_{ V}, s_{H})} }{\sum_{s} e^{-E(s)} }\right).\end{align*}

We solve this optimisation problem by gradient ascent (equivalently, gradient descent on the negative log-likelihood).

Taking derivatives with respect to $w_{ij}$, we have

\begin{align*} \frac{\partial \log P(s^\star_V)}{\partial w_{ij}} &= - \frac{\sum_{s_{H}} \frac{\partial E(s^\star_{V}, s_{H})}{\partial w_{ij}}\, e^{-E(s^\star_{V}, s_{H})} }{\sum_{s_{H}} e^{-E(s^\star_{V}, s_{H})} } + \frac{\sum_{s} \frac{\partial E(s)}{\partial w_{ij}}\, e^{-E(s)} }{\sum_{s} e^{-E(s)} } \\ &= - \sum_{s_{H}} \left(\frac{\partial E(s^\star_{V}, s_{H})}{\partial w_{ij}}\, P\left( s_{H} \mid s^\star_{V}\right)\right) + \sum_{s} \left(\frac{\partial E(s)}{\partial w_{ij}}\, P\left( s\right)\right) \\ &= \mathbb E_{s_{H} \sim P\left(s_{H} \mid s^\star_{V}\right)} \left[s_i s_j|_{s_V = s^\star_V}\right] - \mathbb E_{s \sim P(s)}\left[s_i s_j \right], \end{align*} where the last step uses $\frac{\partial E(s)}{\partial w_{ij}} = -s_i s_j$, which follows directly from $(1)$. In your paper's notation, $$ \langle s_i s_j \rangle_{\rm data} := \mathbb E_{s_{H} \sim P\left(s_{H} \mid s^\star_{V}\right)} \left[s_i s_j|_{s_V = s^\star_V}\right],$$ $$\langle s_i s_j \rangle_{\rm model} := \mathbb E_{s \sim P(s)}\left[s_i s_j \right], $$ which completes the derivation.
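For completeness, exactly the same argument with $\frac{\partial E(s)}{\partial b_i} = -s_i$ gives the analogous gradient for the biases:

$$
\frac{\partial \log P(s^\star_V)}{\partial b_i} = \langle s_i \rangle_{\rm data} - \langle s_i \rangle_{\rm model}.
$$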

[Note: $$P\left( s_H \mid s_V^\star \right) = \frac{e^{-E(s_V^\star, s_H)}}{\sum_{s_H}e^{-E(s_V^\star, s_H)}}$$ is the conditional distribution for the hidden variables $s_H$, given the values $s_V^\star$ for the visible variables observed in the data. To sample from this distribution, you clamp the values of $s_i$ for $i\in V$ at the values $s_i^\star$ observed in the data, and only perform the Gibbs sampling for the variables $s_i$ with $i\in H$.]
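Putting the derivation together with the clamped/free Gibbs sampling described above, one training step could look roughly as follows. This is a hedged sketch, not the book's algorithm: the learning rate `lr`, the number of sweeps, and the single-sample estimate of each expectation are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gibbs_sweep(s, W, b, free_idx, rng):
    """One Gibbs sweep: resample, in turn, only the units listed in free_idx."""
    for i in free_idx:
        p_on = sigmoid(b[i] + W[i] @ s)  # P(s_i = 1 | rest); w_ii = 0, so the full dot product is fine
        s[i] = float(rng.random() < p_on)
    return s

def train_step(W, b, batch, visible_idx, hidden_idx, lr=0.01, n_sweeps=20, rng=None):
    """One gradient-ascent step on the log-likelihood:
    w_ij += lr * (<s_i s_j>_data - <s_i s_j>_model), and analogously for b_i."""
    rng = np.random.default_rng() if rng is None else rng
    q = len(b)
    pos_W, neg_W = np.zeros((q, q)), np.zeros((q, q))
    pos_b, neg_b = np.zeros(q), np.zeros(q)
    for v_star in batch:
        # Positive ("data") phase: clamp the visible units to the training point,
        # run Gibbs sampling over the hidden units only.
        s = rng.integers(0, 2, size=q).astype(float)
        s[visible_idx] = v_star
        for _ in range(n_sweeps):
            s = gibbs_sweep(s, W, b, hidden_idx, rng)
        pos_W += np.outer(s, s)
        pos_b += s
        # Negative ("model") phase: all units run freely towards thermal equilibrium.
        s = rng.integers(0, 2, size=q).astype(float)
        for _ in range(n_sweeps):
            s = gibbs_sweep(s, W, b, list(range(q)), rng)
        neg_W += np.outer(s, s)
        neg_b += s
    grad_W = (pos_W - neg_W) / len(batch)
    np.fill_diagonal(grad_W, 0.0)   # keep w_ii = 0
    W += lr * grad_W                # np.outer(s, s) is symmetric, so W stays symmetric
    b += lr * (pos_b - neg_b) / len(batch)
    return W, b
```

Using a single sample per phase is a crude Monte Carlo estimate of the two expectations; averaging over several samples collected near thermal equilibrium gives a more faithful estimate of $\langle s_i s_j \rangle_{\rm data}$ and $\langle s_i s_j \rangle_{\rm model}$.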
