Solved – Getting probability from Restricted Boltzmann Machine

deep-learning, machine-learning, probability, restricted-boltzmann-machine

Let's consider a trained Restricted Boltzmann Machine model. It was trained to maximize $P(v)$. Since it's a generative model, how can I get the probability of an input vector that it is supposed to model? I know I can determine one using the following equation, but it is the same as in (unrestricted) Boltzmann Machines. Does the "restriction" only improve learning?

$$
P(v) = \frac{\sum_{h} e^{-E(v,h)}}{\sum_{u}\sum_{g}e^{-E(u,g)}}
$$

My first thought for approximating the numerator was to clamp the visible units, observe how the hidden units change, and record the most common value of $E$. When the visible units are clamped I am at equilibrium. I could do something similar for the denominator, but is there any way around this?

EDIT: I found an expression for the free energy in A Practical Guide to Training Restricted Boltzmann Machines (Hinton, 2010), but I don't understand it. Can someone explain?
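
If I am reading the guide right, the expression (for binary units) is
$$
F(v) = -\sum_i a_i v_i - \sum_j \log\left(1 + e^{\,b_j + \sum_i v_i w_{ij}}\right),
$$
where $a_i$, $b_j$ and $w_{ij}$ denote the visible biases, hidden biases and weights, and $P(v) \propto e^{-F(v)}$, so the sum over $h$ in my numerator apparently has a closed form.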

Best Answer

The "Restricted" in Restricted Boltzmann Machine (RBM) refers to the topology of the network, which must be a bipartite graph. This means the nodes can be partitioned into two distinct groups, $V$ and $H$ ("visible" vs. "hidden"), such that all connections have one end in each group, i.e. there are no connections between nodes in the same group. A generic Boltzmann Machine does not have this restriction (so you cannot necessarily distinguish "visible" vs. "hidden" node groups based on the connectivity).

Now, as for estimating the probability of an "input vector" (an assignment of values to $v$?), your equation is correct, I believe. Note that the denominator is a (normalizing) constant for the trained RBM, known as the "partition function" and commonly denoted $Z$. For an RBM the joint distribution of $v$ and $h$ is defined by $$P(v,h)=\frac{1}{Z}e^{-E(v,h)}$$ where $E(v,h)$ is the energy function of the RBM (the "free energy" $F(v)$ from your edit is defined by $e^{-F(v)}=\sum_h e^{-E(v,h)}$, i.e. it is exactly the negative log of your numerator). The equation you wrote is simply the marginal distribution of $v$, obtained in the usual way from the joint distribution by summing over $h$ (i.e. once you know $P(v,h)$, there is nothing RBM-specific here).
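
To make the normalization concrete, here is a minimal sketch (my own, not part of any standard API) that computes $P(v)$ exactly for a toy binary RBM by brute-force enumeration. The weights `W` and biases `a`, `b` are made-up placeholders; this only works when the number of units is tiny, since $Z$ sums over every configuration.

```python
import itertools
import numpy as np

# Toy binary RBM with made-up parameters (3 visible, 2 hidden units).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 2))  # visible-to-hidden weights
a = rng.normal(scale=0.1, size=3)       # visible biases
b = rng.normal(scale=0.1, size=2)       # hidden biases

def energy(v, h):
    """E(v, h) = -a.v - b.h - v.W.h for a binary RBM."""
    return -(a @ v + b @ h + v @ W @ h)

def unnormalized_p(v):
    """Numerator of P(v): sum of exp(-E(v, h)) over all hidden configurations."""
    return sum(np.exp(-energy(v, np.array(h)))
               for h in itertools.product([0, 1], repeat=W.shape[1]))

def partition_function():
    """Z: sum of exp(-E(u, g)) over every visible and hidden configuration."""
    return sum(unnormalized_p(np.array(u))
               for u in itertools.product([0, 1], repeat=W.shape[0]))

v = np.array([1, 0, 1])
print(unnormalized_p(v) / partition_function())  # exact P(v) for the toy model
```

The inner sum over $h$ can also be collapsed analytically (that is where the free energy in the question's edit comes from); it is the outer sum over all visible configurations in $Z$ that makes exact evaluation intractable at realistic sizes.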

The "restriction" allows efficient learning, yes, but it could also affect inference. The restriction means that in $E(v,h)=E([v_1,\ldots,v_m],[h_1,\ldots,h_n])$ there are no couplings between $v_i$'s or $h_j$'s. In terms of inference, this means that the visible nodes $v_i$ and $v_j$ are conditionally independent of each other given $h$, i.e. $p(v_i,v_j\mid h)=p(v_i\mid h)p(v_j\mid h)$. (And the same holds for $h_i$ and $h_j$, given $v$.)

So the conditional distribution for $v$ factors as $$P(v\mid h)=\prod_i p(v_i\mid h)$$ (and similarly for $P(h\mid v)$). The RBM is designed to be able to easily compute these conditional probabilities (and sample from the corresponding conditional distributions).
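
As a concrete illustration (my own sketch, not from the answer), for binary units these conditionals are just sigmoids of affine functions of the other layer, assuming the usual energy $E(v,h) = -a^\top v - b^\top h - v^\top W h$; the parameter names here are placeholders.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_h_given_v(v, W, b):
    """Vector of p(h_j = 1 | v) = sigma(b_j + sum_i v_i w_ij); factorizes over j."""
    return sigmoid(b + v @ W)

def p_v_given_h(h, W, a):
    """Vector of p(v_i = 1 | h) = sigma(a_i + sum_j w_ij h_j); factorizes over i."""
    return sigmoid(a + W @ h)

# Example with made-up parameters: 3 visible units, 2 hidden units.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 2))
a = rng.normal(scale=0.1, size=3)
b = rng.normal(scale=0.1, size=2)

v = np.array([1, 0, 1])
ph = p_h_given_v(v, W, b)            # p(h_j = 1 | v) for each hidden unit
h_sample = (rng.random(2) < ph) * 1  # block-sample the whole hidden layer at once
print(ph, h_sample)
```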

The marginal probability of $v$ can in principle be computed as an expected value $$P(v)=\sum_h P(v\mid h)\,P(h)=\mathbb{E}_h[P(v\mid h)]$$ so there is probably some way to estimate it via Monte Carlo (e.g. block Gibbs sampling, if I am interpreting this summary correctly)?
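
Along those lines, here is a rough sketch of the naive estimator: draw hidden samples with block Gibbs sampling on the model and average $P(v\mid h)$ over them. The parameters are again toy placeholders, and this estimator can have high variance, so I am not claiming it is what is used in practice.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def estimate_p_v(v, W, a, b, n_samples=10_000, burn_in=1_000, rng=None):
    """Naive Monte Carlo estimate of P(v) = E_h[P(v | h)].

    Hidden samples h ~ P(h) are drawn by running block Gibbs sampling
    (v' -> h -> v' -> ...) on the model and keeping the hidden states.
    """
    rng = rng or np.random.default_rng(0)
    v_chain = rng.integers(0, 2, size=len(a))    # arbitrary start for the chain
    total, kept = 0.0, 0
    for t in range(burn_in + n_samples):
        ph = sigmoid(b + v_chain @ W)            # p(h_j = 1 | v_chain)
        h = (rng.random(len(b)) < ph) * 1        # block-sample hidden layer
        pv = sigmoid(a + W @ h)                  # p(v_i = 1 | h)
        v_chain = (rng.random(len(a)) < pv) * 1  # block-sample visible layer
        if t >= burn_in:
            # P(v | h) = prod_i p(v_i | h), thanks to the factorization above.
            total += np.prod(np.where(v == 1, pv, 1.0 - pv))
            kept += 1
    return total / kept

# Toy example with made-up parameters (3 visible, 2 hidden units).
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(3, 2))
a = rng.normal(scale=0.1, size=3)
b = rng.normal(scale=0.1, size=2)
print(estimate_p_v(np.array([1, 0, 1]), W, a, b, rng=rng))
```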

All of this is far outside my expertise, as I do not do neural networks, graphical models, or MCMC. So I welcome any corrections from those that do!