Solved – Updating bias with RBMs (Restricted Boltzmann Machines)

deep learning, deep-belief-networks, restricted-boltzmann-machine

Am very new to RBMs, trying to write an RBM program now. Sorry if this is a silly question and/or answered on here already.

I've read a few articles online, and questions on here, but I can't find anything on how to update the biases (or bias weights). I've read about setting them initially, and seen fleeting references to the fact that they are to be updated. Geoff Hinton mentions bias, of course, and it features in his equations (I've read a few of his papers and seen a couple of video presentations), but I can't find any mention in his papers of bias being altered once set, or how/when/why to do so. I must be missing something!

Edwin Chen's paper is recommended on here:
http://blog.echen.me/2011/07/18/introduction-to-restricted-boltzmann-machines/
He doesn't mention how to update/'learn' the biases; they are altered in his program but I don't understand his code. (He has a table giving the updated weights which doesn't seem right – strangely it gives a bias-bias weight, whatever that could mean.)

Um, so I'm not sure why this is so hard to find material on!!
Thanks for any help.

Best Answer

TLDR

For $n$-step Contrastive Divergence, update visible bias $b_j$, based on data vector $\mathbf{d}$ using:

$$ b_j^{(t)} \gets b_j^{(t-1)} + \eta \left( d_j - \hat{v}_j^{(n)} \right) $$

Update hidden bias $c_i$ using:

$$ c_i^{(t)} \gets c_i^{(t-1)} + \eta \left( \hat{h}_{i}^{(0)} - \hat{h}_{i}^{(n)} \right) $$

Where $b_j^{(t)}$ and $c_i^{(t)}$ are the biases after update number $t$, $\eta$ is the learning rate, $d_j$ is the $j$th component of the data vector, and $\hat{h}_i^{(n)}$ and $\hat{v}_j^{(n)}$ are the probabilities of hidden unit $i$ and visible unit $j$ being active once the RBM has been exposed to the data and run for $n$ steps. This assumes a minibatch size of 1; for a practical minibatch size $k$, average the updates obtained over the $k$ data vectors.
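As a rough illustration, the two bias updates above might be coded like this in NumPy. It is a minimal sketch, not a reference implementation: the function name `cd_n_bias_update`, the `(num_hidden, num_visible)` layout of `W`, the logistic `sigmoid` units, and the choice to drive the chain with sampled binary states are all assumptions that the formulas themselves do not fix.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_n_bias_update(W, b, c, d, n=1, eta=0.1, rng=None):
    """One CD-n update of visible biases b and hidden biases c for one data vector d.

    W has shape (num_hidden, num_visible), so W[i, j] links hidden i to visible j.
    Assumes binary units with logistic activations (an assumption of this sketch).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = np.random.default_rng() if rng is None else rng

    h_hat_0 = sigmoid(W @ d + c)      # \hat{h}^{(0)}: hidden probabilities given the data
    h_hat = h_hat_0
    for _ in range(n):
        h = (rng.random(h_hat.shape) < h_hat).astype(float)   # sample hidden states
        v_hat = sigmoid(W.T @ h + b)                           # \hat{v}: visible probabilities
        v = (rng.random(v_hat.shape) < v_hat).astype(float)   # sample visible states
        h_hat = sigmoid(W @ v + c)                             # \hat{h} after this step

    b_new = b + eta * (d - v_hat)         # visible bias rule
    c_new = c + eta * (h_hat_0 - h_hat)   # hidden bias rule
    return b_new, c_new
```

For a minibatch of $k$ vectors, one option is to call this once per vector and average the resulting changes to `b` and `c` before applying them.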


Full explanation

I had the same trouble. A good way to think of it is that the biases are themselves just weights. Often in neural network models, the bias of a unit is modeled as the weight of a link connecting the unit in question to an "always on" unit, i.e., an imaginary unit whose activation is always 1.

In the RBM case, that means imagining one extra visible unit whose output is always 1. This visible unit connects to each of the hidden units (just like any other visible unit does), and the weights of these connections are the biases of the respective hidden units. Similarly, the biases of the visible units can be modeled by imagining an extra hidden unit, whose value is always 1, which connects to each of the visible units, with the weights of these connections being the visible biases.

You could even implement your RBM this way, but I don't think people usually do that. The point is that, thinking about it in this way, you can use (essentially) the same update rule for the biases as you do for the weights, since biases are just weights connecting to "always on" units.
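To make the "always on" picture concrete, here is a small sketch of how the weights and both sets of biases could be packed into a single matrix. The helper names `augment`, `hidden_input`, and `visible_input`, and the convention of putting the always-on units at index 0, are hypothetical choices made for illustration only.

```python
import numpy as np

def augment(W, b, c):
    """Pack W (num_hidden x num_visible), visible biases b and hidden biases c
    into one (num_hidden + 1) x (num_visible + 1) matrix.

    Column 0 holds the hidden biases (weights from an always-on visible unit).
    Row 0 holds the visible biases (weights from an always-on hidden unit).
    The corner entry [0, 0] would link the two always-on units; it is unused.
    """
    num_hidden, num_visible = W.shape
    A = np.zeros((num_hidden + 1, num_visible + 1))
    A[1:, 1:] = W
    A[1:, 0] = c
    A[0, 1:] = b
    return A

def hidden_input(A, v):
    """Total input to each real hidden unit; equals W @ v + c."""
    v_aug = np.concatenate(([1.0], v))   # prepend the always-on visible unit
    return (A @ v_aug)[1:]               # drop the always-on hidden unit's row

def visible_input(A, h):
    """Total input to each real visible unit; equals W.T @ h + b."""
    h_aug = np.concatenate(([1.0], h))   # prepend the always-on hidden unit
    return (A.T @ h_aug)[1:]             # drop the always-on visible unit's column
```

With this layout a single weight update on `A` also updates the biases, provided the always-on units are clamped to 1 at every step. The corner entry has no meaning in the model; a table printed from such a matrix would show a "bias-bias" entry, which may be what the question noticed.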

Let's be concrete. I'll write down the usual $n$-step Contrastive Divergence update rule, ignoring regularization for simplicity. Also for simplicity, this update rule is for a "minibatch" of 1 data vector. The update for a minibatch with $k$ vectors is the average update over all $k$ vectors. The update rule is:

$$ W_{i,j}^{(t)} \gets W_{i,j}^{(t-1)} + \eta\left( \hat{h}_{i}^{(0)} \cdot d_j - \hat{h}_{i}^{(n)} \cdot v_j^{(n)} \right) $$ where:

  • $W_{i,j}^{(t)}$ is the weight connecting visible unit $v_j$ to hidden unit $h_i$ after update number $t$
  • $\eta$ is the learning rate
  • $\hat{h}_{i}^{(n)}$ is the probability of hidden unit $i$ being active once the machine has been exposed to data vector $\mathbf{d}$ and evolved for $n$ steps.
    • which means that $\hat{h}_{i}^{(0)}$ is just the activation of hidden unit $i$ in immediate response to the data vector
  • $d_j$ is the $j$th component of the data vector $\mathbf{d}$
  • $v_{j}^{(n)}$ is the state of visible unit $j$ once the machine has been exposed to the data vector and evolved for $n$ steps.

(Some people use $i$ to index the visible units and $j$ to index the hidden ones, but still write $W_{i,j}$; it doesn't matter as long as you multiply the correct values together.)
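The weight rule above could be sketched in NumPy as follows. The names `cd_n_weight_update` and `sample`, the logistic units, and the decision to compute $\hat{h}^{(n)}$ from sampled visible states are assumptions of this sketch rather than anything the rule itself prescribes.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p, rng):
    """Draw binary states (0.0 or 1.0) with on-probabilities p."""
    return (rng.random(p.shape) < p).astype(float)

def cd_n_weight_update(W, b, c, d, n=1, eta=0.1, rng=None):
    """One CD-n weight update for a single data vector d.

    W has shape (num_hidden, num_visible); b and c are the visible and
    hidden biases, which this function leaves unchanged.
    """
    rng = np.random.default_rng() if rng is None else rng

    h_hat_0 = sigmoid(W @ d + c)       # \hat{h}^{(0)}: response to the data
    h_hat, v = h_hat_0, d
    for _ in range(n):
        h = sample(h_hat, rng)                  # hidden states
        v = sample(sigmoid(W.T @ h + b), rng)   # visible states v^{(n)}
        h_hat = sigmoid(W @ v + c)              # \hat{h}^{(n)}

    positive = np.outer(h_hat_0, d)    # entry [i, j] is \hat{h}_i^{(0)} * d_j
    negative = np.outer(h_hat, v)      # entry [i, j] is \hat{h}_i^{(n)} * v_j^{(n)}
    return W + eta * (positive - negative)
```

The outer products build the full matrix of pairwise terms in one go, so the indexing question in the parenthetical above reduces to keeping the hidden vector first and the visible vector second.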

Be careful to distinguish between the "state" of a unit, denoted by $h_i^{(n)}$ or $v_j^{(n)}$, and the "activation" of a unit, denoted $\hat{h}_i^{(n)}$ or $\hat{v}_j^{(n)}$. The state of a unit is either 0 or 1, whereas the activation is a real number between 0 and 1: the probability that the state is 1. If the activation is 0.8, then the state is likely to be 1, but 20% of the time it will be 0.
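A quick way to see the difference is to sample many states from one fixed activation. This is only an illustrative sketch; the variable names and the sample count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

activation = 0.8   # probability that the unit is on
# Each state is 1 when a uniform draw falls below the activation, else 0.
states = (rng.random(100_000) < activation).astype(int)

# Every individual state is 0 or 1, but their average approaches the activation.
fraction_on = states.mean()
```

Here `fraction_on` comes out close to 0.8, while no single entry of `states` is ever anything other than 0 or 1.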

By treating biases as weights to "always on" units, you'll find that the weight update equation above reduces to the bias updates given under "TLDR". There is one slight difference, however, in the update to visible biases: the TLDR rule uses the visible activation $\hat{v}_j^{(n)}$ where the weight rule uses the state $v_j^{(n)}$. The activation has the same expected value as the state but lower variance, so this reduces noise in the learning signal. See this guide $\S3$ for a brief discussion of when using activations instead of states is desirable.
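Spelled out, the reduction goes as follows (a sketch that uses only the symbols already defined). For a hidden bias, replace visible unit $j$ in the weight rule with the always-on visible unit, so that $d_j = 1$ and $v_j^{(n)} = 1$:

$$ c_i^{(t)} \gets c_i^{(t-1)} + \eta\left( \hat{h}_{i}^{(0)} \cdot 1 - \hat{h}_{i}^{(n)} \cdot 1 \right) = c_i^{(t-1)} + \eta\left( \hat{h}_{i}^{(0)} - \hat{h}_{i}^{(n)} \right) $$

For a visible bias, replace hidden unit $i$ with the always-on hidden unit, so that $\hat{h}_{i}^{(0)} = \hat{h}_{i}^{(n)} = 1$:

$$ b_j^{(t)} \gets b_j^{(t-1)} + \eta\left( 1 \cdot d_j - 1 \cdot v_j^{(n)} \right) = b_j^{(t-1)} + \eta\left( d_j - v_j^{(n)} \right) $$

Swapping the state $v_j^{(n)}$ for the activation $\hat{v}_j^{(n)}$ in the last expression gives the visible-bias rule under "TLDR".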