TLDR
For $n$-step Contrastive Divergence, update the visible bias $b_j$ based on data vector $\mathbf{d}$ using:
$$
b_j^{(t)} \gets b_j^{(t-1)} + \eta \left( d_j - \hat{v}_j^{(n)} \right)
$$
Update the hidden bias $c_i$ using:
$$
c_i^{(t)} \gets c_i^{(t-1)} + \eta \left( \hat{h}_{i}^{(0)} - \hat{h}_{i}^{(n)} \right)
$$
Where $b_j^{(t)}$ and $c_i^{(t)}$ are the biases after update number $t$, $\eta$ is the learning rate, $d_j$ is the $j$th component of the data vector, and $\hat{h}_i^{(n)}$ and $\hat{v}_j^{(n)}$ are the probabilities of hidden unit $i$ and visible unit $j$ being active once the RBM has been exposed to the data and run for $n$ steps. This assumes a minibatch size of 1; for a practical minibatch size $k$, average the updates obtained over the $k$ data vectors.
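As a quick sketch (all names here are my own, not from any particular implementation), the two bias updates for a minibatch of 1 look like:

```python
import numpy as np

# Minimal sketch of the bias updates above, for a minibatch of one data vector.
# Names are illustrative: d is the data vector, v_hat_n the visible
# activations after n steps, h_hat_0 / h_hat_n the hidden activations at
# step 0 and step n, eta the learning rate.
def update_biases(b, c, d, v_hat_n, h_hat_0, h_hat_n, eta=0.1):
    b = b + eta * (d - v_hat_n)        # visible biases
    c = c + eta * (h_hat_0 - h_hat_n)  # hidden biases
    return b, c
```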
Full explanation
I had the same trouble. A good way to think of it is that the biases are themselves just weights. Often in neural network models, the bias of a unit is modeled as the weight of a link connecting the unit in question to an "always on" unit, i.e., an imaginary unit whose activation is always 1.
In the RBM case, that would mean that you think of there being one extra visible unit whose output is always 1. This visible unit attaches to each of the hidden units (just like any other visible unit does), and the weights of these connections are the biases of the respective hidden units. Similarly, the biases of the visible units can be modeled by imagining an extra hidden unit, whose value is always 1, and which connects to each of the visible units, with the weights of these connections being the visible biases.
You could even implement your RBM this way, but I don't think people usually do that. The point is that, thinking about it in this way, you can use (essentially) the same update rule for the biases as you do for the weights, since biases are just weights connecting to "always on" units.
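To make that concrete, here is a hypothetical numpy sketch of folding both bias vectors into one augmented weight matrix by adding the two "always on" units (the names, shapes, and helper are my own assumptions):

```python
import numpy as np

# Hypothetical sketch: W has shape (n_hidden, n_visible), c holds the hidden
# biases, b the visible biases. We add one always-on visible unit (whose
# weights into the hidden layer are c) and one always-on hidden unit (whose
# weights into the visible layer are b).
def augment(W, b, c):
    n_h, n_v = W.shape
    W_aug = np.zeros((n_h + 1, n_v + 1))
    W_aug[:n_h, :n_v] = W
    W_aug[:n_h, n_v] = c   # column for the always-on visible unit
    W_aug[n_h, :n_v] = b   # row for the always-on hidden unit
    return W_aug

# Hidden pre-activations are then the usual product, with the data vector
# extended by a constant 1:
#   W @ v + c == (W_aug @ np.append(v, 1.0))[:n_h]
```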
Let's be concrete. I'll write down the usual $n$-step Contrastive Divergence update rule, ignoring regularization for simplicity. Also for simplicity, this update rule is for a "minibatch" of 1 data vector. The update for a minibatch with $k$ vectors is the average update over all $k$ vectors. The update rule is:
$$
W_{i,j}^{(t)} \gets W_{i,j}^{(t-1)} + \eta\left( \hat{h}_{i}^{(0)} \cdot d_j - \hat{h}_{i}^{(n)} \cdot v_j^{(n)} \right)
$$
where:
- $W_{i,j}^{(t)}$ is the weight connecting visible unit $v_j$ to hidden unit $h_i$ after update number $t$
- $\eta$ is the learning rate
- $\hat{h}_{i}^{(n)}$ is the probability of hidden unit $i$ being active once the machine has been exposed to data vector $\mathbf{d}$ and evolved for $n$ steps.
- which means that $\hat{h}_{i}^{(0)}$ is just the activation of hidden unit $i$ in immediate response to the data vector
- $d_j$ is the $j$th component of the data vector $\mathbf{d}$
- $v_{j}^{(n)}$ is the state of visible unit $j$ once the machine has been exposed to the data vector and evolved for $n$ steps.
(Some people use $i$ to index the visible units and $j$ to index the hidden ones, but still write $W_{i,j}$ --- it doesn't matter as long as you multiply the correct values together.)
Be careful to distinguish the "state" of a unit, denoted by $h_i^{(n)}$ or $v_j^{(n)}$, from the "activation" of a unit, denoted $\hat{h}_i^{(n)}$ or $\hat{v}_j^{(n)}$. The state of a unit is either 0 or 1, whereas the activation is a real number between 0 and 1. If the activation is 0.8, then the state is likely to be 1, but 20% of the time it will be 0.
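In code, turning an activation into a state is just a Bernoulli draw (a toy sketch with made-up values):

```python
import numpy as np

# Toy sketch: activations are probabilities in [0, 1]; states are 0/1 samples.
rng = np.random.default_rng(0)
activation = np.array([0.8, 0.3, 0.5])
state = (rng.random(activation.shape) < activation).astype(float)  # each 0.0 or 1.0
```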
By treating biases as weights to "always on" units, you'll find that the equation above simplifies to the ones given for the bias updates under the "TLDR". There is one slight difference, however, in the update to the visible biases: there the visible activation is used instead of the state. The activation has the same expected value but lower variance than the state, so using it reduces noise in the learning signal. See this guide, $\S3$, for a brief discussion of when using activations instead of states is desirable.
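Putting the pieces together, here is a hedged numpy sketch of one CD-1 update for a single data vector, following the equations above (the weight update uses the sampled visible state, the visible bias update uses the visible activation; all names and the helper are my own, and regularization is omitted):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Sketch of one CD-1 step for a single binary data vector d.
# W: (n_hidden, n_visible); b: visible biases; c: hidden biases.
def cd1_step(W, b, c, d, eta=0.1, rng=None):
    rng = rng or np.random.default_rng(0)
    h_hat_0 = sigmoid(W @ d + c)                       # hidden activations from data
    h_0 = (rng.random(h_hat_0.shape) < h_hat_0) * 1.0  # sampled hidden states
    v_hat_1 = sigmoid(W.T @ h_0 + b)                   # visible activations
    v_1 = (rng.random(v_hat_1.shape) < v_hat_1) * 1.0  # sampled visible states
    h_hat_1 = sigmoid(W @ v_1 + c)                     # hidden activations, step 1
    W = W + eta * (np.outer(h_hat_0, d) - np.outer(h_hat_1, v_1))
    b = b + eta * (d - v_hat_1)                        # activation, not state
    c = c + eta * (h_hat_0 - h_hat_1)
    return W, b, c
```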
That text, if it's this one, seems to be alluding to information-theoretic entropy. By definition, entropy is the expectation of the negative log of the probability, in which case "find the expectation" is what he means by "estimate."
For some intuition on why this measure makes sense, consider this from the wiki introduction:
The entropy of the message is its amount of uncertainty; it increases when the message is closer to random, and decreases when it is less random. The idea here is that the less likely an event is, the more information it provides when it occurs.
In essence, Hinton seems to suggest that the more uncertain you are of the inputs, the more hidden units you should have. This gels well with his conclusion to the same paragraph:
If the training cases are highly redundant, as they typically will be for very big training sets, you need to use fewer parameters.
To the question of calculation, I don't see how one can do this without assuming some distribution for the vector. For instance, say your data are images of the digits 0-9, and your input encodes pixels as a 256-length vector of binary values. If you assume every pixel is as likely to be 0 as 1, then the entropy will be 256 bits.
But say you learn that pixels near the corners are less likely to be on, which is sensible given that most writers of Western digits won't write near the corners. If you adjust those probabilities down and the others up, you'll find that the entropy falls slightly; that is, your measure of uncertainty drops. This example is a rather ad hoc application of subjective domain knowledge, but you could also be more rigorous, e.g., estimate the distribution from the training data.
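As a toy illustration (the per-pixel probabilities are made up), the entropy of a vector of independent binary pixels can be computed directly from each pixel's probability of being on:

```python
import numpy as np

# Toy sketch: total entropy (in bits) of 256 independent binary pixels,
# given each pixel's probability of being 1. Probabilities are made up.
def total_entropy_bits(p):
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    return float(np.sum(-p * np.log2(p) - (1 - p) * np.log2(1 - p)))

uniform = total_entropy_bits(np.full(256, 0.5))  # maximally uncertain: 256 bits
skewed = np.full(256, 0.5)
skewed[:32] = 0.05   # e.g. corner pixels rarely on
lower = total_entropy_bits(skewed)               # strictly less than 256
```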
Best Answer
I'm not sure if you are looking for a clearer introduction to RBMs or just an example.
If you prefer to look at an actual implementation, I have implemented the RBM in Matlab because I found existing implementations to lack the modularity I needed. You can find it here: https://github.com/pixelou/nnbox/blob/master/networks/RBM.m (the pretrain method starts at line 119). I cannot guarantee that it is bug-free, though :-)
For a comprehensive introduction to restricted Boltzmann machines, have a look at Training restricted Boltzmann machines: An introduction by Asja Fischer & Christian Igel; it is the clearest paper in terms of proofs and structure. I found it a bit hard to follow on a first read, though, so you may want to split your reading into several stages:
1 - Consider the problem of learning a model (an RBM) as minimizing the distance between the RBM's parametric pdf and the underlying dataset distribution.
2 - Compute the gradient of that distance so that you can do gradient descent.
I have detailed the computation here: How to derive the gradient formula for the Maximum Likelihood in RBM?
3 - Notice that you cannot compute that expression exactly because one part of it is intractable :-). Use Contrastive Divergence (or another method, but CD is the most popular) to approximate it. You can read A New Learning Algorithm for Mean Field Boltzmann Machines by Welling & Hinton (2001), which is one of the early publications on CD.
4 - Use that gradient to perform the update, adjusted with regularization and other tricks. For that matter, A Practical Guide to Training Restricted Boltzmann Machines by Hinton (2010) is a must-read.