There is a paper, An Introduction to Restricted Boltzmann Machines, that may help you.
For the first question: if you can understand formulas 9 and 30 in that paper, you will get the answer.
"Why is every hidden unit being set to 1 ", I think there is no sense if hidden($H_j$) unit set $0$, because no matter what the $V_i$ and $W_{ij}$ set to be (assumption $V_i$ connected $H_j$ by parameter $W_{ij}$), the result is $V_i*W_{ij}*H_j=0$ (in the formula $E(V,H)$ used in energy model); so we hope it to be $1$.
It's not important to know exactly when the system is said to be in equilibrium; equilibrium occurs when $P(H(i) \mid V(i)) = P(H(i+L) \mid V(i+L))$.
TLDR
For $n$-step Contrastive Divergence, update the visible bias $b_j$ based on data vector $\mathbf{d}$ using:
$$
b_j^{(t)} \gets b_j^{(t-1)} + \eta \left( d_j - \hat{v}_j^{(n)} \right)
$$
Update the hidden bias $c_i$ using:
$$
c_i^{(t)} \gets c_i^{(t-1)} + \eta \left( \hat{h}_{i}^{(0)} - \hat{h}_{i}^{(n)} \right)
$$
Where $b_j^{(t)}$ and $c_i^{(t)}$ are the biases after update number $t$, $\eta$ is the learning rate, $d_j$ is the $j$th component of the data vector, and $\hat{h}_i^{(n)}$ and $\hat{v}_j^{(n)}$ are the probabilities of hidden unit $i$ and visible unit $j$ being active once the RBM has been exposed to the data and run for $n$ steps. This assumes a minibatch size of 1; for a practical minibatch size $k$, average the updates obtained over the $k$ data vectors.
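As a sketch of the two update rules above in NumPy (the function name `update_biases` is mine, not from any library; computing the activations themselves is covered in the full explanation below):

```python
import numpy as np

def update_biases(b, c, d, h0_act, vn_act, hn_act, eta=0.1):
    """One CD-n bias update for a single data vector (minibatch size 1).

    b, c           -- visible and hidden bias vectors
    d              -- binary data vector
    h0_act         -- hidden activations driven directly by the data (n = 0)
    vn_act, hn_act -- visible/hidden activations after n Gibbs steps
    """
    b_new = b + eta * (d - vn_act)       # visible: data minus reconstruction
    c_new = c + eta * (h0_act - hn_act)  # hidden: positive minus negative phase
    return b_new, c_new
```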
Full explanation
I had the same trouble. A good way to think of it is that the biases are themselves just weights. Often in neural network models, the bias of a unit is modeled as the weight of a link connecting the unit in question to an "always on" unit, i.e., an imaginary unit whose activation is always 1.
In the RBM case, that would mean that you think of there being one extra visible unit whose output is always 1. This visible unit attaches to each of the hidden units (just like any other visible unit does), and the weight of these connections are the biases of the respective hidden units. Similarly, the biases of the visible units can be modeled by imagining an extra hidden unit, whose value is always one, and which connects to each of the visible units, with the weights of these connections being the visible biases.
You could even implement your RBM this way, but I don't think people usually do that. The point is that, thinking about it in this way, you can use (essentially) the same update rule for the biases as you do for the weights, since biases are just weights connecting to "always on" units.
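To make the "always on" trick concrete, here is a small NumPy check (a sketch, using the $W_{i,j}$ convention from the update rule below: row $i$ indexes hidden units, column $j$ visible units) showing that appending a visible unit fixed at 1, with the hidden biases as its connection weights, reproduces the usual pre-activation input:

```python
import numpy as np

rng = np.random.default_rng(0)
n_vis, n_hid = 4, 3
W = rng.standard_normal((n_hid, n_vis))  # W[i, j]: hidden i <- visible j
c = rng.standard_normal(n_hid)           # hidden biases
v = np.array([1.0, 0.0, 1.0, 1.0])       # some visible state

# Ordinary form: pre-activation input to the hidden layer
pre = W @ v + c

# "Always on" form: append a visible unit fixed at 1 and fold the
# hidden biases in as that unit's connection weights.
W_aug = np.hstack([W, c[:, None]])
v_aug = np.append(v, 1.0)
pre_aug = W_aug @ v_aug

assert np.allclose(pre, pre_aug)  # identical input to the hidden units
```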
Let's be concrete. I'll write down the usual $n$-step Contrastive Divergence update rule, ignoring regularization for simplicity. Also for simplicity, this update rule is for a "minibatch" of 1 data vector. The update for a minibatch with $k$ vectors is the average update over all $k$ vectors. The update rule is:
$$
W_{i,j}^{(t)} \gets W_{i,j}^{(t-1)} + \eta\left( \hat{h}_{i}^{(0)} \cdot d_j - \hat{h}_{i}^{(n)} \cdot v_j^{(n)} \right)
$$
where:
- $W_{i,j}^{(t)}$ is the weight connecting visible unit $v_j$ to hidden unit $h_i$ after update number $t$
- $\eta$ is the learning rate
- $\hat{h}_{i}^{(n)}$ is the probability of hidden unit $i$ being active once the machine has been exposed to data vector $\mathbf{d}$ and evolved for $n$ steps.
- which means that $\hat{h}_{i}^{(0)}$ is just the activation of hidden unit $i$ in immediate response to the data vector
- $d_j$ is the $j$th component of the data vector $\mathbf{d}$
- $v_{j}^{(n)}$ is the state of visible unit $j$ once the machine has been exposed to the data vector and evolved for $n$ steps.
(Some people use $i$ to index the visible units and $j$ to index the hidden ones, but still write $W_{i,j}$ --- it doesn't matter as long as you multiply the correct values together.)
Be careful to distinguish the "state" of a unit, denoted by $h_i^{(n)}$ or $v_j^{(n)}$, from the "activation" of a unit, denoted $\hat{h}_i^{(n)}$ or $\hat{v}_j^{(n)}$. The state of a unit is either 0 or 1, whereas the activation is any real number between 0 and 1. If the activation is 0.8, then the state is likely to be 1, but 20% of the time it will be 0.
By treating biases as weights to "always on" units, you'll find that the equation above simplifies to the ones given for bias updates under the "TLDR". There is one slight difference, however, in the update to visible biases: here the visible activation is used instead of the state. The activation has the same expected value, but has lower variance than the state, so this reduces noise in the learning signal. See this guide $\S3$ for a brief discussion of when using activations instead of states is desirable.
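Putting the pieces together, a minimal CD-1 step for a binary RBM might look like this (a sketch under the conventions above, not a tuned implementation; `cd1_step` is my name, and it uses the visible activations for the bias update as just discussed):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(W, b, c, d, eta=0.1, rng=None):
    """One CD-1 update on a single binary data vector d (minibatch size 1).

    W[i, j] connects visible unit j to hidden unit i;
    b and c are the visible and hidden bias vectors.
    """
    rng = np.random.default_rng(0) if rng is None else rng

    # Positive phase: hidden activations in immediate response to the data
    h0_act = sigmoid(W @ d + c)                       # hat{h}^{(0)}
    h0_state = (rng.random(h0_act.shape) < h0_act).astype(float)

    # Negative phase: one Gibbs step back down and up again
    v1_act = sigmoid(W.T @ h0_state + b)              # hat{v}^{(1)}
    v1_state = (rng.random(v1_act.shape) < v1_act).astype(float)
    h1_act = sigmoid(W @ v1_state + c)                # hat{h}^{(1)}

    # Updates: outer products for the weights, differences for the biases
    W = W + eta * (np.outer(h0_act, d) - np.outer(h1_act, v1_state))
    b = b + eta * (d - v1_act)   # visible bias uses the activation, not the state
    c = c + eta * (h0_act - h1_act)
    return W, b, c
```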
Best Answer
The Restricted Boltzmann Machine is an energy-based model. The energy function produces a scalar value corresponding to a configuration of the model, and that value indicates how probable the configuration is: configurations with low energy have high probability. Learning the model means modifying the shape of the energy function so that it assigns low energy to the desired configurations. As far as the biases go, you take the dot product between the biases and the corresponding units (visible or hidden) to compute their contribution to the energy function.
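As an illustration of that dot-product contribution, the energy of a joint configuration can be computed directly (a sketch with the conventional sign, where lower energy means higher probability; the tiny weights and biases here are made up for the example):

```python
import numpy as np

def energy(v, h, W, b, c):
    """E(v, h) = -b.v - c.h - h^T W v  (lower energy -> higher probability)."""
    return -(b @ v) - (c @ h) - (h @ W @ v)

# Weights and biases chosen to favour v = [1, 1] with h = [1]
W = np.array([[2.0, 2.0]])  # W[i, j]: hidden i <-> visible j
b = np.array([1.0, 1.0])    # visible biases
c = np.array([1.0])         # hidden bias

on  = energy(np.array([1.0, 1.0]), np.array([1.0]), W, b, c)  # -> -7.0
off = energy(np.array([0.0, 0.0]), np.array([0.0]), W, b, c)  # ->  0.0
assert on < off  # the favoured configuration has lower energy
```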