There is a paper, An Introduction to Restricted Boltzmann Machines, that may help you.
For the first question, if you can understand formulas 9 and 30 in that paper, you will get the answer.
"Why is every hidden unit being set to 1 ", I think there is no sense if hidden($H_j$) unit set $0$, because no matter what the $V_i$ and $W_{ij}$ set to be (assumption $V_i$ connected $H_j$ by parameter $W_{ij}$), the result is $V_i*W_{ij}*H_j=0$ (in the formula $E(V,H)$ used in energy model); so we hope it to be $1$.
It is not important to know exactly when the system is said to be in equilibrium; equilibrium occurs when $P(H(i) \mid V(i)) = P(H(i+L) \mid V(i+L))$.
TLDR
For $n$-step Contrastive Divergence, update the visible bias $b_j$ based on the data vector $\mathbf{d}$ using:
$$
b_j^{(t)} \gets b_j^{(t-1)} + \eta \left( d_j - \hat{v}_j^{(n)} \right)
$$
Update the hidden bias $c_i$ using:
$$
c_i^{(t)} \gets c_i^{(t-1)} + \eta \left( \hat{h}_{i}^{(0)} - \hat{h}_{i}^{(n)} \right)
$$
Where $b_j^{(t)}$ and $c_i^{(t)}$ are the biases after update number $t$, $\eta$ is the learning rate, $d_j$ is the $j$th component of the data vector, and $\hat{h}_i^{(n)}$ and $\hat{v}_j^{(n)}$ are the probabilities of hidden unit $i$ and visible unit $j$ being active once the RBM has been exposed to the data and run for $n$ steps. This assumes a minibatch size of 1; for a practical minibatch of size $k$, average the updates obtained over the $k$ data vectors.
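As a minimal numpy sketch of these two updates (all variable names and values are illustrative, not taken from any library):

```python
import numpy as np

eta = 0.01                          # learning rate

# Illustrative values for a 4-visible / 3-hidden RBM:
b = np.zeros(4)                     # visible biases b_j
c = np.zeros(3)                     # hidden biases c_i
d = np.array([1., 0., 1., 1.])      # data vector d
v_n = np.array([.9, .2, .8, .7])    # \hat{v}^{(n)}: visible activations after n steps
h0 = np.array([.6, .4, .7])         # \hat{h}^{(0)}: hidden activations on the data
h_n = np.array([.5, .5, .6])        # \hat{h}^{(n)}: hidden activations after n steps

b += eta * (d - v_n)                # visible bias update
c += eta * (h0 - h_n)               # hidden bias update
```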
Full explanation
I had the same trouble. A good way to think of it is that the biases are themselves just weights. Often in neural network models, the bias of a unit is modeled as the weight of a link connecting the unit in question to an "always on" unit, i.e., an imaginary unit whose activation is always 1.
In the RBM case, that would mean thinking of there being one extra visible unit whose output is always 1. This visible unit attaches to each of the hidden units (just like any other visible unit does), and the weights of these connections are the biases of the respective hidden units. Similarly, the biases of the visible units can be modeled by imagining an extra hidden unit whose value is always 1, and which connects to each of the visible units, with the weights of these connections being the visible biases.
You could even implement your RBM this way, but I don't think people usually do that. The point is that, thinking about it in this way, you can use (essentially) the same update rule for the biases as you do for the weights, since biases are just weights connecting to "always on" units.
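To make the "always on" picture concrete, here is a small sketch (illustrative shapes and names) that folds the hidden biases into the weight matrix by appending a constant-1 visible unit:

```python
import numpy as np

# Illustrative shapes: 3 hidden units, 4 visible units.
W = np.zeros((3, 4))                 # weights, shape (num_hidden, num_visible)
c = np.zeros(3)                      # hidden biases
v = np.array([1., 0., 1., 1.])       # a visible vector

# Add an "always on" visible unit whose value is fixed at 1;
# its column of weights is exactly the hidden bias vector.
W_aug = np.hstack([W, c[:, None]])   # shape (num_hidden, num_visible + 1)
v_aug = np.append(v, 1.0)            # visible vector with the always-on unit

# The usual pre-activation W @ v + c becomes a single matrix product.
assert np.allclose(W_aug @ v_aug, W @ v + c)
```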
Let's be concrete. I'll write down the usual $n$-step Contrastive Divergence update rule, ignoring regularization for simplicity. Also, this update rule is for a "minibatch" of one data vector; the update for a minibatch with $k$ vectors is the average update over all $k$ vectors. The update rule is:
$$
W_{i,j}^{(t)} \gets W_{i,j}^{(t-1)} + \eta\left( \hat{h}_{i}^{(0)} \cdot d_j - \hat{h}_{i}^{(n)} \cdot v_j^{(n)} \right)
$$
where:
- $W_{i,j}^{(t)}$ is the weight connecting visible unit $v_j$ to hidden unit $h_i$ after update number $t$
- $\eta$ is the learning rate
- $\hat{h}_{i}^{(n)}$ is the probability of hidden unit $i$ being active once the machine has been exposed to data vector $\mathbf{d}$ and evolved for $n$ steps.
- which means that $\hat{h}_{i}^{(0)}$ is just the activation of hidden unit $i$ in immediate response to the data vector
- $d_j$ is the $j$th component of the data vector $\mathbf{d}$
- $v_{j}^{(n)}$ is the state of visible unit $j$ once the machine has been exposed to the data vector and evolved for $n$ steps.
(Some people use $i$ to index the visible units and $j$ to index the hidden ones, but still write $W_{i,j}$ --- it doesn't matter as long as you multiply the correct values together.)
Be careful to distinguish the "state" of a unit, denoted by $h_i^{(n)}$ or $v_j^{(n)}$, from its "activation", denoted $\hat{h}_i^{(n)}$ or $\hat{v}_j^{(n)}$. The state of a unit is either 0 or 1, whereas the activation is a real number between 0 and 1. If the activation is 0.8, the state is likely to be 1, but 20% of the time it will be 0.
By treating biases as weights to "always on" units, you'll find that the equation above simplifies to the bias-update rules given in the TLDR. There is one slight difference in the update to the visible biases, however: there the visible activation $\hat{v}_j^{(n)}$ is used instead of the state $v_j^{(n)}$. The activation has the same expected value but lower variance than the state, so this reduces noise in the learning signal. See $\S 3$ of this guide for a brief discussion of when using activations instead of states is desirable.
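Putting the weight rule above together with the bias rules from the TLDR, a rough CD-$n$ sketch for a binary RBM with sigmoid units could look like the following (plain numpy, minibatch of one; all names are illustrative and not taken from any particular implementation):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd_n_update(W, b, c, d, n=1, eta=0.01, rng=None):
    """One CD-n update for a binary RBM on a single data vector d.

    W : (num_hidden, num_visible) weights
    b : visible biases,  c : hidden biases
    """
    rng = rng or np.random.default_rng()

    # Positive phase: hidden activations in immediate response to the data.
    h0 = sigmoid(W @ d + c)                          # \hat{h}^{(0)}
    h_state = (rng.random(h0.shape) < h0).astype(float)

    # n steps of alternating Gibbs sampling.
    for _ in range(n):
        v_prob = sigmoid(W.T @ h_state + b)          # \hat{v}^{(k)}
        v_state = (rng.random(v_prob.shape) < v_prob).astype(float)
        h_prob = sigmoid(W @ v_state + c)            # \hat{h}^{(k)}
        h_state = (rng.random(h_prob.shape) < h_prob).astype(float)

    # Negative phase uses the chain after n steps.
    W += eta * (np.outer(h0, d) - np.outer(h_prob, v_state))
    b += eta * (d - v_prob)                          # visible activation, not state
    c += eta * (h0 - h_prob)
    return W, b, c
```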
Best Answer
I remember struggling with a similar thing. Did you look at the MATLAB implementation of Hinton's Gaussian RBM? I guess that one has Gaussian hidden units and binary visible units, so it's the other way around, but still relevant.
In section 13.2 of his training paper he gives a different formula for the energy, in which visible unit $i$ has mean $a_i$ and variance $\sigma_i^2$. That's what he means by linear units with independent noise. Each visible unit has those two parameters associated with it. Other than that, you are right in saying that you "use the simple weighted sum of hiddens (plus bias)".
Furthermore, he goes on to say that in practice one should normalize the data to have zero mean and unit variance. See this answer on how to achieve that. So in practice you set the variance to $1$ in the energy formula and perform the normalization step on your data.
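A minimal way to do that normalization (an ordinary per-dimension z-score, not a routine from Hinton's code):

```python
import numpy as np

X = np.random.randn(100, 20) * 3.0 + 5.0   # stand-in data, one example per row

mean = X.mean(axis=0)
std = X.std(axis=0)
X_norm = (X - mean) / (std + 1e-8)         # zero mean, unit variance per dimension
```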
Patric