I am trying to follow the original paper by G. E. Hinton, Training Products of Experts by Minimizing Contrastive Divergence.
However, I can't verify equation (5), where he says:
$$
\begin{aligned}
-\frac{\partial}{\partial \theta_m}\left(Q^0 || Q^\infty-Q^1 || Q^\infty\right) ={} & \left\langle\frac{\partial \log p_{m}(\mathbf{d} | \theta_m)}{\partial \theta_m}\right\rangle_{Q^0}-\left\langle\frac{\partial \log p_{m}(\hat{\mathbf{d}} | \theta_m)}{\partial \theta_m}\right\rangle_{Q^1} \\
& +\frac{\partial Q^1}{\partial \theta_m} \frac{\partial\, Q^1 ||Q^\infty}{\partial Q^1}
\end{aligned}
$$
I am not sure if this is the right place to ask, but I have almost derived the equation; it does not exactly match the paper, so I must be missing something.
My approach so far:
The third term of the RHS is of course the same as the second term of the LHS, obtained by introducing the factor $1 = \frac{\partial Q^1}{\partial Q^1}$. So one only has to consider the derivative of the first KL-divergence term. I use the result of equation (3):
$$ Q^0||Q^\infty =\sum_\mathbf{d} Q_\mathbf{d}^0 \log Q_\mathbf{d}^0-\sum_\mathbf{d} Q_\mathbf{d}^0 \log Q_\mathbf{d}^\infty=-H\left(Q^0\right)-\left\langle\log Q_\mathbf{d}^\infty\right\rangle_{Q^0}$$
in the paper and the fact that the original data distribution $Q^0$ is independent of the model parameters $\theta_m$, so the partial derivative of the entropy of the data (denoted by $H(Q^0)$) with respect to the model parameters vanishes:
$$
\begin{aligned}
-\frac{\partial}{\partial \theta_m} Q^0||Q^\infty
&= \frac{\partial}{\partial \theta_m} \left( H\left(Q^0\right)+\left\langle\log Q_\mathbf{d}^\infty\right\rangle_{Q^0} \right)
= \frac{\partial}{\partial \theta_m} \left\langle\log Q_\mathbf{d}^\infty \right\rangle_{Q^0} \\
&= \sum_\mathbf{d} Q_\mathbf{d}^0 \frac{\partial}{\partial \theta_m} \log Q_\mathbf{d}^\infty
= \left\langle \frac{\partial \log Q_\mathbf{d}^\infty }{\partial \theta_m} \right\rangle_{Q^0}
\end{aligned}
$$
In the next step I can use equation (4):
$$
\left\langle\frac{\partial \log Q_\mathbf{d}^\infty}{\partial \theta_m}\right\rangle_{Q^0} =\left\langle\frac{\partial \log p_{m}(\mathbf{d} | \theta_m)}{\partial \theta_m}\right\rangle_{Q^0}-\left\langle\frac{\partial \log p_{m}(\mathbf{c} | \theta_m)}{\partial \theta_m}\right\rangle_{Q^\infty}
$$
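As a quick numerical sanity check (my own toy construction, not from the paper), Eq. (4) can be verified for a single expert over three states with $p_m(\mathbf{d}\,|\,\theta_m) = e^{\theta f(\mathbf{d})}$, for which $Q^\infty$ is a softmax and $\partial \log p_m / \partial \theta_m = f(\mathbf{d})$ (the values of $f$ and $Q^0$ below are arbitrary):

```python
import math

# Toy single-expert model over 3 states: p(d | theta) = exp(theta * f[d]),
# unnormalized, so Q_inf = softmax(theta * f).  Eq. (4) then predicts
#   d/dtheta <log Q_inf>_{Q0} = <f>_{Q0} - <f>_{Q_inf},
# since d log p(d | theta) / d theta = f[d].
f = [0.0, 1.0, 2.0]      # "sufficient statistic" of each state (arbitrary toy values)
Q0 = [0.5, 0.3, 0.2]     # fixed data distribution, independent of theta

def Q_inf(theta):
    """Equilibrium (model) distribution: softmax of theta * f."""
    w = [math.exp(theta * fi) for fi in f]
    Z = sum(w)
    return [wi / Z for wi in w]

def avg_log_Qinf(theta):
    """<log Q_inf>_{Q0}, the only theta-dependent part of -(Q0 || Q_inf)."""
    return sum(p0 * math.log(qi) for p0, qi in zip(Q0, Q_inf(theta)))

theta, eps = 0.7, 1e-6
# LHS of Eq. (4): central finite difference of <log Q_inf>_{Q0}.
numeric = (avg_log_Qinf(theta + eps) - avg_log_Qinf(theta - eps)) / (2 * eps)
# RHS of Eq. (4): <f>_{Q0} - <f>_{Q_inf}.
analytic = (sum(p0 * fi for p0, fi in zip(Q0, f))
            - sum(qi * fi for qi, fi in zip(Q_inf(theta), f)))
print(numeric, analytic)  # the two values should agree to roughly 1e-9
```

This only checks the identity your derivation already established; the point of contention, the $Q^\infty$-expectation term, appears here as $\langle f\rangle_{Q^\infty}$.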
of the paper. If we now compare this result with the first two terms of the RHS of equation (5), only the second expectations differ:
$$
\left\langle\frac{\partial \log p_{m}(\mathbf{c} | \theta_m)}{\partial \theta_m}\right\rangle_{Q^\infty} \neq \left\langle\frac{\partial \log p_{m}(\hat{\mathbf{d}} | \theta_m)}{\partial \theta_m}\right\rangle_{Q^1}
$$
So is my interpretation right that we approximate the expectation over $Q^\infty$ with the expectation over $Q^1$? But in the sentence before equation (5) Hinton says:
> The mathematical motivation for the contrastive divergence is that the intractable expectation over $Q^\infty$ on the RHS of Eq. 4 cancels out
However, in my derivation there is nothing to cancel out. What am I doing wrong here? Or is my initial starting point, equation (5), already incorrect?
I would be glad if anyone could help me understand the steps the author made to arrive at equation (5).
Best Answer
Setting aside a few minor issues of notation, the key point is the following.
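Written out explicitly (my reconstruction of the derivation being referred to, in the notation of the question), the second line applies your Eq.-(4) result to $-\partial(Q^0||Q^\infty)/\partial\theta_m$ and the analogous result (with $Q^1$ in place of $Q^0$, holding $Q^1$ fixed) to $+\partial(Q^1||Q^\infty)/\partial\theta_m$, plus the chain-rule term for the dependence of $Q^1$ itself on $\theta_m$:

$$
\begin{aligned}
&-\frac{\partial}{\partial \theta_m}\left(Q^0 || Q^\infty - Q^1 || Q^\infty\right) \\
={} & \left\langle\frac{\partial \log p_m(\mathbf{d}|\theta_m)}{\partial \theta_m}\right\rangle_{Q^0}
- \left\langle\frac{\partial \log p_m(\mathbf{c}|\theta_m)}{\partial \theta_m}\right\rangle_{Q^\infty}
- \left\langle\frac{\partial \log p_m(\hat{\mathbf{d}}|\theta_m)}{\partial \theta_m}\right\rangle_{Q^1}
+ \left\langle\frac{\partial \log p_m(\mathbf{c}|\theta_m)}{\partial \theta_m}\right\rangle_{Q^\infty}
+ \frac{\partial Q^1}{\partial \theta_m}\frac{\partial\, Q^1 || Q^\infty}{\partial Q^1} \\
={} & \left\langle\frac{\partial \log p_m(\mathbf{d}|\theta_m)}{\partial \theta_m}\right\rangle_{Q^0}
- \left\langle\frac{\partial \log p_m(\hat{\mathbf{d}}|\theta_m)}{\partial \theta_m}\right\rangle_{Q^1}
+ \frac{\partial Q^1}{\partial \theta_m}\frac{\partial\, Q^1 || Q^\infty}{\partial Q^1}
\end{aligned}
$$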
Each term in the third line corresponds to a term in the second line. You can see that the average of the log-probability gradient under the model distribution ($Q^\infty$) appears twice, but with opposite signs, and so does indeed cancel.