Derivation of Contrastive Divergence

artificial-intelligence, machine-learning, monte-carlo, probability

I am trying to follow the original paper by G. E. Hinton, "Training Products of Experts by Minimizing Contrastive Divergence".

However, I can't verify equation (5), where he states:

$$
\begin{aligned}
-\frac{\partial}{\partial \theta_m}\left(Q^0 \| Q^\infty - Q^1 \| Q^\infty\right)
&= \left\langle\frac{\partial \log p_{m}(\mathbf{d} | \theta_m)}{\partial \theta_m}\right\rangle_{Q^0}
 - \left\langle\frac{\partial \log p_{m}(\hat{\mathbf{d}} | \theta_m)}{\partial \theta_m}\right\rangle_{Q^1} \\
&\quad + \frac{\partial Q^1}{\partial \theta_m}\,\frac{\partial\, Q^1 \| Q^\infty}{\partial Q^1}
\end{aligned}
$$

I am not sure if this is the right place to ask, but I have almost derived the equation; my result does not exactly match the paper, so I must be missing something.

My approach so far:

The third term on the RHS is of course just the second KL term on the LHS differentiated through $Q^1$, introducing the factor $1 = \frac{\partial Q^1}{\partial Q^1}$ as spelled out below. So one only has to consider the derivative of the first KL-divergence term.
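Explicitly, the chain-rule step I mean here (treating $Q^1 \| Q^\infty$ as a function of $Q^1$ only; that restriction is my own reading, not something stated in the paper) is

$$
\frac{\partial}{\partial \theta_m}\, Q^1 \| Q^\infty = \frac{\partial Q^1}{\partial \theta_m}\,\frac{\partial\, Q^1 \| Q^\infty}{\partial Q^1} .
$$

For the first KL term I use the result of equation (3):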

$$
Q^0 \| Q^\infty = \sum_\mathbf{d} Q_\mathbf{d}^0 \log Q_\mathbf{d}^0 - \sum_\mathbf{d} Q_\mathbf{d}^0 \log Q_\mathbf{d}^\infty = -H\left(Q^0\right) - \left\langle \log Q_\mathbf{d}^\infty \right\rangle_{Q^0}
$$

in the paper, together with the fact that the original data distribution $Q^0$ is independent of the model parameters $\theta_m$, so the partial derivative of the entropy of the data (denoted by $H(Q^0)$) w.r.t. the model parameters vanishes:

$$
\begin{aligned}
-\frac{\partial}{\partial \theta_m}\, Q^0 \| Q^\infty
&= \frac{\partial}{\partial \theta_m} \left( H\left(Q^0\right) + \left\langle \log Q_\mathbf{d}^\infty \right\rangle_{Q^0} \right)
 = \frac{\partial}{\partial \theta_m} \left\langle \log Q_\mathbf{d}^\infty \right\rangle_{Q^0} \\
&= \sum_\mathbf{d} Q_\mathbf{d}^0 \frac{\partial}{\partial \theta_m} \log Q_\mathbf{d}^\infty
 = \left\langle \frac{\partial \log Q_\mathbf{d}^\infty}{\partial \theta_m} \right\rangle_{Q^0}
\end{aligned}
$$

In the next step I can use equation (4) of the paper:

$$
\left\langle\frac{\partial \log Q_\mathbf{d}^\infty}{\partial \theta_m}\right\rangle_{Q^0} =\left\langle\frac{\partial \log p_{m}(\mathbf{d} | \theta_m)}{\partial \theta_m}\right\rangle_{Q^0}-\left\langle\frac{\partial \log p_{m}(\mathbf{c} | \theta_m)}{\partial \theta_m}\right\rangle_{Q^\infty}
$$
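Equation (4) can also be checked numerically. Below is a small sanity check I find helpful, on a toy discrete product of experts with $p_m(\mathbf{d}\,|\,\theta_m) = e^{\theta_m[\mathbf{d}]}$ (the expert form, the data distribution `Q0`, and names like `log_Qinf` are my own choices, not from the paper); a finite-difference gradient of $\left\langle \log Q^\infty_\mathbf{d} \right\rangle_{Q^0}$ matches the RHS of Eq. (4):

```python
import numpy as np

rng = np.random.default_rng(0)
K, M = 5, 3                         # 5 discrete states, 3 experts

# Toy experts: p_m(d | theta_m) = exp(theta_m[d]), an unnormalized categorical,
# so d log p_m(d) / d theta_m[k] = 1 if d == k else 0.
theta = rng.normal(size=(M, K))
Q0 = rng.dirichlet(np.ones(K))      # arbitrary data distribution Q^0

def log_Qinf(theta):
    """log Q^infinity: the normalized product-of-experts distribution."""
    s = theta.sum(axis=0)                    # sum_m log p_m(d | theta_m)
    return s - np.log(np.exp(s).sum())       # subtract the log partition function

def objective(theta):
    """<log Q^infinity_d>_{Q^0}, the quantity differentiated in Eq. (4)."""
    return Q0 @ log_Qinf(theta)

# RHS of Eq. (4): <d log p_m / d theta_m>_{Q^0} - <d log p_m / d theta_m>_{Q^infinity}.
# With one-hot expert gradients the two expectations reduce to Q^0 and Q^infinity.
Qinf = np.exp(log_Qinf(theta))
grad_eq4 = np.tile(Q0 - Qinf, (M, 1))

# LHS of Eq. (4) via central finite differences.
eps = 1e-6
grad_fd = np.zeros_like(theta)
for m in range(M):
    for k in range(K):
        tp, tm = theta.copy(), theta.copy()
        tp[m, k] += eps
        tm[m, k] -= eps
        grad_fd[m, k] = (objective(tp) - objective(tm)) / (2 * eps)

print(np.max(np.abs(grad_fd - grad_eq4)))    # tiny (finite-difference error): both sides agree
```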

If we now compare this result with the first two terms on the RHS of equation (5), only the second expectation differs:
$$
\left\langle\frac{\partial \log p_{m}(\mathbf{c} | \theta_m)}{\partial \theta_m}\right\rangle_{Q^\infty} \neq \left\langle\frac{\partial \log p_{m}(\hat{\mathbf{d}} | \theta_m)}{\partial \theta_m}\right\rangle_{Q^1}
$$

So is my interpretation right that we approximate the expectation over $Q^\infty$ with the expectation over $Q^1$? But in the sentence before equation (5) Hinton says:

"The mathematical motivation for the contrastive divergence is that the intractable expectation over $Q^\infty$ on the RHS of Eq. 4 cancels out."

However, in my derivation there is nothing to cancel out. What am I doing wrong here? Or is my starting point, equation (5), already incorrect?

I would be glad if anyone could help me understand the steps the author took to arrive at equation (5).

Best Answer

A few minor issues of notation:

  1. I've used $p$ instead of $Q$
  2. I have written the gradient rather than the negative gradient, and with respect to a generic parameter $\theta$ rather than $\theta_m$ for the $m^\text{th}$ expert.
  3. For shorthand I've used the energy, $E$, the negative log of the unnormalized distribution. Clearly, from Eq'n (1) in the paper, the derivative of the energy w.r.t. the $m^\text{th}$ expert's parameters is $\frac{\mathrm{d}E}{\mathrm{d}\theta_m} = -\frac{\mathrm{d}\log p_m}{\mathrm{d}\theta_m}$, as you have above; the identity this gives for $\partial \log p^\infty / \partial \theta$ is spelled out right after this list.
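For reference (my notation, not a line from the paper): writing $\log p^\infty(\mathbf{d}) = -E(\mathbf{d}) - \log Z$ with $Z = \sum_\mathbf{c} e^{-E(\mathbf{c})}$ and differentiating gives

$$
\frac{\partial \log p^\infty(\mathbf{d})}{\partial \theta}
= -\frac{\partial E(\mathbf{d})}{\partial \theta}
+ \left\langle \frac{\partial E}{\partial \theta} \right\rangle_{p^\infty},
$$

which is where the model-distribution averages in the derivation below come from.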

[Image in the original answer: the three-line derivation of the contrastive divergence gradient.]
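Since the image itself is not reproduced here, what follows is only a plausible reconstruction of the three lines it showed, assembled from Eq'ns (3) and (4) rewritten in the energy notation above; the exact grouping of terms is an assumption:

$$
\begin{aligned}
\frac{\partial}{\partial \theta}\left( p^0 \| p^\infty - p^1 \| p^\infty \right)
&= -\frac{\partial}{\partial \theta}\left\langle \log p^\infty \right\rangle_{p^0}
 + \left.\frac{\partial}{\partial \theta}\right|_{p^1\ \text{fixed}}\left\langle \log p^\infty \right\rangle_{p^1}
 - \frac{\partial p^1}{\partial \theta}\,\frac{\partial\, p^1 \| p^\infty}{\partial p^1} \\
&= \left\langle \frac{\partial E}{\partial \theta} \right\rangle_{p^0}
 - \left\langle \frac{\partial E}{\partial \theta} \right\rangle_{p^\infty}
 - \left\langle \frac{\partial E}{\partial \theta} \right\rangle_{p^1}
 + \left\langle \frac{\partial E}{\partial \theta} \right\rangle_{p^\infty}
 - \frac{\partial p^1}{\partial \theta}\,\frac{\partial\, p^1 \| p^\infty}{\partial p^1} \\
&= \left\langle \frac{\partial E}{\partial \theta} \right\rangle_{p^0}
 - \left\langle \frac{\partial E}{\partial \theta} \right\rangle_{p^1}
 - \frac{\partial p^1}{\partial \theta}\,\frac{\partial\, p^1 \| p^\infty}{\partial p^1}
\end{aligned}
$$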

Each term in the third line corresponds to a term in the second line. You can see that the average of the energy gradient under the model distribution ($p^\infty$) appears twice, but with opposite signs, and so does indeed cancel.
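To see how this gradient is used in practice, here is a minimal CD-1 sketch for a small binary RBM in NumPy. This is my own toy example (random data, arbitrary sizes), not code from the paper: the update uses only the first two terms of Eq. (5), with the data playing the role of $Q^0$ and the one-step reconstructions the role of $Q^1$; the third term is dropped, which the paper argues can safely be done.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Tiny binary RBM with energy E(v, h) = -v.W.h - b.v - c.h.
# The CD-1 update pairs <v_i h_j> under the data (Q^0) against
# <v_i h_j> under the one-step reconstructions (Q^1).
n_vis, n_hid = 6, 4
W = 0.01 * rng.normal(size=(n_vis, n_hid))
b = np.zeros(n_vis)
c = np.zeros(n_hid)

# Purely illustrative "data": random binary patterns standing in for samples from Q^0.
data = (rng.random((100, n_vis)) < 0.5).astype(float)

lr = 0.1
for epoch in range(50):
    v0 = data
    ph0 = sigmoid(v0 @ W + c)                      # p(h=1 | v) for the data
    h0 = (rng.random(ph0.shape) < ph0).astype(float)
    pv1 = sigmoid(h0 @ W.T + b)                    # one Gibbs step back to the visibles
    v1 = (rng.random(pv1.shape) < pv1).astype(float)
    ph1 = sigmoid(v1 @ W + c)                      # p(h=1 | v) for the reconstructions
    # First two terms of Eq. (5); the third term is ignored.
    W += lr * ((v0.T @ ph0) - (v1.T @ ph1)) / len(data)
    b += lr * (v0 - v1).mean(axis=0)
    c += lr * (ph0 - ph1).mean(axis=0)
```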
