Solved – Basics of Reparametrization trick in Machine Learning

gradient descent, machine learning

I am trying to understand the reparameterization trick (RPT) used in the calculation of stochastic backpropagation. There are already some excellent answers here and here.

Under the usual notation, we can write the RPT as $$\nabla_\theta \mathbb{E}_{p(x;\theta)}[f(x)] = \mathbb{E}_{p(\epsilon)}[\nabla_\theta f(g(\epsilon, \theta))]$$ where $x = g(\epsilon, \theta)$ and $\epsilon \sim p(\epsilon)$.

The overarching objective of this equation is to show that both gradients (the one with reparameterization and the one without) give rise to the same estimates.
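To make the "same estimates" part concrete to myself, here is a small numerical sketch (my own toy setup, not taken from the linked answers): $f(x) = x^2$ with $x \sim N(\mu,\sigma^2)$, where the gradient without reparameterization is computed with the usual score-function (log-derivative) estimator and the gradient with reparameterization uses $x = \mu + \sigma\epsilon$. Both should estimate $\nabla_\mu \mathbb{E}[x^2] = 2\mu$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (my own choice of numbers): f(x) = x^2, x ~ N(mu, sigma^2),
# so the true gradient is d/dmu E[x^2] = d/dmu (mu^2 + sigma^2) = 2*mu.
mu, sigma = 1.5, 0.8
n_samples = 200_000

# Without reparameterization: score-function / log-derivative estimator,
# grad = E[ f(x) * d/dmu log p(x; mu, sigma) ] = E[ x^2 * (x - mu) / sigma^2 ].
x = rng.normal(mu, sigma, n_samples)
score_grads = (x ** 2) * (x - mu) / sigma ** 2

# With reparameterization: x = mu + sigma * eps, eps ~ N(0, 1),
# grad = E[ d/dmu f(mu + sigma * eps) ] = E[ 2 * (mu + sigma * eps) ].
eps = rng.standard_normal(n_samples)
reparam_grads = 2 * (mu + sigma * eps)

print("true gradient        :", 2 * mu)
print("score-function est.  :", score_grads.mean())
print("reparameterized est. :", reparam_grads.mean())
```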

My question:

So, I am trying to get my hands dirty by calculating the gradients using the reparameterization trick.

Say, we observe a set of univariate random variables $x_1,\dots,x_n$. Some oracle has told us that these random variables come from a Normal distribution $N(\mu,\sigma^2)$.

This is obviously a silly problem, and we can get estimates of $\mu$ and $\sigma$ with many techniques such as MLE, MAP, and so on. But I just want to see how we would go about it using stochastic gradient descent and the reparameterization trick.
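A rough sketch of the kind of training loop I have in mind (using PyTorch for the backpropagation; the learning rate, number of steps, and data are arbitrary choices of mine):

```python
import torch

torch.manual_seed(0)

# Observed data, assumed to come from N(mu_true, sigma_true^2).
mu_true, sigma_true = 2.0, 1.5
x = mu_true + sigma_true * torch.randn(100)

# Parameters to learn; sigma is parameterized through its log to keep it positive.
mu = torch.tensor(0.0, requires_grad=True)
log_sigma = torch.tensor(0.0, requires_grad=True)
opt = torch.optim.SGD([mu, log_sigma], lr=1e-3)

for step in range(5000):
    sigma = torch.exp(log_sigma)
    # Reparameterization trick: sample the node's output as y = mu + sigma * eps,
    # so the randomness sits in eps and gradients flow through mu and sigma.
    eps = torch.randn(())
    y = mu + sigma * eps
    # The squared-error loss from this question: sum_i (y - x_i)^2.
    loss = ((y - x) ** 2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

# mu tends toward the sample mean; what sigma converges to here is exactly
# the part I am unsure about.
print(mu.item(), torch.exp(log_sigma).item())
```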

Now, let's say we have a node that spits out a random variable $y$ (meant to model the observed $x_1, \dots, x_n$). We cannot use simple backpropagation because the output of the node, $y$, is stochastic. Assume our loss function $f(y)$ is the squared error, i.e. $\sum_{i=1}^{n}(y-x_i)^2$. Using the LHS of the RPT equation,

\begin{align}\nabla_\theta \mathbb{E}_{p(x;\theta)}[f(x)] &= \nabla_\theta \int \sum_{i=1}^{n}(y-x_i)^2\, p(y)\, dy \\
&= \nabla_\theta \int \sum_{i=1}^{n}(y^2 + x_i^2 - 2yx_i)\, p(y)\, dy \\
&= \nabla_\theta \sum_{i=1}^{n}(\mu^2 + \sigma^2 + x_i^2 - 2\mu x_i) \end{align}

Now, if we take the gradients with respect to $\mu$ and $\sigma^2$ separately and set them to zero, we get the update $\mu = \frac{\sum_{i=1}^{n}x_i}{n}$, but I am unable to get an estimate of $\sigma^2$ this way. Theoretically, I would expect the estimate $\sigma^2 = \frac{\sum_{i=1}^{n}x_i^2}{n} - \mu^2$.
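Writing out the two gradients of that last line explicitly, to show where I get stuck:

\begin{align}
\nabla_\mu \sum_{i=1}^{n}\left(\mu^2+\sigma^2+x_i^2-2\mu x_i\right) &= 2n\mu - 2\sum_{i=1}^{n}x_i, \\
\nabla_{\sigma^2} \sum_{i=1}^{n}\left(\mu^2+\sigma^2+x_i^2-2\mu x_i\right) &= n.
\end{align}

The first one vanishes at $\mu = \frac{\sum_{i=1}^{n}x_i}{n}$, but the second one is a constant and never vanishes, which is as far as I can get on this side.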

Now I am trying to get the same result using the RHS of the RPT equation,

\begin{align}\mathbb{E}_{p(\epsilon)}[\nabla_\theta f(g(\epsilon, \theta))] &= \int \nabla_\theta \sum_{i=1}^{n}(\mu+\sigma \epsilon-x_i)^2 p(\epsilon) d\epsilon \\
&= \int \sum_{i=1}^{n} (2(\mu+\sigma \epsilon-x_i) \nabla_\theta (\mu+\sigma \epsilon-x_i)) p(\epsilon) d\epsilon \\
\end{align}

Again, if we take the gradient with respect to $\mu$ specifically, we will have
\begin{align}
\mathbb{E}_{p(\epsilon)}[\nabla_\mu f(g(\epsilon, \mu,\sigma))] = \int \sum_{i=1}^{n} 2(\mu+\sigma \epsilon-x_i)\, p(\epsilon)\, d\epsilon
\end{align}

Setting the above equation to zero and using $\int \epsilon\, p(\epsilon)\, d\epsilon = 0$, we can see that $$\mu = \frac{\sum_{i=1}^{n}x_i}{n}$$

This is the same as the estimate obtained from the LHS of the RPT equation.
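Just to convince myself about the $\mu$ part, a quick numerical check (with arbitrary data and parameter values of my own choosing) that the Monte-Carlo estimate of the RHS integral matches the closed-form LHS gradient $2n\mu - 2\sum_{i=1}^{n}x_i$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary data and current parameter values, purely for a numerical check.
x = rng.normal(2.0, 1.5, size=20)
mu, sigma = 0.5, 1.2
n = len(x)

# RHS mu-gradient: E_eps[ sum_i 2*(mu + sigma*eps - x_i) ], estimated by sampling eps.
eps = rng.standard_normal(100_000)
mc_grad_mu = (2 * (mu + sigma * eps[:, None] - x[None, :])).sum(axis=1).mean()

# LHS mu-gradient from the closed-form expectation: 2*n*mu - 2*sum_i x_i.
analytic_grad_mu = 2 * n * mu - 2 * x.sum()

print(mc_grad_mu, analytic_grad_mu)  # the two numbers should agree closely
```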

However, when I try to show the same for the $\sigma^2$ estimate, i.e. that the LHS and the RHS of the RPT equation give the same estimate, it does not work out.

Can you please help me show the estimate for $\sigma^2$ using the reparameterization trick for this simple Normal distribution case?

Edit:

Also, can we show from this simple case that the variance of the gradient estimate obtained using the RPT is lower than that of the estimate obtained without it?
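(A sketch of how one might at least check this empirically, reusing the toy $f(x)=x^2$ setup from the sketch near the top of the question; I have not worked out the comparison analytically for the squared-error loss above.)

```python
import numpy as np

rng = np.random.default_rng(2)

mu, sigma = 1.5, 0.8
n_samples = 200_000

# Per-sample estimates of d/dmu E[x^2] under both schemes, same toy setup as before.
x = rng.normal(mu, sigma, n_samples)
score_grads = (x ** 2) * (x - mu) / sigma ** 2   # without reparameterization
eps = rng.standard_normal(n_samples)
reparam_grads = 2 * (mu + sigma * eps)           # with reparameterization

print("empirical variance without RPT:", score_grads.var())
print("empirical variance with RPT   :", reparam_grads.var())
```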

Best Answer

To my knowledge, the reparameterization trick is used to make the Monte-Carlo estimate of the ELBO have lower variance. In your formulation you seem to be using it in a different setting. Also, you shouldn't worry about doing the calculation by hand, given that very few machine learning problems can be solved analytically. Have you checked a VAE's code? $\mu$ and $\sigma$ are learnt by a neural network.
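Roughly, the relevant piece of a typical PyTorch VAE implementation looks like the following (a generic sketch, not from any particular repository):

```python
import torch

def reparameterize(mu, logvar):
    """Sample z ~ N(mu, sigma^2) as mu + sigma * eps so gradients reach mu and logvar."""
    std = torch.exp(0.5 * logvar)   # sigma, from the log-variance the encoder outputs
    eps = torch.randn_like(std)     # noise that does not depend on the parameters
    return mu + eps * std

# mu and logvar would come from the encoder network; placeholder tensors here.
mu = torch.zeros(4, 8, requires_grad=True)
logvar = torch.zeros(4, 8, requires_grad=True)
z = reparameterize(mu, logvar)
```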

For more details about VAEs, you can check my blog post.
