Derivation of coordinate ascent variational inference


The slides on variational inference show the evidence lower bound ($L$) and its derivative with respect to a variational distribution $q(z_k)$, quoted as follows:

$$
L_k = \int q(z_k) E_{-k} \bigg[ \log p(z_k|z_{-k},x) \bigg] dz_k - \int q(z_k) \log q(z_k) dz_k \tag{22}
$$

Equation (22) focuses on the $k$th latent variable $z_k$ and treats the other latent variables $z_{-k}$ as constants, so that from (22) the following derivative can be taken and set to zero:

$$
\frac{dL}{dq(z_k)} = E_{-k} \bigg[ \log p(z_k|z_{-k},x) \bigg] - \log q(z_k) - 1 = 0 \tag{23}
$$

Then, by a simple rearrangement, we obtain the coordinate ascent update rule for $q(z_k)$:

$$
q^*(z_k) \propto \exp \bigg\{ E_{-k} \bigg[ \log p(z_k|z_{-k},x) \bigg] \bigg\} \tag{24}
$$

My question: I don't understand how (23) is obtained.

As I understand (22), the integral is over $z_k$ (i.e. $\int_{z_k}$), not over $q(z_k)$ (i.e. $\int_{q(z_k)}$). If so, shouldn't the derivative be $\frac{dL}{dz_k}$? How does one derive $\frac{dL}{dq(z_k)}$?

Best Answer

I'll start with a quick derivation of the Euler-Lagrange formula, then I'll show how you can use it, then I'll show how it applies to your problem.

Background

Equation 22 is a functional -- a function that takes a function as input and outputs a real number.

We want to find the function $q_k(z_k)$ such that the functional in equation 22 is maximized. In other words, we'd like to pick a function $q_k(z_k)$ such that the left hand side, $L_k$, of equation 22 is as large as possible. You know how to do this if I give you a function $f(x)$ of some variable x and ask you to maximize it; you start by looking for stationary points -- you differentiate $f(x)$ w/r/t x, set equal to zero and solve for x to find the stationary points. The difference here is that instead of trying to find the values of the input variable which are stationary points for a function, you are trying to find the input functions which are stationary points of the functional.

Let's assume we're dealing with a normed linear space of functions defined on some domain [a,b] in x, i.e. a set of functions $V$ that satisfy some basic criteria ($f + g = g + f \hspace{0.4cm} \forall \hspace{0.2cm}f,g \in V$, there is a zero element 0 such that $0 + f = f$, for each $f$ in $V$ there is some element $g$ such that $f + g = 0$, etc.) and has an associated norm $\Vert f\Vert$ that yields the "length" of element $f$ from the set of functions. This enables us to quantify the "distance" between two functions f and g in this function space as $\Vert f-g\Vert$. One example of a possible norm on a space of functions of a single variable on a domain [a,b] might be:

$$\int_a^b|f(x)|dx$$

Let's further assume all the functions in this space are well-behaved on the domain of interest -- both continuous and twice-differentiable. We can define an operation, called the Gateaux derivative, on functionals that accept such functions as input; it is analogous to the directional derivative in 3-space. In other words, let's say we have a functional $F[f(x)]$ where $f(x)$ is from the set $V$, and we add to it some function $h(x)$, also from $V$, multiplied by a small constant $\epsilon$. Let's stipulate that $h(x)$ is a function that is zero at both ends of the domain on which the functions in $V$ are defined, i.e. $h(a) = h(b) = 0$. The Gateaux derivative is then:

$$\delta F[f(x), h(x)] =\lim_{\epsilon\to0}\frac{F[f(x) + \epsilon h(x)] - F[f(x)]}{\epsilon} = \frac{d}{d\epsilon}F[f(x) + \epsilon h(x)] |_{\epsilon=0}$$
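
If it helps to make this concrete, here's a minimal numerical sketch (the functional $F$, the point $f$, and the direction $h$ are my own illustrative choices, not from the post): for $F[f] = \int_0^\pi f(x)^2 dx$, the Gateaux derivative works out analytically to $2\int_0^\pi f(x)h(x)dx$, and a finite-difference approximation in $\epsilon$ agrees.

```python
import numpy as np

# Sketch: F[f] = int_0^pi f(x)^2 dx, whose Gateaux derivative
# in the direction h is 2 * int_0^pi f(x) h(x) dx.
a, b = 0.0, np.pi
x = np.linspace(a, b, 200_001)
dx = x[1] - x[0]

def integrate(y):
    # composite trapezoidal rule on the grid x
    return float(np.sum((y[:-1] + y[1:]) * 0.5) * dx)

f = np.sin(x)        # the function we perturb around
h = x * (b - x)      # a direction with h(a) = h(b) = 0, as stipulated

def F(g):
    return integrate(g**2)

eps = 1e-6
numeric = (F(f + eps * h) - F(f)) / eps   # finite-difference approximation
analytic = 2.0 * integrate(f * h)         # closed-form Gateaux derivative

print(numeric, analytic)                  # both approximately 8.0
```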

It can be shown that for a function $g(x)$ to be a stationary point of the functional $F[f(x)]$ (i.e. a possible maximum or minimum), a necessary condition is that $\delta F[g(x), h(x)]$ must be zero for all $h(x)$ that meet our criteria.

Now say you have a functional $F[f(x)]$ that takes function $f(x)$ as input and outputs a real number, where $f(x)$ is an element from this set of functions we're working with. Let's start from a function $f_0$ and add to it $\epsilon h$, where $h$ is another function in the set and $\epsilon$ is a constant small enough that $f_0 + \epsilon h$ is still in the set $V$. Let's say that $F$ is of the form:

$$F[f_0(x) + \epsilon h(x)] = \int_a^bL(x, f_0(x) + \epsilon h(x), f_0'(x) + \epsilon h'(x))dx$$

where $f_0'(x)$ and $h'(x)$ are the derivatives of $f_0(x)$ and $h(x)$. In other words, we're working with a functional that is an integral on [a,b] of an expression $L$ that contains $x$, $f_0(x)$ and $f_0'(x)$, and we're asking what happens if we add another function in the set multiplied by a small constant $\epsilon$ to the starting input $f_0$. You can probably see a few analogies to differentiating a function at a point $x_0$. For simplicity, I'll now start writing $f_0$ and $h$ in place of $f_0(x)$ and $h(x)$.

Now let's differentiate both sides of our functional w/r/t the constant epsilon. (We can take the derivative inside the integral because we already stipulated all our functions are well-behaved.)

$$\frac{d}{d\epsilon}F[f_0 + \epsilon h] = \int_a^b \frac{\partial}{\partial \epsilon} L(x, f_0 + \epsilon h, f_0' + \epsilon h')\, dx$$

$$= \int_a^b \frac{\partial}{\partial f_0} L(x, f_0 + \epsilon h, f_0' + \epsilon h')\, h + \frac{\partial}{\partial f_0'} L(x, f_0 + \epsilon h, f_0' + \epsilon h')\, h' \,dx$$

We said that $\delta F[f(x), h(x)] = \frac{d}{d\epsilon}F[f(x) + \epsilon h(x)] |_{\epsilon=0}$; we just found $\frac{d}{d\epsilon}F[f(x) + \epsilon h(x)]$, so:

$$\delta F[f(x), h(x)] = \frac{d}{d\epsilon}F[f(x) + \epsilon h(x)] |_{\epsilon=0} = \int_a^b \frac{\partial}{\partial f_0} L(x, f_0, f_0') h + \frac{\partial}{\partial f_0'} L(x, f_0, f_0') h' dx$$

We want to find where this is equal to zero. So do integration by parts on the right hand side; because we stipulated that $h(a) = h(b) = 0$, this is easily seen to be:

$$\int_a^b \bigg[ \frac{\partial}{\partial f_0} L(x, f_0, f_0') - \frac{d}{dx}\frac{\partial}{\partial f_0'} L(x, f_0, f_0') \bigg] h \, dx$$

The fundamental lemma of the calculus of variations (I won't prove this here) says that if $\int_a^b f(x) h(x) dx = 0$ for all twice differentiable, continuous $h(x)$ such that $h(a) = h(b) = 0$, it must be true that $f(x) = 0$ for all x in the domain [a,b] of interest. This gives us the Euler-Lagrange formula:

$$\frac{\partial}{\partial f(x)}L(x, f(x), f'(x)) - \frac{d}{dx}\frac{\partial}{\partial f'(x)}L(x, f(x), f'(x)) = 0$$

To recap, we now know that if we want to find stationary points of a functional of the form:

$$F[f(x)] = \int_a^b L(x, f(x), f'(x))dx$$

and the functions we're considering meet some basic criteria, these stationary points (maxima and minima) must satisfy Euler-Lagrange. You want to maximize equation 22, the criteria we named all apply, and equation 22 is of this form, therefore you'll use Euler-Lagrange to find functions that are stationary points (and therefore possible maxima of your lower bound). Now let's look at a couple examples of how to use this.

Example

To use Euler-Lagrange, we're going to take the expression inside the integral in your functional, take the derivative of it with respect to $f(x)$ and $f'(x)$, plug these into Euler-Lagrange and solve the resulting differential equation. You'll take the derivative w/r/t $f(x)$ in the same way that you would if $f(x)$ were a variable $a$ instead of a function. So for example, if your functional is:

$$\int_a^b f(x)^2 + xf'(x)dx$$

then $L(x, f(x), f'(x)) = f(x)^2 + xf'(x)$, and $\frac{\partial L}{\partial f(x)} = 2f(x)$ while $\frac{\partial L}{\partial f'(x)} = x$.
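
If you have sympy handy, you can verify this example symbolically with its built-in Euler-Lagrange helper (a sketch; the library is my choice and isn't part of the derivation above):

```python
import sympy as sp
from sympy.calculus.euler import euler_equations

x = sp.symbols('x')
f = sp.Function('f')

# L(x, f, f') = f(x)^2 + x f'(x)
L = f(x)**2 + x * f(x).diff(x)

# Applies dL/df - d/dx(dL/df') = 0: here 2 f(x) - 1 = 0
print(euler_equations(L, f(x), x))   # [Eq(2*f(x) - 1, 0)]
```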

A common example is the task of finding the shortest path between two points in 2-space. We can use the arc length integral:

$$\int_a^b \sqrt{1 + f'(x)^2}dx$$

In which case $L(x, f(x), f'(x)) = \sqrt{1 + f'(x)^2}$. The derivative w/r/t $f(x)$ is 0. You can easily see that the derivative w/r/t $f'(x)$ is $\frac{f'(x)}{\sqrt{1 + f'(x)^2}}$ (just mentally pretend that $f'(x)$ is some variable $a$ and take the derivative w/r/t $a$). Plug this into Euler-Lagrange and you get:

$$\frac{d}{dx}\frac{f'(x)}{\sqrt{1 + f'(x)^2}} = 0$$

This differential equation clearly yields that:

$$\frac{f'(x)}{\sqrt{1 + f'(x)^2}} = C$$

Solve for $f'(x)$ and you'll see that $f'(x) = C_2$ for some constant $C_2$. Integrate w/r/t x and you'll see that $f(x)=C_2x + C_3$ -- you can apply the boundary conditions to find the constants. Stationary points can be maxima or minima; in this case there is only one stationary point and we can show it is a minimum, thus establishing that the shortest path between two points is a straight line. (You knew this anyway, but this is another way to get to the same outcome.)
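
Here's a quick numerical check of that conclusion (a sketch; the endpoints and the perturbation are arbitrary choices): the straight line's arc length comes out below that of the perturbed paths.

```python
import numpy as np

# Arc length functional: int_a^b sqrt(1 + f'(x)^2) dx
x = np.linspace(0.0, 1.0, 100_001)
dx = x[1] - x[0]

def arc_length(y):
    dydx = np.gradient(y, dx)          # numerical derivative f'(x)
    integrand = np.sqrt(1.0 + dydx**2)
    return float(np.sum((integrand[:-1] + integrand[1:]) * 0.5) * dx)

straight = 2.0 * x                     # straight line from (0, 0) to (1, 2)
bump = np.sin(np.pi * x)               # perturbation with h(a) = h(b) = 0

print(arc_length(straight))                # ~ sqrt(5) = 2.2360..., the minimum
print(arc_length(straight + 0.1 * bump))   # larger
print(arc_length(straight - 0.1 * bump))   # larger again
```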

Equation 22

For your equation 22, you have a functional of $q_k(z_k)$ and would like to maximize it. The formula inside the integral, the $L$ expression or Lagrangian, is:

$$q_k(z_k)E_{-k}[\log p(z_k | z_{-k},x)] - q_k(z_k) \log q_k(z_k)$$

In this case the function that is input to the functional is $q_k(z_k)$. So we'll take the derivative of $L$ w/r/t $q_k(z_k)$ and $q_k'(z_k)$ and plug these into Euler-Lagrange. The derivative w/r/t $q_k'(z_k)$ is zero. The derivative w/r/t $q_k(z_k)$ (again, just mentally treat $q_k(z_k)$ as a variable $a$ and take the derivative w/r/t it) is clearly:

$$E_{-k}[\log(p(z_k | z_{-k},x))] - \log q_k(z_k) - 1$$
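
As a one-line sanity check, you can confirm this derivative with sympy, treating $q_k(z_k)$ as a scalar variable $q$ and the (constant) expectation as $m$ (both names are mine):

```python
import sympy as sp

q, m = sp.symbols('q m', positive=True)
# Lagrangian with q in place of q_k(z_k) and m in place of E_{-k}[log p]
print(sp.diff(q*m - q*sp.log(q), q))   # m - log(q) - 1
```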

Plug this into Euler-Lagrange and we have that the stationary point of your functional (the only candidate for a maximum of your lower bound, and hence for a minimum of the KL divergence from the true posterior) is found when:

$$E_{-k}[\log(p(z_k | z_{-k},x))] - \log q_k(z_k) - 1 = 0$$
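
Rearranging gives $\log q_k(z_k) = E_{-k}[\log(p(z_k | z_{-k},x))] - 1$; exponentiating both sides and absorbing the constant $e^{-1}$ into the normalization recovers equation 24:

$$q_k^*(z_k) = e^{-1}\exp\Big\{E_{-k}[\log(p(z_k | z_{-k},x))]\Big\} \propto \exp\Big\{E_{-k}[\log(p(z_k | z_{-k},x))]\Big\}$$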

and there you have it!
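
If you'd like to convince yourself numerically, here's a minimal sketch (the discrete support, the vector m, and every name in it are made up for illustration). Restricting $z_k$ to a few discrete values turns the integrals in equation 22 into sums, and maximizing directly over the probability simplex reproduces the closed form of equation 24.

```python
import numpy as np
from scipy.optimize import minimize

# m_i stands in for E_{-k}[log p(z_k = i | z_{-k}, x)] on a 5-point support,
# so equation 22 becomes L_k(q) = sum_i q_i m_i - sum_i q_i log q_i.
rng = np.random.default_rng(0)
m = rng.normal(size=5)

def neg_L_k(q):
    return -(q @ m - q @ np.log(q))

# Maximize L_k directly subject to q_i >= 0 and sum_i q_i = 1...
res = minimize(neg_L_k, np.full(5, 0.2),
               bounds=[(1e-10, 1.0)] * 5,
               constraints=[{'type': 'eq', 'fun': lambda q: q.sum() - 1.0}])

# ...and compare with equation 24: q* proportional to exp(m).
q_star = np.exp(m) / np.exp(m).sum()

print(np.round(res.x, 4))     # numerical maximizer of the functional
print(np.round(q_star, 4))    # agrees to optimizer precision
```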

You may find chapter 4 of Logan's Applied Mathematics helpful as well; it provides a similar but more detailed overview.
