Joint Distribution – Is the Outer Product of the Marginal Distributions the Best Mean-Field Approximation of a Joint Distribution?

joint-distribution, marginal-distribution, stochastic-gradient-descent, variational-bayes

I am certain this has been asked somewhere else; if that's the case, link me and close the thread.

I am studying variational inference and mean-field approximation. All the explanations I come across deal with Gaussians and continuous distributions, which are too much for me to handle right now. I want to understand the simplest case of discrete distributions first, and I'm unable to find a resource for it online, so here's where I need help.

The setup is as follows. We are trying to approximate a joint with two factors, $P(X,Y) \approx Q_1(X)Q_2(Y)$. One way to do so (with "modern" technology, i.e. PyTorch) is to set up a loss function $KL(Q_1(X)Q_2(Y) ~||~ P(X,Y))$ and optimize it under the constraint that $Q_1$ and $Q_2$ are valid probability distributions.
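To make this concrete, here is roughly what I have in mind in PyTorch (a minimal sketch; the 3×4 table for $P$ is made up, and softmax over unconstrained logits is just one way of keeping $Q_1$ and $Q_2$ valid distributions):

```python
import torch

# Made-up small discrete joint: a 3x4 table of probabilities summing to 1.
P = torch.tensor([[0.10, 0.05, 0.05, 0.05],
                  [0.05, 0.20, 0.05, 0.05],
                  [0.05, 0.05, 0.20, 0.10]])

# Unconstrained logits; softmax keeps Q1 and Q2 valid probability distributions.
logits_x = torch.zeros(P.shape[0], requires_grad=True)
logits_y = torch.zeros(P.shape[1], requires_grad=True)

opt = torch.optim.Adam([logits_x, logits_y], lr=0.05)
for step in range(2000):
    Q1 = torch.softmax(logits_x, dim=0)
    Q2 = torch.softmax(logits_y, dim=0)
    Q = Q1[:, None] * Q2[None, :]                       # outer product Q1(x) Q2(y)
    kl = torch.sum(Q * (torch.log(Q) - torch.log(P)))   # KL(Q1 Q2 || P)
    opt.zero_grad()
    kl.backward()
    opt.step()

print(torch.softmax(logits_x, dim=0))  # compare with the marginal P.sum(dim=1)
print(torch.softmax(logits_y, dim=0))  # compare with the marginal P.sum(dim=0)
```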

Here are my questions:

Question 1

Is the optimal solution for $Q_1$ and $Q_2$ simply the respective marginals of $P(X,Y)$, i.e. $P(X)$ and $P(Y)$? Is this still the case if we were to use a different loss function that is not KL (e.g. L2, Wasserstein)? I have a feeling there are some quirks about KL that could make this a bit more pathological.

Question 2

If we can only sample batches from the joint distribution $P(X,Y)$, how might we learn $Q_1$ and $Q_2$ in an SGD style? Should we optimize $Q_1$ and $Q_2$ together, or, assuming the answer to question 1 is yes, sample a bunch of data points from $P(X,Y)$, form the empirical marginal distributions $\hat{P}(X)$ and $\hat{P}(Y)$, and fit $Q_1$ and $Q_2$ to them separately?
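For reference, by the second option I mean something like this (a sketch, assuming $X$ and $Y$ are discrete with known numbers of states; the state counts here are placeholders):

```python
import torch

n_x, n_y = 3, 4                      # assumed numbers of discrete states
counts_x = torch.zeros(n_x)
counts_y = torch.zeros(n_y)

def update_with_batch(batch):
    """batch: LongTensor of shape (B, 2) holding sampled (x, y) pairs from P(X, Y)."""
    counts_x.add_(torch.bincount(batch[:, 0], minlength=n_x).float())
    counts_y.add_(torch.bincount(batch[:, 1], minlength=n_y).float())

# After streaming many batches, normalise the counts into empirical marginals:
# P_hat_x = counts_x / counts_x.sum()
# P_hat_y = counts_y / counts_y.sum()
```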

Thanks a ton!

Best Answer

This is really two questions, and I'm only going to address the first one: No.

There's a nice illustration here from Eric Jang for a simpler case: approximating a Gaussian mixture by a single Gaussian, and, yes, it is related to the special properties of the KL divergence. In particular, $KL(Q_xQ_y||P)$ is infinite if there is any point where $Q_xQ_y$ places mass but $P$ places no mass.
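In the discrete case this is easy to read off directly from the definition of the divergence:
$$KL(Q_xQ_y \,\|\, P) = \sum_{x,y} Q_x(x)\,Q_y(y)\,\log\frac{Q_x(x)\,Q_y(y)}{P(x,y)},$$
so any cell with $P(x,y)=0$ but $Q_x(x)Q_y(y)>0$ contributes an infinite term to the sum.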

Using $KL(Q_xQ_y||P)$ as your objective function means the top priority for optimisation is having the density $Q_xQ_y$ be zero whenever the density $P$ is zero, and more generally having $Q_xQ_y$ small whenever $P$ is small. Now, consider real $X$, $Y$ where $P$ is uniform on a ring centered at the origin, a sort of thickened unit circle. The marginal distributions are U-shaped, but with non-negligible probability near zero. The product of the marginals has density everywhere on a square, higher at the edges and lower in the middle, but non-zero everywhere. So, near the origin, $Q_xQ_y$ is non-zero and $P$ is zero and $KL(Q_xQ_y||P)$ is infinite.
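Here is a quick numerical check of that picture on a discretised grid (a sketch; the grid resolution and the ring radii 0.9–1.1 are arbitrary choices):

```python
import torch

# Discretise the square [-1.5, 1.5]^2 and put uniform mass on the ring 0.9 <= r <= 1.1.
grid = torch.linspace(-1.5, 1.5, 301)
X, Y = torch.meshgrid(grid, grid, indexing="ij")
r = (X**2 + Y**2).sqrt()
P = ((r >= 0.9) & (r <= 1.1)).float()
P /= P.sum()

# Marginals and their outer product.
Px = P.sum(dim=1)                     # U-shaped, but non-zero near x = 0
Py = P.sum(dim=0)
Q = Px[:, None] * Py[None, :]         # product of marginals, non-zero on the whole square

# Q places mass on cells where P is exactly zero (e.g. near the origin),
# so KL(Q || P) contains terms of the form Q * log(Q / 0) and is infinite.
bad = (Q > 0) & (P == 0)
print(bad.any().item())               # True
print(Q[bad].sum().item())            # total Q-mass sitting where P has none
```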

In this example even the optimal product distribution will be a terrible approximation, but it is possible to construct products that have finite $KL(Q_xQ_y||P)$, which is better than infinite.

You might object that you are interested in distributions that have non-zero density everywhere, but that just complicates the analysis; there will still be examples where $KL(Q_xQ_y||P)$ for the product of marginals is finite but very large.
