Solved – Intuition for why the (log) partition function matters

Tags: expected-value, exponential-family, intuition

I'm on a quest for the intuition behind the fact that theoretical introductions to approximate inference focus so much on the log partition function. Say we have a regular exponential family $$p(\mathbf{x};\boldsymbol{\theta}) = \exp\left(\boldsymbol{\phi}(\mathbf{x})^\top\boldsymbol{\theta} - \log Z(\boldsymbol{\theta})\right)$$ with sufficient statistics $\boldsymbol{\phi}(\mathbf{x})$, natural parameters $\boldsymbol{\theta}$, and partition function $Z(\boldsymbol{\theta})$. The partition function is of course defined by $$Z(\boldsymbol{\theta}) = \int\exp\left(\boldsymbol{\phi}(\mathbf{x})^\top\boldsymbol{\theta}\right){\rm d}\mathbf{x} \quad .$$

EDIT: to clarify, the kind of problem I have in mind is where $\mathbf{x}$ is a latent variable in a graphical model with conditional exponential family distributions, as is the focus of Wainwright & Jordan (2008), for instance. Finding an optimal $\boldsymbol{\theta}$ may be a variational inference problem. Conditioned on some data, another common problem would be drawing posterior samples of $\mathbf{x}$.

In my experience, textbooks and tutorials on approximate inference often make claims like "inference is hard because computing the (log) partition function is hard." I don't doubt that computing the log partition function is hard, but I do fail to see why that is "the" barrier to inference.

First, let me explain where I am coming from. I have a decent grasp of the following:

  1. We need the partition function to compute expected values. If we only know the unnormalized distribution $p^*(\mathbf{x};\boldsymbol{\theta}) = \exp\left(\boldsymbol{\phi}(\mathbf{x})^\top\boldsymbol{\theta}\right)=p(\mathbf{x};\boldsymbol{\theta})Z(\boldsymbol{\theta})$, then we also only know $\mathbb{E}[f(\mathbf{x})]$ up to scaling by $Z(\boldsymbol{\theta})$.
  2. Exact inference is #P-Hard in the worst case.
  3. If we have the gradient of the log partition function, then we have the mapping between natural parameters and mean parameters, $$\nabla_\boldsymbol{\theta} \log Z(\boldsymbol{\theta}) = \mathbb{E}\left[\boldsymbol{\phi}(\mathbf{x})\right]\equiv\boldsymbol{\mu} \quad ,$$ and knowing the mean parameters $\boldsymbol{\mu}$ can aid in other stages of inference or in computing expected values in some circumstances (e.g. if $f$ lies in the span of $\boldsymbol{\phi}$, then $\mathbb{E}[f(\mathbf{x})]$ is linear in $\boldsymbol{\mu}$).
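
To make (3) concrete for myself, here is a tiny numerical sanity check in R (my own toy example: a single Bernoulli variable in natural parameterization, so $\phi(x)=x$, $Z(\theta)=1+e^{\theta}$, and hence $\nabla_\theta \log Z(\theta) = e^{\theta}/(1+e^{\theta})$):

# Toy sanity check of grad log Z(theta) = E[phi(x)] for a Bernoulli variable
# in natural parameterization: phi(x) = x, Z(theta) = 1 + exp(theta).
theta <- 0.7

logZ <- function(th) log(1 + exp(th))

# numerical gradient of log Z at theta
eps <- 1e-6
grad_logZ <- (logZ(theta + eps) - logZ(theta - eps)) / (2 * eps)

# mean parameter computed directly: E[phi(x)] = P(x = 1) = exp(theta) / Z(theta)
mu <- exp(theta) / (1 + exp(theta))

c(grad_logZ = grad_logZ, mu = mu)   # both are approximately 0.668

The numerical gradient of $\log Z$ and the directly computed mean parameter coincide, which is just the identity above at work.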

All that being said, I still don't get why computing $\log Z$ is "the" hard problem in inference.

Consider this thought experiment: imagine you are given an oracle that computes $Z(\boldsymbol{\theta})$ efficiently. What can you now do that you could not do before? Take bullet (1) above – can you now compute expected values more easily? It seems to me that there remains a difficult problem, namely computing a high-dimensional integral over $\mathbf{x}$. In fact, much of the space may have negligible probability mass. Personally, I would rather have an oracle that tells me which regions of $\mathbf{x}$-space to look in — solve the search problem for me, e.g. by providing a set of samples of $\mathbf{x}$ from the posterior or something close to it. Digging into this notion of “search” a little deeper, note that this is how Self-Normalized Importance Sampling (SNIS) works: you draw samples from a proposal distribution that is essentially a guess about where $\mathbf{x}$ has non-negligible mass, then plug in an estimate of $Z(\boldsymbol{\theta})$ based on those samples, namely $$\hat{Z}(\boldsymbol{\theta}) = \frac{1}{S}\sum_{i=1}^S \frac{p^*(\mathbf{x}^{(i)};\boldsymbol{\theta})}{q(\mathbf{x}^{(i)})} \qquad \mathbf{x}^{(i)}\sim q(\mathbf{x})\quad.$$ The hard problem in SNIS is constructing a good proposal distribution $q$; once you have one, you get $Z(\boldsymbol{\theta})$ "for free."
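
Just to make that last point concrete, here is a minimal R sketch of the idea (entirely my own toy example: an unnormalized Gaussian target $p^*(x)=e^{-x^2/2}$ whose true $Z=\sqrt{2\pi}$ is known, and an arbitrary $N(0,2^2)$ proposal). Once you have a reasonable $q$ and samples from it, estimating $Z$ is a one-liner:

# Toy sketch: given samples from a decent proposal q, Z comes "for free"
# via importance weights. Unnormalized target: p*(x) = exp(-x^2/2), true Z = sqrt(2*pi).
p_star <- function(x) exp(-x^2 / 2)

S <- 1e5
x <- rnorm(S, mean = 0, sd = 2)              # samples from proposal q = N(0, 2^2)
w <- p_star(x) / dnorm(x, mean = 0, sd = 2)  # importance weights p*(x)/q(x)

Z_hat <- mean(w)
c(Z_hat = Z_hat, Z_true = sqrt(2 * pi))      # the estimate is close to sqrt(2*pi) ~ 2.507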

One way to find the relevant regions of $\mathbf{x}$ would be to find the mode(s) of $p$. This means solving $$\nabla_\mathbf{x} \log p(\mathbf{x};\boldsymbol{\theta}) = \boldsymbol{\theta}^\top\nabla_\mathbf{x}\boldsymbol{\phi}(\mathbf{x}) = \mathbf{0} $$ (some abuse of notation here… you get the idea). But the difficulty of this depends on $\boldsymbol{\phi}$; the partition function is not involved.
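
As a small illustration of that point (again a toy sketch of my own: a diagonal Gaussian written in exponential-family form with $\boldsymbol{\phi}(\mathbf{x}) = (x_1, x_2, x_1^2, x_2^2)$), a generic optimizer can find the mode from the unnormalized log density alone, and $Z(\boldsymbol{\theta})$ never enters:

# Toy sketch: mode-finding needs only the unnormalized log density.
# phi(x) = (x1, x2, x1^2, x2^2) with natural parameters theta
# (a diagonal Gaussian in exponential-family form); Z never appears.
theta <- c(2, -1, -0.5, -0.25)

log_p_star <- function(x) {
  phi <- c(x[1], x[2], x[1]^2, x[2]^2)
  sum(phi * theta)
}

fit <- optim(par = c(0, 0), fn = log_p_star,
             method = "BFGS", control = list(fnscale = -1))  # fnscale = -1: maximize
fit$par   # close to the analytic mode (-theta[1]/(2*theta[3]), -theta[2]/(2*theta[4])) = (2, -2)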

To summarize, I see inference as having two core problems: (a) a search problem for the relevant region of $\mathbf{x}$ (high-probability regions, modes, etc.), and (b) a normalization problem of computing (log) $Z(\boldsymbol{\theta})$. I am puzzled why the latter (b) receives so much attention, especially since solving (a) can give (b) for free, but not the other way around as far as I can tell. So, what is the intuition behind the emphasis on the log partition function?

Best Answer

“This is how Self-Normalized Importance Sampling (SNIS) works: you draw samples from a proposal distribution that is essentially a guess about where $\mathbf{x}$ has non-negligible mass …”

This shows how the lack of knowledge about $\log Z$ can be worked around.

But it does not mean that lack of knowledge of $\log Z$ is not a problem.

In fact, the SNIS method shows that not knowing $\log Z$ is a problem: we need to use a trick in order to get around it, and if we knew $\log Z$, our sampling method would perform better.

Example

See, for instance, the example below, where we have a Beta-distributed variable

$$f_X(x) \propto x^2 \qquad \text{for } 0 \leq x \leq 1$$

and we wish to estimate the expected value $E_X[\log(X)]$.

Because this is a simple example, we know that $E_X[\log(X)] = -1/3$ by calculating it analytically (the distribution is Beta$(3,1)$). But here we are going to use self-normalized importance sampling, with another Beta distribution, $f_Y(y) \propto (1-y)^2$ (i.e. Beta$(1,3)$), as the proposal, to illustrate the difference.
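
(As a quick numerical check of that $-1/3$ value, for instance in R:)

# check E_X[log(X)] = -1/3 for f_X(x) = 3 * x^2 on [0, 1]
integrate(function(x) log(x) * 3 * x^2, lower = 0, upper = 1)$value
# equivalently, X ~ Beta(3, 1), so E[log(X)] = digamma(3) - digamma(4) = -1/3
digamma(3) - digamma(4)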

  • In one case we compute it with the exact normalization factor. We can do this because we know the partition functions (for a Beta distribution they are not difficult to compute); here the ratio is $Z_X/Z_Y = \frac{\int_0^1 x^2\,{\rm d}x}{\int_0^1 (1-y)^2\,{\rm d}y} = 1$, so with $n$ proposal samples $y_i \sim f_Y$:

    $$E_X[\log(X)] \approx \frac{\frac{1}{n}\sum_{i} \log(y_i)\, \frac{y_i^2}{(1-y_i)^2}}{Z_X/Z_Y} = \frac{1}{n}\sum_{i} \log(y_i)\, \frac{y_i^2}{(1-y_i)^2}$$

  • In the other case we compute it with self-normalization, replacing the exact $Z_X/Z_Y$ by its importance-sampling estimate:

    $$E_X[\log(X)] \approx \frac{\frac{1}{n}\sum_{i} \log(y_i)\, \frac{y_i^2}{(1-y_i)^2}}{\frac{1}{n}\sum_{i} \frac{y_i^2}{(1-y_i)^2}}$$

So the difference is whether the factor in the denominator is a constant based on the partition functions (actually the ratio $Z_X/Z_Y$ of the partition functions of $X$ and $Y$), or a random variable, $\frac{1}{n}\sum_{i} {y_i^2}/{(1-y_i)^2}$.

Intuitively, you may guess that the latter will increase the bias and variance of the estimate.

The image below gives histograms of the estimates over $10^3$ repetitions, each based on a sample of size $100$.

[Figure: comparison of self-normalization vs. direct computation of $Z$]

ns <- 100    # sample size for each estimate
nt <- 10^3   # number of repeated estimates

mt <- rep(0, nt)   # direct estimates, using the known ratio Z_X/Z_Y = 1
zt <- rep(0, nt)   # estimated normalization: mean unnormalized importance weight

for (i in 1:nt) {
  y <- rbeta(ns, 1, 3)              # proposal samples from f_Y(y) = 3*(1-y)^2
  t <- log(y) * y^2 / (1 - y)^2     # log(y) times the unnormalized weight y^2/(1-y)^2
  z <- y^2 / (1 - y)^2              # unnormalized importance weights
  mt[i] <- mean(t)
  zt[i] <- mean(z)
}

h1 <- hist(mt,      breaks = seq(-1, 0, 0.01), main = "using known partition function")
h2 <- hist(mt / zt, breaks = seq(-1, 0, 0.01), main = "using self-normalization")