I'm on a quest for the intuition behind the fact that theoretical introductions to approximate inference focus so much on the log partition function. Say we have a regular exponential family $$p(\mathbf{x};\boldsymbol{\theta}) = \exp\left(\boldsymbol{\phi}(\mathbf{x})^\top\boldsymbol{\theta} - \log Z(\boldsymbol{\theta})\right)$$ with sufficient statistics $\boldsymbol{\phi}(\mathbf{x})$, natural parameters $\boldsymbol{\theta}$, and partition function $Z(\boldsymbol{\theta})$. The partition function is of course defined by $$Z(\boldsymbol{\theta}) = \int\exp\left(\boldsymbol{\phi}(\mathbf{x})^\top\boldsymbol{\theta}\right){\rm d}\mathbf{x} \quad .$$
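To fix notation with a concrete case, here is a minimal sketch (assuming Python with NumPy; the Bernoulli family is my own illustrative choice, not something special about the question) writing a familiar distribution in this form and checking the normalization:

```python
import numpy as np

# Bernoulli as an exponential family: phi(x) = x, theta = log(p / (1 - p)).
# The partition function is Z(theta) = sum_x exp(phi(x) * theta) = 1 + exp(theta).
p = 0.3
theta = np.log(p / (1 - p))
Z = 1.0 + np.exp(theta)

# p(x; theta) = exp(phi(x) * theta - log Z(theta)) recovers the usual pmf.
for x in (0, 1):
    prob = np.exp(x * theta - np.log(Z))
    print(f"p(x={x}) = {prob:.4f}")  # prints 0.7000 and 0.3000
```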
EDIT: to clarify, the kind of problem I have in mind is where $\mathbf{x}$ is a latent variable in a graphical model with conditional exponential family distributions, as is the focus of Wainwright & Jordan (2008), for instance. Finding an optimal $\boldsymbol{\theta}$ may be a variational inference problem. Conditioned on some data, another common problem would be drawing posterior samples of $\mathbf{x}$.
In my experience, textbooks and tutorials on approximate inference often make claims like "inference is hard because computing the (log) partition function is hard." I don't doubt that computing the log partition function is hard, but I do fail to see why that is "the" barrier to inference.
First, let me explain where I am coming from. I have a decent grasp of the following:
- We need the partition function to compute expected values. If we only know the unnormalized distribution $p^*(\mathbf{x};\boldsymbol{\theta}) = \exp\left(\boldsymbol{\phi}(\mathbf{x})^\top\boldsymbol{\theta}\right)=p(\mathbf{x};\boldsymbol{\theta})Z(\boldsymbol{\theta})$, then we also only know $\mathbb{E}[f(\mathbf{x})]$ up to scaling by $Z(\boldsymbol{\theta})$.
- Exact inference is #P-Hard in the worst case.
- If we have the gradient of the log partition function, then we have the mapping between natural parameters and mean parameters, $$\nabla_\boldsymbol{\theta} \log Z(\boldsymbol{\theta}) = \mathbb{E}\left[\boldsymbol{\phi}(\mathbf{x})\right]\equiv\boldsymbol{\mu} \quad ,$$ and knowing the mean parameters $\boldsymbol{\mu}$ can aid in other stages of inference or in computing expected values in some circumstances (e.g. if $f$ lies in the span of $\boldsymbol{\phi}$, then $\mathbb{E}[f(\mathbf{x})]$ is linear in $\boldsymbol{\mu}$).
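As a quick numerical illustration of bullet (3), here is a sketch (assuming Python with NumPy; the Bernoulli family, where $\log Z(\theta) = \log(1 + e^\theta)$ and the mean parameter is $\mu = \sigma(\theta)$, is again an illustrative choice) comparing a finite-difference gradient of $\log Z$ with a Monte Carlo estimate of the mean parameter:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = -0.5
log_Z = lambda t: np.log(1.0 + np.exp(t))   # Bernoulli log partition function

# Finite-difference approximation of d/dtheta log Z(theta).
eps = 1e-5
grad_fd = (log_Z(theta + eps) - log_Z(theta - eps)) / (2 * eps)

# Monte Carlo estimate of E[phi(x)] = E[x] under p(x; theta).
p = 1.0 / (1.0 + np.exp(-theta))            # sigmoid(theta)
samples = rng.random(100_000) < p
mu_mc = samples.mean()

print(f"finite-difference gradient: {grad_fd:.4f}")  # ~sigmoid(-0.5) = 0.3775
print(f"Monte Carlo mean parameter: {mu_mc:.4f}")
```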
All that being said, I still don't get why computing $\log Z$ is "the" hard problem in inference.
Consider this thought experiment: imagine you are given an oracle that computes $Z(\boldsymbol{\theta})$ efficiently. What can you now do that you could not do before? Take bullet (1) above: can you now compute expected values more easily? It seems to me that there remains a difficult problem, namely computing a high-dimensional integral over $\mathbf{x}$. In fact, much of the space may have negligible probability mass. Personally, I would rather have an oracle that tells me which regions of $\mathbf{x}$-space to look in, i.e. one that solves the search problem for me, e.g. by providing a set of samples of $\mathbf{x}$ from the posterior or something close to it. Digging into this notion of "search" a little deeper, note that this is how Self-Normalized Importance Sampling (SNIS) works: you draw samples from a proposal distribution that is essentially a guess about where $\mathbf{x}$ has non-negligible mass, then plug in an estimate of $Z(\boldsymbol{\theta})$ based on those samples, namely $$\hat{Z}(\boldsymbol{\theta}) = \frac{1}{S}\sum_{i=1}^S \frac{p^*(\mathbf{x}^{(i)};\boldsymbol{\theta})}{q(\mathbf{x}^{(i)})} \qquad \mathbf{x}^{(i)}\sim q(\mathbf{x})\quad.$$ The hard problem in SNIS is constructing a good proposal distribution $q$; once you have that, you get $\hat{Z}(\boldsymbol{\theta})$ "for free."
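To make this concrete, here is a minimal sketch of that estimator of $Z$ (assuming Python with NumPy and SciPy; the unnormalized target $p^*(x) = e^{-x^2/2}$, whose true $Z = \sqrt{2\pi}$, and the wider-Gaussian proposal are my own illustrative choices):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

p_star = lambda x: np.exp(-0.5 * x**2)  # unnormalized target; true Z = sqrt(2*pi)
q = norm(loc=0.0, scale=1.5)            # proposal: a guess at where the mass is

S = 100_000
x = q.rvs(size=S, random_state=rng)
Z_hat = np.mean(p_star(x) / q.pdf(x))   # importance-sampling estimate of Z

print(f"Z_hat = {Z_hat:.4f}, true Z = {np.sqrt(2 * np.pi):.4f}")
```

How accurate $\hat{Z}$ is depends entirely on how well $q$ covers the mass of $p^*$, which is exactly the "search" problem above.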
One way to find the relevant regions of $\mathbf{x}$ would be to find the mode(s) of $p$. This means solving $$\nabla_\mathbf{x} \log p(\mathbf{x};\boldsymbol{\theta}) = \boldsymbol{\theta}^\top\nabla_\mathbf{x}\boldsymbol{\phi}(\mathbf{x}) = \mathbf{0} $$ (some abuse of notation here… you get the idea). But the difficulty of this depends on $\boldsymbol{\phi}$; the partition function is not involved.
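And indeed, mode-finding touches only the unnormalized density. A minimal sketch (assuming Python with SciPy; the hypothetical one-dimensional family $\boldsymbol{\phi}(x) = (x, x^2)$ is my own example):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical 1-D example: phi(x) = (x, x^2), so log p*(x) = theta1*x + theta2*x^2.
theta = np.array([1.0, -0.5])                   # theta2 < 0 keeps p* integrable

neg_log_p_star = lambda x: -(theta[0] * x[0] + theta[1] * x[0]**2)

# Finding the mode never touches Z(theta): only the unnormalized density appears.
result = minimize(neg_log_p_star, x0=np.array([0.0]))
print(f"mode: {result.x[0]:.4f}")               # analytic mode: -theta1/(2*theta2) = 1.0
```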
To summarize, I see inference as having two core problems: (a) a search problem for the relevant region of $\mathbf{x}$ (high-probability regions, modes, etc.), and (b) a normalization problem of computing (log) $Z(\boldsymbol{\theta})$. I am puzzled why the latter (b) receives so much attention, especially since solving (a) can give (b) for free, but not the other way around as far as I can tell. So, what is the intuition behind the emphasis on the log partition function?
Best Answer
Your SNIS example shows how the lack of knowledge about $\log Z$ can be worked around, but that doesn't mean that not knowing $\log Z$ is not a problem.

In fact, the SNIS method itself shows that not knowing $\log Z$ is a problem: we need a trick, the self-normalization, in order to get around it. If we knew $\log Z$, our sampling method would perform better.
Example
See for instance the example below, where we have a beta-distributed variable

$$f_X(x) \propto x^2 \qquad \text{for } 0 \leq x \leq 1 \quad ,$$

i.e. $X \sim \text{Beta}(3,1)$, and we wish to estimate the expected value of $\log(X)$.

Because this is a simple example, we know the answer analytically: $E_X[\log(X)] = \psi(3) - \psi(4) = -1/3$, where $\psi$ is the digamma function. But here we will use self-normalized importance sampling, drawing samples from another beta distribution, $f_Y(y) \propto (1-y)^2$ (i.e. $Y \sim \text{Beta}(1,3)$), to illustrate the difference.
In one case we compute it with an exact normalization factor. We can do this because we know $\log Z$; for a beta distribution it is not difficult. Here both normalization constants equal $3$, so their ratio is $1$ and the denominator reduces to the sample size $n$:

$$E_X[\log(X)] \approx \frac{\sum_{i=1}^n \log(y_i) \frac{y_i^2}{(1-y_i)^2}}{n}$$
In the other case we compute it with self-normalization:

$$E_X[\log(X)] \approx \frac{\sum_{i=1}^n \log(y_i) \frac{y_i^2}{(1-y_i)^2}}{\sum_{i=1}^n \frac{y_i^2}{(1-y_i)^2}}$$
So the difference is whether the factor in the denominator is a constant determined by the (ratio of the) partition functions of $X$ and $Y$, here simply $n$, or the random variable $\sum_{i=1}^n y_i^2/(1-y_i)^2$.
Intuitively, you might guess that the latter increases the bias and variance of the estimate.
The image below gives histograms of the estimates for samples of size $n = 100$.
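A short simulation in the same spirit (a sketch assuming Python with NumPy and SciPy; the replicate count and seed are arbitrary choices) compares the two estimators across repeated samples of size $n = 100$:

```python
import numpy as np
from scipy.stats import beta
from scipy.special import digamma

rng = np.random.default_rng(42)
n, replicates = 100, 10_000
true_value = digamma(3) - digamma(4)     # E_X[log X] for Beta(3,1), equals -1/3

# Draw all proposal samples at once: Y ~ Beta(1,3), i.e. f_Y(y) proportional to (1-y)^2.
y = beta(1, 3).rvs(size=(replicates, n), random_state=rng)
w = y**2 / (1 - y)**2                    # unnormalized importance weights

exact = (np.log(y) * w).sum(axis=1) / n                   # known normalization: divide by n
self_norm = (np.log(y) * w).sum(axis=1) / w.sum(axis=1)   # SNIS: divide by sum of weights

for name, est in [("exact", exact), ("self-normalized", self_norm)]:
    print(f"{name:>16}: bias = {est.mean() - true_value:+.4f}, sd = {est.std():.4f}")
# Histograms of `exact` and `self_norm` reproduce the comparison in the figure.
```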