Using mutual information in Bayesian statistics

bayesian, conditional-expectation, entropy, information-theory, machine-learning

This question is about Bayesian statistics and active learning. I'm using the following notation:

  • $x$ is an input variable
  • $y$ is an output variable
  • $\theta$ are latent parameters
  • a model $p(y|x, \theta)$ is given
  • the data observed so far, $D = \{(x_i, y_i)\}_{i=1}^n$, is given
  • a model of the latent parameters, $p(\theta|D)$, is given

Active learning describes a problem where new points $x$ can be chosen freely and the answers $y$ are given by an oracle. Each such interaction provides new data $D' = D \cup (x, y)$ and leads to a new posterior over the parameters, $p(\theta | D')$.
The goal is to choose each $x$ as efficiently as possible, i.e. to maximize the information gain.
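To make the loop concrete, here is a minimal runnable sketch under toy assumptions of my own (none of this comes from the paper): the inputs $x$ index one of $K$ coins, $\theta$ collects the unknown coin biases, so $p(y=1|x, \theta) = \theta_x$, and conjugate Beta posteriors play the role of $p(\theta|D)$. The names `oracle` and `acquisition_score` are placeholders.

```python
import random

# Toy model (my own, not from the paper): x selects one of K coins,
# theta_x is that coin's unknown bias, and p(y=1 | x, theta) = theta_x.
# Independent Beta(1, 1) priors make the posterior update conjugate.
K = 3
true_theta = [0.2, 0.5, 0.9]                    # hidden from the learner
posterior = [[1.0, 1.0] for _ in range(K)]      # Beta(alpha, beta) per coin

def oracle(x):
    """The oracle answers a query at x with a draw from p(y | x, theta)."""
    return 1 if random.random() < true_theta[x] else 0

def acquisition_score(x, post):
    """Placeholder acquisition function; choosing it well is the whole point."""
    return random.random()

D = []                                          # data observed so far
for _ in range(20):
    x = max(range(K), key=lambda xc: acquisition_score(xc, posterior))
    y = oracle(x)                               # ask the oracle
    D.append((x, y))                            # D' = D ∪ (x, y)
    posterior[x][0] += y                        # conjugate update to p(theta | D')
    posterior[x][1] += 1 - y

print(posterior)
```

The rest of the question is about how the acquisition score should be chosen.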

This notation is taken from the paper Bayesian Active Learning for Classification and Preference Learning, which is also where my question comes from. If you want to read up on it, the relevant part is right at the beginning: essentially the first two equations, on page 3.

Query points are chosen to improve the estimate of the latent parameters. This leads to an optimization problem that maximizes the expected decrease in posterior entropy:

$argmax_x \left(H[\theta|D] - \mathbb{E}_{y \sim p(y|x, D)}[H[\theta | D \cup (x,y)]]\right)$

The authors then point out a problem with this formulation: the entropies are computed over the latent parameters $\theta$, based on their posterior, and the expectation requires a hypothetically updated posterior $p(\theta | D \cup (x,y))$ for every possible outcome $y$ of every candidate $x$. This is computationally expensive.
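To see where the cost comes from, here is that objective written out for the toy coin model sketched above (again my own example, not the paper's): evaluating the expectation needs the entropy of a hypothetically updated posterior $p(\theta | D \cup (x,y))$ for every possible answer $y$, and with a non-conjugate model each of those updates is a full inference run.

```python
from scipy.stats import beta

def expected_entropy_decrease(a, b):
    """H[theta | D] - E_y[ H[theta | D ∪ (x, y)] ] for a coin whose current
    posterior is Beta(a, b); the entropies of the other coins cancel in the
    difference because their posteriors do not change."""
    h_now = beta(a, b).entropy()              # H[theta | D] for the queried coin
    p1 = a / (a + b)                          # p(y=1 | x, D), the posterior mean
    h_if_1 = beta(a + 1, b).entropy()         # H[theta | D ∪ (x, 1)]: needs an updated posterior
    h_if_0 = beta(a, b + 1).entropy()         # H[theta | D ∪ (x, 0)]: needs another one
    return h_now - (p1 * h_if_1 + (1 - p1) * h_if_0)

print(expected_entropy_decrease(2.0, 5.0))
```

The Beta posterior makes each hypothetical update trivial here, but the structure is the point: the number of posterior updates and entropy evaluations over $\theta$ grows with the number of candidate queries and possible outcomes.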

The objective of the original problem is exactly the mutual information between the latent parameters and the query outcome. It can be written as:

$argmax_x I[\theta ,y | x, D]$

Mutual information is symmetric in its two arguments, $I[\theta; y] = I[y; \theta]$. This makes it possible to rewrite the problem as:

$argmax_x I[y, \theta| x, D] = argmax_x \left(H[y | x, D] -\mathbb{E}_{\theta \sim p(\theta|D)}[H[y|x, \theta]]\right)$

Now, this supposedly fixes the computational problem: Entropies are now computed over $y$, which is typically in a much smaller space than $\theta$.
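Here is the rearranged objective for the same toy model (again a sketch of mine, not code from the paper): both terms are now entropies of the binary outcome $y$, and $\theta$ only enters through samples from the current posterior $p(\theta|D)$.

```python
import math
import random

def binary_entropy(p):
    """Entropy (in nats) of a Bernoulli(p) outcome."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def bald_score(a, b, n_samples=10_000):
    """H[y | x, D] - E_{theta ~ p(theta | D)}[ H[y | x, theta] ] for a coin
    whose current posterior is Beta(a, b), estimated by Monte Carlo over theta."""
    thetas = [random.betavariate(a, b) for _ in range(n_samples)]
    predictive = sum(thetas) / n_samples                 # p(y=1 | x, D): mixture over theta
    marginal = binary_entropy(predictive)                # H[y | x, D]
    conditional = sum(binary_entropy(t) for t in thetas) / n_samples  # E_theta[H[y | x, theta]]
    return marginal - conditional

print(bald_score(2.0, 5.0))
```

Up to Monte Carlo error this returns the same number as the expensive form above, which is exactly the symmetry of mutual information being exploited, and only samples from the current posterior $p(\theta|D)$ are ever needed.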

Which finally brings me to my question:

I don't understand how mutual information is used here. It looks to me as if the two terms in the last equation are the same.

The first term is the entropy of $y$ given the data seen so far $D$ and the new input $x$, i.e. $H[y | x, D]$. But that is conditioned on all the information we have. Isn't this already the posterior entropy of $y$ after the query?

When I write out the formulas for the entropy, it also comes out the same:
We only have the model $p(y | x, \theta)$, so to compute an entropy over $y$ we have to go through the latent parameters. As far as I can see, the two terms are identical:

$H[y|x, D] = \mathbb{E}_{\theta \sim p(\theta | D)} \left[\mathbb{E}_{y \sim p(y | x, \theta)} \left[-\log p(y|x, \theta)\right]\right] = \mathbb{E}_{\theta \sim p(\theta|D)}\left[H[y|x, \theta]\right]$

Proposed solution:

I have thought about this, and I think the rewritten version of the optimization problem is missing an update to $\theta$. The second entropy term calculates the expected entropy of $y$ based on samples of $\theta$ drawn from the current belief $p(\theta | D)$. This doesn't really fit the idea of mutual information: how much do you know about $y$ after seeing the new $\theta$?

I've rewritten it below, using $\theta'$ to denote the updated belief. It is the expected entropy of $y$ under the updated latent parameters, which in turn depend on the expected outcome $y$ under the current parameters. Sorry, it will be horribly nested:

$\mathbb{E}_{\theta \sim p(\theta | D)} \left[\mathbb{E}_{y \sim p(y | x, \theta)} \left[\mathbb{E}_{\theta' \sim p(\theta' | x, y, D)} \left[H[y' | x, \theta']\right]\right]\right]$

This new version makes a lot of sense to me, but I'm not sure whether it is correct, especially since the authors say about their newly derived version (last paragraph on page 3):

Also $\theta$ is now only conditioned on $D$, so only $O(1)$ posterior updates are needed

Well, if my derivation is correct, that doesn't seem to be true; on the contrary, there is still an integral over the updated posterior of $\theta$. Can someone explain how mutual information is properly used here?

Best Answer

I think I found the error in my calculations. The first term $H[y | x, D]$ does indeed depend on $p(\theta|D)$, but through the predictive distribution $p(y | x, D) = \int p(y|x, \theta)\, p(\theta | D)\, d\theta$. The entropy integrates the negative log of this mixture:

$H[y|x, D] = -\int \left(\int p(y|x, \theta)\, p(\theta|D)\, d\theta\right) \log \left(\int p(y|x, \theta)\, p(\theta|D)\, d\theta\right) dy \;\neq\; -\int\!\!\int p(\theta|D)\, p(y|x, \theta) \log p(y|x, \theta)\, d\theta\, dy = \mathbb{E}_{\theta \sim p(\theta|D)}\left[H[y|x, \theta]\right]$

The difference is that in my calculation I moved the probability of $\theta$ out of the logarithm, which turns the entropy of the predictive mixture into an expectation of entropies, and that step is not valid.
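A tiny numerical check (my own example) makes the gap concrete: let $y$ be binary and let $\theta$ take the values $0.1$ and $0.9$ with equal posterior probability. The entropy of the predictive mixture is strictly larger than the expected entropy, and the difference is exactly the mutual information that the rearranged objective maximizes.

```python
import math

def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli(p) outcome."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Two equally likely parameter values under p(theta | D).
thetas, weights = [0.1, 0.9], [0.5, 0.5]

# H[y | x, D]: entropy of the mixture p(y | x, D) = sum_theta p(y | x, theta) p(theta | D).
mixture = sum(w * t for w, t in zip(weights, thetas))                  # = 0.5
entropy_of_mixture = binary_entropy(mixture)                           # = 1.0 bit

# E_theta[ H[y | x, theta] ]: here the log stays inside each component.
expected_entropy = sum(w * binary_entropy(t) for w, t in zip(weights, thetas))  # ~ 0.47 bits

print(entropy_of_mixture, expected_entropy, entropy_of_mixture - expected_entropy)
```

Because entropy is concave, Jensen's inequality guarantees $H[y|x, D] \ge \mathbb{E}_{\theta \sim p(\theta|D)}[H[y|x, \theta]]$, with equality only when $p(y|x, \theta)$ does not depend on $\theta$, i.e. when the query can tell us nothing about the parameters.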
