[Math] How to use the bits you get back from Bits Back Coding

Bits Back coding is a scheme to transmit an observation $x$.
You can read about it here^[1]. To my understanding, it works like this:

The encoder samples a message $z$ from a distribution $Q(z|x)$ that it computed.
The encoder sends $z$ to decoder, using a code based on $P(z)$, known to both.
The decoder recovers $z$ and computes $P(x|z)$.
The encoder sends $x$ to the decoder using a code based on $P(x|z)$.
The decoder recovers $x$.

In reality, the messages in steps 2 and 4 can be combined. Naively, the cost of this protocol would be the expected number of bits in these messages.

However, the trick is that once the decoder recovered $x$, it can compute $Q(z|x)$ using the same algorithm the encoder used, and then recover all the auxiliary bits it had to use internally in order to sample $z$ from $Q$. The expected number of these bits is then subtracted from the total cost. Those are "the bits you get back".

What I don't understand is what is the decoder going to do with these bits? How does knowing these bits reduces the overall transmission cost? Yes, the decoder is now aware of some bits that the encoder withdrew from a random-stream of bits, but how is this knowledge going to help in compressing the next observations?

^[1]: Frey and Hinton. Efficient Stochastic Source Coding and an Application to a Bayesian Network Source Model

Best Answer

I was wondering the same! But this is the answer I came up with (although I am not entirely sure if this reasoning is correct).

Since after the decoding process the decoder knows the distribution $Q(z|x)$ AND $P(x)$, he could then come up with an encoding that follows the distribution $P(z)/Q(z|x)$ and thus, represents each sample value $z$ with $-log_2(P(z)/Q(z|x))$ bits. Apparently, this representation of the sample values (and thus, of the model) would lead to maximal compression of the data values if the approximate variational posterior equals the actual posterior, thus $Q(z|x) = P(z|x)$.

Best Answer

Related Solutions

[Math] get standard error from correlation coefficient

[Math] What characterizations of relative information are known

Related Question