[Math] Optimization of relative entropy

entropy, it.information-theory, oc.optimization-and-control

I am wondering whether the following question is an application of information theory:

Let's say we have a factory and ship boxes of stuff out of it. If a competitor stands outside my factory, observes the stream of boxes coming out, and notes their sizes, he can get valuable information about my manufacturing process. So, in order to hoodwink him, I manually expand the boxes to sizes different from the original ones. To make sure that I am doing a good job of masking the original size distribution, I must maximize the relative entropy between the output distribution q and the input distribution p, given that I cannot simply produce a uniform output distribution. Is this premise correct? Do you find any flaw in the question?

The constraint, of course, is that we must use a minimal amount of box material. In other words, the sum of box sizes per unit period of time must not exceed a threshold B,
i.e. $\frac{\sum_{j=1}^{i} BS_j}{T} \le B$.

Other constraints are:

1] Both p and q must be probability distributions, i.e. they must sum to 1.
2] $BS_i \le MS$, where MS is the maximum size of a box.

If we take $BS_i$ to be the size of the $i^{th}$ box, then the Lagrangian with multiplier $\lambda$ is:

$ \Lambda(BS_i,\lambda,p,q) = D(q\|p) + \lambda \left(\frac{\sum_{j=1}^{i} BS_j}{T} - B\right) $
$ = \sum_{S=1}^{n} q(S) \log \frac{q(S)}{p(S)} + \lambda \left(\frac{\sum_{j=1}^{i} BS_j}{T} - B\right) $

$ = \sum_{S=1}^{n} q(S) \log q(S) - \sum_{S=1}^{n} q(S) \log p(S) + \lambda \left(\frac{\sum_{j=1}^{i} BS_j}{T} - B\right) $
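To make the objective concrete, here is a minimal numerical sketch of evaluating $D(q\|p)$ plus the budget penalty term. The distributions, the shipped box sizes, the time window $T$, the budget $B$, and the multiplier $\lambda$ below are all made-up values, not part of the original question:

```python
import numpy as np

def kl_divergence(q, p):
    """Relative entropy D(q||p) in bits, assuming q and p are valid p.m.f.'s
    over the same support and p(s) > 0 wherever q(s) > 0."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0                      # 0 * log 0 is taken as 0
    return float(np.sum(q[mask] * np.log2(q[mask] / p[mask])))

# Hypothetical input (true) and output (masked) box-size distributions
p = [0.6, 0.3, 0.1]                   # p.m.f. over box sizes {1, 2, 3}
q = [0.3, 0.4, 0.3]                   # candidate output p.m.f.

# Hypothetical budget data: box sizes shipped so far, time window T, budget B
BS = [1, 2, 2, 3, 1]
T, B, lam = 5.0, 2.0, 0.5

objective = kl_divergence(q, p) + lam * (sum(BS) / T - B)
print(objective)
```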

To proceed:
Let's say we are at the $i^{th}$ iteration. We know the actual box sizes that we have already sent out in iterations $1 \dots i-1$, and the actual size of the present box; i.e., $BS_1, \dots, BS_i$ are known, so $p(BS)$ can be calculated and is a known value. We also know the output box sizes up to the $(i-1)^{th}$ iteration.

So, we have,

$ \Lambda(BS_i,\lambda,q) = \sum_{S=1}^{n} q(S) \log q(S) - \sum_{S=1}^{n} k \, q(S) + \lambda \left(\frac{\sum_{j=1}^{i} BS_j}{T} - B\right), $

where $k$ stands for the known value of $\log p(S)$.

How do we now solve this optimization problem in order to get the output box size for the $i^{th}$ iteration?

We have,
$\frac{\partial\Lambda}{\partial BS_i} = 0$,
$\frac{\partial\Lambda}{\partial q(S)} = 0$, and
$\frac{\partial\Lambda}{\partial \lambda} = 0$

The partial differentiation with respect to $\lambda$ results in a trivial solution. Since $q(S)$ depends on $BS_i$, we will get a factor of $\frac{\partial q(S)}{\partial BS_i}$ in the other two equations.
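One way to see what the constrained problem looks like numerically is to optimize $q$ directly, which sidesteps the $\frac{\partial q(S)}{\partial BS_i}$ coupling. The sketch below is not the exact formulation above (the budget is reinterpreted here as a cap on the expected box size, and all sizes, the budget, and the starting point are arbitrary assumptions); it simply maximizes $D(q\|p)$ over $q$ subject to normalization and that budget, using scipy's SLSQP solver:

```python
import numpy as np
from scipy.optimize import minimize

def kl_divergence(q, p):
    q = np.clip(q, 1e-12, None)          # avoid log(0) during the search
    return float(np.sum(q * np.log(q / p)))

sizes = np.array([1.0, 2.0, 3.0])        # allowed box sizes (all <= MS)
p     = np.array([0.6, 0.3, 0.1])        # observed input distribution
B     = 2.0                              # assumed average-size budget per box

constraints = [
    {"type": "eq",   "fun": lambda q: np.sum(q) - 1.0},        # q is a p.m.f.
    {"type": "ineq", "fun": lambda q: B - np.dot(q, sizes)},    # material budget
]
bounds = [(0.0, 1.0)] * len(sizes)

# Maximize D(q||p)  <=>  minimize -D(q||p); start from the uniform p.m.f.
res = minimize(lambda q: -kl_divergence(q, p), x0=np.full(3, 1/3),
               bounds=bounds, constraints=constraints, method="SLSQP")
print(res.x, -res.fun)
```

Since $D(q\|p)$ is convex in $q$, its maximum over this polytope is attained at a vertex, and SLSQP only returns a local stationary point, so this is a sanity-check sketch rather than a guaranteed global solution.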

How do I proceed from here? What am I getting wrong?

Best Answer

Let us imagine that your factory manufactures two products, one of which is small, and the other is large. These products are shipped out in boxes. Suppose that your boxes come in two sizes, small and large. Suppose further that you can ship a small product in a large box, but that you cannot ship a large product in a small box.

Instead of products / boxes of various sizes, a more information-theoretic way of looking at things would be to think of the factory as a binary source, and to view the box-enlargement process as a binary channel. Let $X$ and $Y$ be discrete random variables with alphabets $\mathcal{X}$ and $\mathcal{Y}$, respectively, where $\mathcal{X} = \mathcal{Y} = \{0,1\}$. If the output of the production line is a small product, then $X = 0$, otherwise $X = 1$. If a small box is shipped out, then $Y = 0$, otherwise $Y = 1$. Hence, the random variable $X$ gives us the size of the product, while the random variable $Y$ gives us the size of the box. We can view $X$ and $Y$ as the input and output of a binary channel, respectively.

To deceive your competitors, every time a small product is ready to be shipped you flip a coin and, depending on the outcome, you choose to ship the small product in a large box or not. If you do so, then $X = 0$ and $Y = 1$. The "channel" has introduced an error. The channel is defined by the transition probabilities

$\{ \mathbb{P}[Y = 0 \mid X = 0], \mathbb{P}[Y = 1 \mid X = 0], \mathbb{P}[Y = 0 \mid X = 1], \mathbb{P}[Y = 1 \mid X = 1] \}$.
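As a quick numerical illustration of this channel view (the product mix and the crossover probability below are arbitrary assumptions, not given in the question), the joint p.m.f. and the conditional entropy $H(X \mid Y)$ follow directly from the p.m.f. of $X$ and the transition probabilities:

```python
import numpy as np

p_X = np.array([0.7, 0.3])             # assumed product mix: P[X=0], P[X=1]
# Transition matrix W[x, y] = P[Y=y | X=x]; a large product never fits
# in a small box, so P[Y=0 | X=1] = 0.
eps = 0.2                              # assumed prob. of a small product in a large box
W = np.array([[1 - eps, eps],
              [0.0,     1.0]])

joint = p_X[:, None] * W               # P[X=x, Y=y]
p_Y = joint.sum(axis=0)                # marginal p.m.f. of Y

# H(X|Y) = -sum_{x,y} P(x,y) log2 P(x|y)
cond = joint / p_Y                     # P[X=x | Y=y]
H_X_given_Y = -np.sum(joint[joint > 0] * np.log2(cond[joint > 0]))
print(H_X_given_Y)
```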

A competitor observes the sizes of the boxes being shipped out and tries to infer what the actual sizes of the products inside the boxes are. In other words, your competitor would like to infer what the probability mass function (p.m.f.) of $X$ is, knowing only the p.m.f. of $Y$. To keep your competitor maximally confused, you would like to maximize the conditional entropy $H (X \mid Y)$, which is the uncertainty about $X$ given $Y$. Recall that the mutual information is

$I (X;Y) = H(X) - H(X \mid Y)$

and it gives us the reduction in the uncertainty of $X$ due to knowledge of $Y$. We would like to minimize the mutual information, which is equivalent to maximizing the conditional entropy $H(X \mid Y)$, since $H(X)$ is fixed (it depends only on the p.m.f. of $X$, which is assumed to be fixed).

The mutual information can be written as $I(X;Y) = D( p(x,y) \| p(x) p(y) )$, which is the Kullback-Leibler distance between the joint p.m.f. and the product of the marginal p.m.f.'s. Check [1] for details. Therefore, you have a relative entropy minimization problem.
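The identity can be checked numerically. The following self-contained sketch (with the same kind of made-up product mix and channel as above) computes $I(X;Y)$ both as $H(X) - H(X \mid Y)$ and as the Kullback-Leibler distance between the joint p.m.f. and the product of the marginals:

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a p.m.f. given as a 1-D array."""
    p = np.asarray(p, float)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

p_X = np.array([0.7, 0.3])                       # assumed product mix
W = np.array([[0.8, 0.2], [0.0, 1.0]])           # assumed channel P[Y=y | X=x]

joint = p_X[:, None] * W                         # p(x, y)
p_Y = joint.sum(axis=0)                          # p(y)

# I(X;Y) as H(X) - H(X|Y), with H(X|Y) = H(X,Y) - H(Y)
H_X_given_Y = entropy(joint.ravel()) - entropy(p_Y)
I_1 = entropy(p_X) - H_X_given_Y

# I(X;Y) as D( p(x,y) || p(x) p(y) )
prod = np.outer(p_X, p_Y)
mask = joint > 0
I_2 = float(np.sum(joint[mask] * np.log2(joint[mask] / prod[mask])))
print(I_1, I_2)                                  # the two values coincide
```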

Usually, we are given the channel, and we choose the p.m.f. of $X$ that maximizes the mutual information $I(X;Y)$. In this problem, we are given the p.m.f. of $X$, and we choose the channel that minimizes the mutual information. It's a sort of "dual" of finding the capacity of a given channel.
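To illustrate this "dual" problem, one can fix the p.m.f. of $X$ and search over the single free channel parameter, the probability of putting a small product in a large box. The numbers below are made up, and the box-material budget of the original question is added so that the answer is not trivially "ship everything in large boxes," which would drive $I(X;Y)$ to zero:

```python
import numpy as np
from scipy.optimize import minimize_scalar

p_X = np.array([0.7, 0.3])        # assumed product mix (fixed)
sizes = np.array([1.0, 2.0])      # assumed material cost of a small / large box
budget = 1.5                      # assumed cap on expected material per shipped box

def mutual_information(eps):
    """I(X;Y) in bits for the channel P[Y=1|X=0] = eps, P[Y=1|X=1] = 1."""
    W = np.array([[1 - eps, eps], [0.0, 1.0]])
    joint = p_X[:, None] * W
    p_Y = joint.sum(axis=0)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / np.outer(p_X, p_Y)[mask])))

# Expected material per box is p_X[0]*(1-eps)*1 + (p_X[0]*eps + p_X[1])*2,
# which is increasing in eps, so the budget caps eps from above.
eps_max = np.clip((budget - np.dot(p_X, sizes)) / p_X[0], 0.0, 1.0)

res = minimize_scalar(mutual_information, bounds=(0.0, eps_max), method="bounded")
print(res.x, res.fun)             # most confusing feasible channel and its I(X;Y)
```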

References:

[1] Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, John Wiley & Sons 2006.
