[Math] KL divergence of multinomial distribution

information theory, machine learning, probability, probability distributions, probability theory

Let $q(x)$ be a multinomial distribution over $\{1, \ldots, k\}$ with parameters $\{\theta_1,\ldots, \theta_k\}$, and let $p(x)$ be the uniform distribution over $\{1,\ldots, k\}$, i.e., $p(x)=\frac{1}{k}$. What is the KL divergence between them?

I know that for discrete distributions, $KL(p(x)\,||\,q(x))=\sum p(x)\log\frac{p(x)}{q(x)}$. But for the KL divergence between multinomial distributions, does $x$ become a vector $\mathbf{x}$? What is the log of the ratio then, and will the KL divergence itself also be a vector? Thanks so much.

Best Answer

Well, let's review a few helpful definitions that will clarify how to calculate the Kullback-Leibler divergence here.

By definition, the parameters of the multinomial distribution sum to 1; i.e., $$\sum_{m=1}^k\theta_m=1,$$

where $\theta_m$ is the probability of the $m^{\text{th}}$ outcome occurring.

The probability mass function (PMF) of the multinomial distribution is $$q(x)=\frac{n!}{\prod_{m=1}^k x_m!}\prod_{m=1}^k\theta_m^{x_m},$$ where $n$ is the total number of independent experiments executed, so that $$\sum_{m=1}^k x_m=n.$$

Now let's also consider a second multinomial distribution $p(x)$ with uniform parameters $\theta_m=\frac{1}{k}$: $$p(x)=\frac{n!}{\prod_{m=1}^k x_m!}\prod_{m=1}^k\left(\frac{1}{k}\right)^{x_m}=\frac{n!}{\prod_{m=1}^k x_m!}\left(\frac{1}{k}\right)^{n}.$$
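As a side note, here is a minimal numerical sketch of these two PMFs (assuming NumPy and SciPy's `scipy.stats.multinomial`; the values of $n$, $k$, $\theta$, and the outcome vector $x$ below are arbitrary illustrations, not part of the question):

```python
# Minimal sketch of the two PMFs above; n, k, theta, and x are arbitrary
# illustrative values, not taken from the question.
import numpy as np
from scipy.stats import multinomial

k, n = 4, 10
theta = np.array([0.1, 0.2, 0.3, 0.4])  # parameters of q(x), summing to 1
uniform = np.full(k, 1.0 / k)           # parameters of p(x), each equal to 1/k

x = np.array([2, 3, 1, 4])              # one outcome vector with x.sum() == n

q_x = multinomial.pmf(x, n=n, p=theta)    # n!/prod(x_m!) * prod(theta_m**x_m)
p_x = multinomial.pmf(x, n=n, p=uniform)  # n!/prod(x_m!) * (1/k)**n

print(q_x, p_x)
```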

Now compute the Kullback-Leibler divergence between $p(x)$ and $q(x)$. Inside the logarithm the multinomial coefficient $\frac{n!}{\prod_{m=1}^k x_m!}$ is common to both PMFs and cancels, and since $\mathbb{E}_{p(x)}[x_m]=\frac{n}{k}$ we get $$D_{KL}(p(x)\,||\,q(x))=\mathbb{E}_{p(x)}\left[\sum_{m=1}^k x_m\log\left(\frac{\frac{1}{k}}{\theta_m}\right)\right]=n\sum_{m=1}^k\frac{1}{k}\log\left(\frac{\frac{1}{k}}{\theta_m}\right),$$ i.e., $n$ times the divergence of a single trial. For a single trial ($n=1$), which is exactly the setting of the question (one draw from $\{1,\ldots,k\}$), the divergence may be written in a variety of equivalent ways: $$D_{KL}(p\,||\,q)=\sum_{m=1}^k \frac{1}{k}\log\left(\frac{\frac{1}{k}}{\theta_m}\right)=-\sum_{m=1}^k \frac{1}{k}\log\left(k\theta_m\right)=-\frac{1}{k}\log\left(k^k\prod_{m=1}^k\theta_m\right)=-\frac{1}{k}\left(k\log(k)+\log\left(\prod_{m=1}^k\theta_m\right)\right)$$ $$=-\log(k)-\frac{1}{k}\sum_{m=1}^k\log(\theta_m)=\log\left(\frac{1}{k}\right)-\sum_{m=1}^k\frac{1}{k}\log\left(\theta_m\right)=\sum_{m=1}^k\frac{1}{k}\log\left(\frac{1}{k}\right)-\sum_{m=1}^k\frac{1}{k}\log\left(\theta_m\right)$$ $$=-H\left(p\right)-\mathbb{E}_{p}[\log(q)]=\mathbb{E}_{p}\left[\log\left(\frac{1}{q}\right)\right]-H(p).$$ Notice that the result is a single scalar, not a vector: no vector-valued answer arises when calculating the Kullback-Leibler divergence between two multinomial distributions with the same number of categories.
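If you want to sanity-check the single-trial closed form numerically, here is a minimal sketch (assuming NumPy and SciPy; the particular values of $k$ and $\theta$ are arbitrary illustrations). It compares `scipy.stats.entropy`, which evaluates $\sum_m p_m\log\frac{p_m}{q_m}$ when given two distributions, against the closed form $-\log(k)-\frac{1}{k}\sum_m\log(\theta_m)$ derived above.

```python
# Numerical sanity check of the single-trial closed form; theta is an
# arbitrary illustrative choice of parameters.
import numpy as np
from scipy.stats import entropy

k = 4
theta = np.array([0.1, 0.2, 0.3, 0.4])  # q's parameters
uniform = np.full(k, 1.0 / k)           # p is uniform over {1, ..., k}

# Direct definition: sum_m p_m * log(p_m / q_m)
kl_direct = entropy(uniform, theta)

# Closed form from the derivation: -log(k) - (1/k) * sum_m log(theta_m)
kl_closed = -np.log(k) - np.mean(np.log(theta))

print(kl_direct, kl_closed)  # both are approximately 0.1218
```

Multiplying this single-trial value by $n$ then gives the divergence between the corresponding $n$-trial multinomials.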