Solved – Interpreting the entropy of a Dirichlet distribution

Tags: categorical data, dirichlet distribution, entropy

I was looking for a measure to capture the "spikiness" of categorical histograms: if the histogram becomes unnaturally skewed towards a certain value at a given time, I want a metric that shows some kind of spike at that time. I considered a variety of metrics for this purpose and finally settled on the entropy of a Dirichlet distribution (treating the histogram of counts as the parameters of a Dirichlet and using the corresponding entropy as my metric). For this, I used the formula for the entropy given in the Wikipedia article:
$$H = \log B(\alpha) + (\alpha_0 - K)\,\psi(\alpha_0) - \sum_{i=1}^{K}(\alpha_i - 1)\,\psi(\alpha_i)$$

Here, $\alpha$ is the vector of counts in the categorical bins of the histogram, $\alpha_0 = \sum_i \alpha_i$, $B$ is the multivariate beta function, and $\psi$ is the digamma function. I implemented this in C# (implementation pasted below) and am having some trouble interpreting the results. I would expect an $\alpha$ with a flat distribution (uniformly spread across its categories) to have a higher entropy than one that is spiked towards a given category, and this holds true. The gap in interpretation arises between various kinds of flat distributions. My expectation was that an alpha described by the array {x,x,x} would have increasingly higher entropy as x increased. The reason for this belief is that higher counts should mean we are increasingly sure that the distribution is flat, which should increase the entropy. What I see in practice is this:

x: 0.1, Entropy: -13.025

x: 0.6, Entropy: -4.82

x: 1.0, Entropy: -0.693

x: 1.6, Entropy: -0.8164

x: 2.1, Entropy: -0.967
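
For reference, plugging the symmetric case $\alpha = (x, x, x)$ (so $K = 3$ and $\alpha_0 = Kx$) into the formula above gives

$$H(x) = K\log\Gamma(x) - \log\Gamma(Kx) + K(x-1)\left[\psi(Kx) - \psi(x)\right],$$

which for $x = 1$ evaluates to $-\log\Gamma(3) = -\log 2 \approx -0.693$, in agreement with the numbers listed.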

As you can see, there seems to be a maximum at x = 1. This goes against my intuition. Can anyone help me interpret this, or let me know if these results are rubbish?

    private static double regularizer = 0.1;

    /// <summary>
    /// Based on the formula for the total information entropy given here - https://en.wikipedia.org/wiki/Dirichlet_distribution.
    /// </summary>
    /// <param name="alpha">The parameters of the Dirichlet distribution. These correspond to a histogram of counts. Note that the array is modified in place by the regularizer.</param>
    /// <returns>The entropy of the Dirichlet distribution.</returns>
    public double Entropy(double[] alpha)
    {
        _2_gammafamily g = new _2_gammafamily();
        double alpha_0 = 0, H = 0;//The sum of the coefficients (normalizing factor) and the accumulated entropy, respectively.
        int K = alpha.Length;
        for (int i = 0; i < K; i++)
        {
            alpha[i] += regularizer;//Before doing anything else, regularize the parameters; this amounts to a weak symmetric Dirichlet prior.
            alpha_0 += alpha[i];
            H += g.Gammaln(alpha[i]);//Positive part of the normalization constant (the log of the multivariate beta function).
            H -= (alpha[i] - 1) * g.Digamma(alpha[i]); //The contribution from each of the alphas.
        }
        H -= g.Gammaln(alpha_0);//Negative part of the normalization constant.
        H += (alpha_0 - K) * g.Digamma(alpha_0);//The contribution from the normalizing factor.
        return H;
    }
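
For anyone who wants to reproduce the numbers without the `_2_gammafamily` helper, here is a minimal stand-alone sketch. It assumes the MathNet.Numerics package for `SpecialFunctions.GammaLn` and `SpecialFunctions.DiGamma` (an assumption, not part of the original code), and it omits the 0.1 regularizer, so it evaluates the entropy of the raw $\alpha$ vector directly; the class and method names are illustrative.

    using System;
    using MathNet.Numerics; // assumed: MathNet.Numerics NuGet package for GammaLn and DiGamma

    public static class DirichletEntropyCheck
    {
        // H = log B(alpha) + (alpha_0 - K) * psi(alpha_0) - sum_i (alpha_i - 1) * psi(alpha_i)
        public static double Entropy(double[] alpha)
        {
            int K = alpha.Length;
            double alpha0 = 0.0, logB = 0.0, crossTerm = 0.0;
            foreach (double a in alpha)
            {
                alpha0 += a;
                logB += SpecialFunctions.GammaLn(a);                  // numerator of log B(alpha)
                crossTerm += (a - 1.0) * SpecialFunctions.DiGamma(a); // sum_i (alpha_i - 1) psi(alpha_i)
            }
            logB -= SpecialFunctions.GammaLn(alpha0);                 // denominator of log B(alpha)
            return logB + (alpha0 - K) * SpecialFunctions.DiGamma(alpha0) - crossTerm;
        }

        public static void Main()
        {
            // Sweep the symmetric case alpha = (x, x, x) used in the question.
            foreach (double x in new[] { 0.1, 0.6, 1.0, 1.6, 2.1 })
            {
                double h = Entropy(new[] { x, x, x });
                Console.WriteLine($"x: {x}, Entropy: {h:F3}");
            }
        }
    }

The sweep should reproduce the values reported above (up to rounding), which suggests the original implementation itself is fine and the question is purely one of interpretation.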

Best Answer

In case people are interested, and for my own reference, I found the answer in the Wikipedia article on Dirichlet distributions. As we increase the parameter $\alpha$ of a symmetric Dirichlet distribution, samples become increasingly likely to have all components close to each other. For a 3-d Dirichlet, for example, we are more and more likely to see something near (.33, .33, .33). Since this becomes more likely, we can predict with greater certainty what a sample will look like, hence the reduction in entropy. When $\alpha$ is below 1, we are more likely to see sparse samples (where most components are zero or very small). In these cases too, the entropy becomes lower. The case $\alpha = 1$ is the uniform distribution over the simplex (the closest analogue of white noise), so the entropy is highest there.
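
To put a number on that last point: the Dirichlet with all $\alpha_i = 1$ has the constant density $1/B(\mathbf{1}) = \Gamma(K)$ on the simplex, so its differential entropy is

$$H_{\max} = -\log\Gamma(K),$$

which for $K = 3$ is $-\log 2 \approx -0.693$, exactly the peak seen in the table above. Since the uniform distribution maximizes differential entropy over any bounded support, no other choice of $\alpha$, symmetric or not, can give a higher value.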
