Solved – Calculating Log Prob. of Dirichlet distribution in High Dimensions

dirichlet distribution

I'm interested in calculating the log probability of data drawn from a Dirichlet distribution. In particular, I'm interested in calculations that are stable in high dimensions, perhaps 1000 dimensions or more.

I've tried to use the "dirichlet_logProb.m" function from Tom Minka's fastfit toolbox: http://research.microsoft.com/en-us/um/people/minka/software/fastfit/. However, in high dimensions, this seems to give unexpected large positive density values to every data vector I provide.

Certainly, when a distribution is sharply peaked, we can have arbitrarily large (greater than one) density values at some points in the support. For example, think of a zero-mean Normal distribution with very small variance… it will give very large density (>>1) to x close to zero, but vanishingly small density (<<1) to x more than a few standard deviations away.

The same should be true of any distribution: we should always be able to find points that have large density and small density, since the density needs to integrate to one over the support. However, in high dimensions I cannot get my Dirichlet log probability calculations to work out this way… every single point I try has density greater than one.

Here's a sketch of my experiment: I consider a symmetric Dirichlet with concentration parameter <1. This should prefer data that are sparse, where only some dimensions have significant probability mass, and all others are near zero. I then look at three possible draws, which I illustrate here using K=4 dimensions.

Extremely sparse: x = [1 0 0 0]
Halfway uniform: x = [0.5 0.5 0 0]
Uniform: x = [0.25 0.25 0.25 0.25]

Given concentration parameter lambda < 1, we should expect 1 to have much larger density than 2, and 2 to have much larger density than 3. Since the uniform is basically the opposite of sparsity, we should also expect this outcome to have negligible (less than one) density… which corresponds to negative log probability.

Here is a result of the computed log probabilities for these three outcomes, across many values of the dimension K, with concentration=0.01. Remember, these are log probabilities, so positive values imply density > 1, and negative values density < 1.

        [1 0...0]  [unif 0..0] [ unif ]
K=   4   9.18e+01   5.59e+01  -9.71e+00
K=  10   2.77e+02   1.43e+02  -2.09e+01
K=  50   1.52e+03   7.85e+02  -3.58e+01
K= 100   3.07e+03   1.64e+03  -4.04e+00
K= 500   1.55e+04   8.98e+03   7.80e+02
K=1000   3.11e+04   1.87e+04   2.25e+03

You can reproduce this table with this code: http://gist.github.com/3699842

The problem, of course, is that with K=500 and K=1000 all of the outcomes 1,2,3 have very positive log probability (density >> 1). This seems troublesome to me, since I'd expect the most unlikely outcome (the uniform) to have density less than one… am I mistaken somehow?

Can anybody suggest an answer? Is this a numerical issue? Or do I misunderstand something about the Dirichlet? My prime suspect is the fact that the support region is a bit funny (a simplex, rather than the reals).

Best Answer

The p.d.f of the Dirichlet distribution is defined as $$ f(\theta; \alpha) = B^{-1} \prod_{i=1}^K \theta_i^{\alpha_i - 1} $$ where $B(\alpha)$ is the generalized Beta function. Notice that if any $\theta_i$ is 0, then the whole product is zero. In other words, the support of a Dirichlet distribution is over vectors $\theta$ where each $\theta_i \in (0, 1)$ and $\sum_{i=1}^K \theta_i = 1$. I'm not familiar with Minka's toolkit, but it is bound to have problems with data that includes 0's.

As for the uniform column, I believe those values are correct. Here my python code I used to test:

import math

def lbeta(alpha):
    return sum(math.lgamma(a) for a in alpha) - math.lgamma(sum(alpha))

def ldirichlet_pdf(alpha, theta):
    kernel = sum((a - 1) * math.log(t) for a, t in zip(alpha, theta))
    return kernel - lbeta(alpha)

for k in [4, 10, 50, 100, 500, 1000]:
    print ldirichlet_pdf([.01] * k, [1.0 / k] * k)

Running this script yields the output:

4 -9.71111566837
10 -20.946493708
50 -35.7564901905
100 -4.03613939138
500 779.669123528
1000 2251.99967563

Now lets generate some more likely vectors from our Dirichlet distribution with K=1000. The code for this is quite simple:

def sample_dirichlet(alpha):
    gammas = [random.gammavariate(a, 1) for a in alpha]
    norm = sum(gammas)
    return [g / norm for g in gammas]

Now if we use this function in combination with our previous ldirichlet_pdf function, we'll see that for K=1000, 2e+3 is a relatively small density. For example, the results of the following code:

alpha = [.01] * 1000
ldirichlet_pdf(alpha, sample_dirichlet(alpha))

yield values between 9.4e+4 and 1e+5.

The key insight here is to realize that you do not need to have values less than 1 in order for the integral to evaluate to 1. A simple example is $\int_0^1 1 dx = 1$. It just so happens that for the symmetric Dirichlet with K=1000 and concentration .01, the p.d.f. is greater than 1 everywhere, and yet the integral over the entire support is still 1. In higher dimensions, you'll need to have a much smaller concentration to get the uniform to have a negative log p.d.f. For example, with the concentration at .0001, and K=1000, the uniform vector has a log p.d.f. around -2.3e+3.

Best Answer

Related Solutions

Solved – a Dirichlet prior

Distributions – Characterizing Multinomial as Discretized Dirichlet Distribution

Notation

Very informal justification

Rigorous proof

Step 1

Step 2

Step 3

Related Question