Solved – What does it mean to sample a probability vector from a Dirichlet distribution

dirichlet-distribution, distributions, probability, sampling

I'm currently learning about Latent Dirichlet Allocation. I'm watching a video here: http://videolectures.net/mlss09uk_blei_tm/ and got stuck at minute 45, when he starts to explain sampling from the distribution.

I also tried consulting a machine learning book, but it doesn't have a detailed introduction to the Dirichlet distribution. The book mentions an example of sampling "probability vectors" from the Dirichlet distribution, but what does that mean?

I understand sampling from a distribution as getting random values for the random variables according to that distribution. So let $p_{X,Y}(x,y)$ be the pmf of some distribution; sampling from this distribution means I get a random pair $(x,y)$ (i.e., random values for $x$ and $y$). To get the probability of the event $(X=x \text{ and } Y=y)$ we evaluate the pmf of the distribution, so we get a single number. But what are "probability vectors" here?

I attached a screenshot for the book. I really hope you can help!


Best Answer

A Dirichlet distribution is often used to probabilistically categorize events among several categories. Suppose the vector of probabilities over weather outcomes follows a Dirichlet distribution. A single draw might then say that tomorrow's weather has probability 0.25 of being sunny, probability 0.5 of rain, and probability 0.25 of snow. Collecting these values in a vector gives a vector of probabilities, and that vector is one sample from the Dirichlet.
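To make this concrete, here is a minimal sketch using NumPy. The concentration parameters are illustrative (one per category: sunny, rain, snow); the point is that each draw is itself a probability vector, with nonnegative entries summing to 1.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative concentration parameters, one per category
# (sunny, rain, snow); larger values pull mass toward that category.
alpha = np.array([2.5, 5.0, 2.5])

# Each row of `samples` is one draw: a probability vector.
samples = rng.dirichlet(alpha, size=4)

for p in samples:
    print(p, "sum =", p.sum())
```

Running this prints four different probability vectors, each summing to 1, which is exactly what "sampling a probability vector" means.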

Another way to think about a Dirichlet distribution is the process of breaking a stick. Imagine a stick of unit length. Break that stick anywhere and retain one of the two pieces. Then break the remaining piece into two pieces and continue this as long as you desire. All of the pieces together must sum to unit length, and allocating pieces of different lengths to different events represents the probability of that event.
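The stick-breaking process above can be sketched in a few lines. This is only an illustration of the idea: each break snaps off a random fraction of whatever stick remains (here drawn from a Beta distribution with illustrative parameters, not tuned to match any particular Dirichlet), and the resulting pieces always sum to 1.

```python
import numpy as np

rng = np.random.default_rng(1)

def break_stick(n_pieces, a=1.0, b=1.0):
    """Break a unit-length stick into n_pieces by repeatedly
    snapping off a Beta(a, b) fraction of what remains.
    The parameters a, b are illustrative placeholders."""
    pieces = []
    remaining = 1.0
    for _ in range(n_pieces - 1):
        frac = rng.beta(a, b)           # where to break the remaining piece
        pieces.append(remaining * frac)  # keep one of the two parts
        remaining *= (1.0 - frac)        # continue with the other
    pieces.append(remaining)             # the last piece is whatever is left
    return np.array(pieces)

pieces = break_stick(5)
print(pieces, "total length =", pieces.sum())
```

By construction the piece lengths are positive and sum to unit length, so they can be read as probabilities allocated to events.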

If you're familiar with the beta distribution, the Dirichlet distribution may become even clearer. A beta distribution is often used to describe a distribution of probabilities of dichotomous events, so it's restricted to the unit interval. For example, for a Bernoulli trial, there is only a parameter $\theta$ describing the probability of a "success." Often we think of $\theta$ as being fixed, but if we are uncertain about the "true" value of $\theta$, we could think about a distribution of all possible $\theta$s, with larger density for those we consider more plausible, so perhaps $\theta \sim \text{B}(\alpha, \beta)$, where $\alpha>\beta$ concentrates more of the mass near 1 and $\beta > \alpha$ concentrates more of the mass near 0.
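A quick numerical check of that last claim, with arbitrarily chosen parameter values: samples from $\text{B}(8, 2)$ should cluster near 1, and samples from $\text{B}(2, 8)$ near 0, since the mean of $\text{B}(\alpha, \beta)$ is $\alpha/(\alpha+\beta)$.

```python
import numpy as np

rng = np.random.default_rng(2)

# alpha > beta concentrates mass near 1; beta > alpha near 0.
near_one  = rng.beta(8.0, 2.0, size=100_000)   # mean alpha/(alpha+beta) = 0.8
near_zero = rng.beta(2.0, 8.0, size=100_000)   # mean = 0.2

print("Beta(8, 2) sample mean:", near_one.mean())
print("Beta(2, 8) sample mean:", near_zero.mean())
```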

One might object that the beta distribution only describes the probability of a single probability, that is, for example, that $P(\theta<0.25)=0.5$, which is a scalar number. But keep in mind that the beta distribution is describing dichotomous outcomes. So by applying Kolmogorov's second axiom, we also know that $P(\theta \ge 0.25)=0.5$ as well. Collecting these results in a vector gives us a vector of probabilities.

Extending the beta distribution to three or more categories gives us the Dirichlet distribution; indeed, the PDF of the Dirichlet for two categories is exactly the beta distribution.
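This equivalence is easy to verify numerically with SciPy: a two-category Dirichlet with parameters $(\alpha, \beta)$, evaluated at $(\theta, 1-\theta)$, gives the same density as $\text{B}(\alpha, \beta)$ at $\theta$. The parameter values below are arbitrary.

```python
import numpy as np
from scipy.stats import beta, dirichlet

# Arbitrary shape parameters, chosen just for illustration.
a, b = 3.0, 2.0
for theta in (0.2, 0.5, 0.8):
    # A two-category Dirichlet evaluated at (theta, 1 - theta)
    # should match Beta(a, b) evaluated at theta.
    d_pdf = dirichlet.pdf([theta, 1.0 - theta], [a, b])
    b_pdf = beta.pdf(theta, a, b)
    print(f"theta={theta}: Dirichlet pdf={d_pdf:.4f}, Beta pdf={b_pdf:.4f}")
```

The two columns of output agree at every point, confirming the two-category Dirichlet is the beta distribution.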
