Alpha in Dirichlet Distribution – Detailed Explanation and Applications

bayesian, dirichlet-distribution, distributions

I'm fairly new to Bayesian statistics, and I came across a corrected correlation measure, SparCC, that uses the Dirichlet distribution in the backend of its algorithm. I have been trying to go through the algorithm step by step to really understand what is happening, but I am not sure exactly what the alpha vector parameter does in a Dirichlet distribution, or how (if at all) it gets normalized.

The implementation is in Python using NumPy:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.dirichlet.html

The docs say:

alpha : array
Parameter of the distribution (k dimension for sample of
dimension k).

My questions:

  1. How do the alphas affect the distribution?

  2. How are the alphas being normalized?

  3. What happens when the alphas are not integers?

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Reproducibility
np.random.seed(0)

# Alphas must be strictly positive, so start at 1 rather than 0
alphas = np.arange(1, 10)
# array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# One draw from the Dirichlet distribution: a vector that sums to 1
dd = np.random.dirichlet(alphas)

# Plot the components of the draw against their index
ax = pd.Series(dd).plot()
ax.set_xlabel("component index")
ax.set_ylabel("Dirichlet draw")
plt.show()


Best Answer

The Dirichlet distribution is a multivariate probability distribution over $k\ge2$ variables $X_1,\dots,X_k$, such that each $x_i \in (0,1)$ and $\sum_{i=1}^k x_i = 1$, parameterized by a vector of positive-valued parameters $\boldsymbol{\alpha} = (\alpha_1,\dots,\alpha_k)$. The parameters do not have to be integers; they only need to be positive real numbers. They are not "normalized" in any way; they are simply the parameters of this distribution.

The Dirichlet distribution is a generalization of the beta distribution to multiple dimensions, so you can start by learning about the beta distribution. Beta is a univariate distribution of a random variable $X \in (0,1)$ parameterized by $\alpha$ and $\beta$. A nice intuition comes from recalling that it is a conjugate prior for the binomial distribution: if we assume a beta prior with parameters $\alpha$ and $\beta$ for the binomial distribution's probability parameter $p$, then the posterior distribution of $p$ is also a beta distribution, parameterized by $\alpha' = \alpha + \text{number of successes}$ and $\beta' = \beta + \text{number of failures}$. So you can think of $\alpha$ and $\beta$ as pseudocounts (which do not need to be integers) of successes and failures (check also this thread).
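As a quick sketch of this conjugate update (the prior pseudocounts and observed counts here are made up for illustration):

```python
import numpy as np
from scipy import stats

# Illustrative prior pseudocounts for the beta prior on p
alpha_prior, beta_prior = 2.0, 2.0

# Suppose we then observe 7 successes and 3 failures
successes, failures = 7, 3

# Conjugate update: simply add the observed counts to the pseudocounts
alpha_post = alpha_prior + successes
beta_post = beta_prior + failures

posterior = stats.beta(alpha_post, beta_post)
# Posterior mean of p is alpha' / (alpha' + beta') = 9 / 14
print(posterior.mean())
```

The same arithmetic works with non-integer pseudocounts, which is why the alphas need not be integers.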

In the same way, the Dirichlet distribution is a conjugate prior for the multinomial distribution. If in the case of the binomial distribution we can think in terms of drawing white and black balls with replacement from an urn, then in the case of the multinomial distribution we are drawing with replacement $N$ balls appearing in $k$ colors, where each color can be drawn with probability $p_1,\dots,p_k$. The Dirichlet distribution is a conjugate prior for the probabilities $p_1,\dots,p_k$, and the parameters $\alpha_1,\dots,\alpha_k$ can be thought of as pseudocounts of balls of each color assumed a priori (but you should also read about the pitfalls of such reasoning). In the Dirichlet-multinomial model, $\alpha_1,\dots,\alpha_k$ get updated by adding the observed counts in each category, $\alpha_1+n_1,\dots,\alpha_k+n_k$, in similar fashion to the beta-binomial model.
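A minimal sketch of that Dirichlet-multinomial update, with illustrative pseudocounts and observed counts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative prior pseudocounts for k = 3 categories
alpha = np.array([1.0, 1.0, 1.0])

# Suppose we observed these counts over 30 multinomial draws
counts = np.array([12, 3, 15])

# Conjugate update: elementwise sum of pseudocounts and observed counts
alpha_post = alpha + counts

# The posterior mean of each p_i is alpha_i' / sum(alpha')
print(alpha_post / alpha_post.sum())

# Draw plausible probability vectors from the posterior
samples = rng.dirichlet(alpha_post, size=5)
```

Each row of `samples` is a probability vector (positive entries summing to 1), i.e. one plausible setting of $p_1,\dots,p_k$ under the posterior.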

The higher the value of $\alpha_i$, the greater the "weight" of $X_i$ and the greater the share of the total "mass" assigned to it (recall that in total it must be $x_1+\dots+x_k=1$). If all the $\alpha_i$ are equal, the distribution is symmetric. If $\alpha_i < 1$, it acts as an anti-weight that pushes $x_i$ toward the extremes, while a high $\alpha_i$ attracts $x_i$ toward some central value (central in the sense that all points concentrate around it, not in the sense that it is symmetrically central). If $\alpha_1 = \dots = \alpha_k = 1$, the points are uniformly distributed.
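This pull toward the center (or toward the extremes) shows up directly in samples. A small sketch, using symmetric alphas so that the mean of each component stays at $1/3$ while only the concentration changes:

```python
import numpy as np

rng = np.random.default_rng(42)

# Same symmetric center (1/3, 1/3, 1/3) at every setting,
# but the spread around it shrinks as alpha grows
stds = []
for a in (0.2, 1.0, 10.0):
    draws = rng.dirichlet([a, a, a], size=10_000)
    stds.append(draws[:, 0].std())

# With alpha < 1 the draws pile up near the corners of the simplex;
# with large alpha they concentrate near the center
print(stds)  # spread decreases as alpha increases
```

So the alphas control both where the mass sits (via their relative sizes) and how tightly it concentrates (via their overall magnitude).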

This can be seen in the plots below, which show trivariate Dirichlet distributions (unfortunately we can produce reasonable plots only up to three dimensions), parameterized by (a) $\alpha_1 = \alpha_2 = \alpha_3 = 1$, (b) $\alpha_1 = \alpha_2 = \alpha_3 = 10$, (c) $\alpha_1 = 1, \alpha_2 = 10, \alpha_3 = 5$, (d) $\alpha_1 = \alpha_2 = \alpha_3 = 0.2$.

Four different samples from Dirichlet distributions: in (a) the values are scattered "uniformly" over the whole space; in (b) they cluster around the center; in (c) they cluster around one side ($\alpha_2$) and are slightly shifted toward another ($\alpha_3$); in (d) they drift away from the center, toward the borders.

The Dirichlet distribution is sometimes called a "distribution over distributions", since it can be thought of as a distribution of probabilities themselves. Notice that since each $x_i \in (0,1)$ and $\sum_{i=1}^k x_i = 1$, the $x_i$'s are consistent with the first and second axioms of probability. So you can use the Dirichlet distribution as a distribution over the probabilities of discrete events described by distributions such as the categorical or multinomial. It is not true that it is a distribution over arbitrary distributions: for example, it is not related to probabilities of continuous random variables, or even of some discrete ones (e.g. a Poisson-distributed random variable takes values over all the natural numbers, so to use a Dirichlet distribution over their probabilities you would need the number of categories $k$ to be infinite).
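To make the "distribution over distributions" idea concrete, here is a small sketch (the alphas and category labels are made up) in which a single Dirichlet draw is used directly as the probability vector of a categorical distribution:

```python
import numpy as np

rng = np.random.default_rng(7)

# One draw from a Dirichlet is itself a valid probability vector:
# positive entries that sum to 1
p = rng.dirichlet([2.0, 3.0, 5.0])

# ...so it can parameterize a categorical distribution over k = 3 events
events = rng.choice(["A", "B", "C"], size=1000, p=p)
```

Each new Dirichlet draw gives a different categorical distribution, which is exactly the sense in which the Dirichlet acts as a prior over the multinomial's parameters.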
