Alpha in Dirichlet Distribution – Detailed Explanation and Applications

bayesian, dirichlet-distribution, distributions

I'm fairly new to Bayesian statistics, and I came across a corrected correlation measure, SparCC, that uses the Dirichlet distribution in the backend of its algorithm. I have been trying to go through the algorithm step by step to really understand what is happening, but I am not sure exactly what the alpha vector parameter does in a Dirichlet distribution, or how (if at all) it gets normalized.

The implementation is in Python using NumPy:
https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.dirichlet.html

The docs say:

alpha : array
Parameter of the distribution (k dimension for sample of
dimension k).

My questions:

  1. How do the alphas affect the distribution?

  2. How are the alphas being normalized?

  3. What happens when the alphas are not integers?

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Reproducibility
np.random.seed(0)

# Alphas must be strictly positive, so start at 1 rather than 0
alphas = np.arange(1, 10)
# array([1, 2, 3, 4, 5, 6, 7, 8, 9])

# One draw from the Dirichlet distribution: a vector that sums to 1
dd = np.random.dirichlet(alphas)

# Plot the components of the draw against their index
ax = pd.Series(dd).plot()
ax.set_xlabel("component index")
ax.set_ylabel("Dirichlet draw")
plt.show()


Best Answer

The Dirichlet distribution is a multivariate probability distribution over $k\ge2$ variables $X_1,\dots,X_k$, such that each $x_i \in (0,1)$ and $\sum_{i=1}^k x_i = 1$, parameterized by a vector of positive-valued parameters $\boldsymbol{\alpha} = (\alpha_1,\dots,\alpha_k)$. The parameters do not have to be integers; they only need to be positive real numbers. They are not "normalized" in any way; they are simply the parameters of this distribution.

The Dirichlet distribution is a generalization of the beta distribution to multiple dimensions, so you can start by learning about the beta distribution. Beta is a univariate distribution of a random variable $X \in (0,1)$ parameterized by $\alpha$ and $\beta$. A nice intuition comes from recalling that it is a conjugate prior for the binomial distribution: if we assume a beta prior with parameters $\alpha$ and $\beta$ for the binomial distribution's probability parameter $p$, then the posterior distribution of $p$ is also a beta distribution, parameterized by $\alpha' = \alpha + \text{number of successes}$ and $\beta' = \beta + \text{number of failures}$. So you can think of $\alpha$ and $\beta$ as pseudocounts (which do not need to be integers) of successes and failures (check also this thread).
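As a quick sketch of this conjugate update (the prior pseudocounts and observed counts here are made up for illustration):

```python
import numpy as np
from scipy import stats

# Illustrative prior pseudocounts for the beta prior on p
alpha_prior, beta_prior = 2.0, 2.0

# Suppose we then observe 7 successes and 3 failures
successes, failures = 7, 3

# Conjugate update: simply add the observed counts to the pseudocounts
alpha_post = alpha_prior + successes
beta_post = beta_prior + failures

posterior = stats.beta(alpha_post, beta_post)
# Posterior mean of p is alpha' / (alpha' + beta') = 9 / 14
print(posterior.mean())
```

The same arithmetic works with non-integer pseudocounts, which is why the alphas need not be integers.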

In the same way, the Dirichlet distribution is a conjugate prior for the multinomial distribution. If in the case of the binomial distribution we can think in terms of drawing white and black balls with replacement from an urn, then in the case of the multinomial distribution we are drawing with replacement $N$ balls appearing in $k$ colors, where each color can be drawn with probability $p_1,\dots,p_k$. The Dirichlet distribution is a conjugate prior for the probabilities $p_1,\dots,p_k$, and the parameters $\alpha_1,\dots,\alpha_k$ can be thought of as pseudocounts of balls of each color assumed a priori (but you should also read about the pitfalls of such reasoning). In the Dirichlet-multinomial model, $\alpha_1,\dots,\alpha_k$ get updated by adding the observed counts in each category, $\alpha_1+n_1,\dots,\alpha_k+n_k$, in similar fashion to the beta-binomial model.
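A minimal sketch of that Dirichlet-multinomial update, with illustrative pseudocounts and observed counts:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative prior pseudocounts for k = 3 categories
alpha = np.array([1.0, 1.0, 1.0])

# Suppose we observed these counts over 30 multinomial draws
counts = np.array([12, 3, 15])

# Conjugate update: elementwise sum of pseudocounts and observed counts
alpha_post = alpha + counts

# The posterior mean of each p_i is alpha_i' / sum(alpha')
print(alpha_post / alpha_post.sum())

# Draw plausible probability vectors from the posterior
samples = rng.dirichlet(alpha_post, size=5)
```

Each row of `samples` is a probability vector (positive entries summing to 1), i.e. one plausible setting of $p_1,\dots,p_k$ under the posterior.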

The higher the value of $\alpha_i$, the greater the "weight" of $X_i$ and the greater the share of the total "mass" assigned to it (recall that in total it must be $x_1+\dots+x_k=1$). If all the $\alpha_i$ are equal, the distribution is symmetric. If $\alpha_i < 1$, it acts as an anti-weight that pushes $x_i$ toward the extremes, while a high $\alpha_i$ attracts $x_i$ toward some central value (central in the sense that all points concentrate around it, not in the sense that it is symmetrically central). If $\alpha_1 = \dots = \alpha_k = 1$, the points are uniformly distributed.
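This pull toward the center (or toward the extremes) shows up directly in samples. A small sketch, using symmetric alphas so that the mean of each component stays at $1/3$ while only the concentration changes:

```python
import numpy as np

rng = np.random.default_rng(42)

# Same symmetric center (1/3, 1/3, 1/3) at every setting,
# but the spread around it shrinks as alpha grows
stds = []
for a in (0.2, 1.0, 10.0):
    draws = rng.dirichlet([a, a, a], size=10_000)
    stds.append(draws[:, 0].std())

# With alpha < 1 the draws pile up near the corners of the simplex;
# with large alpha they concentrate near the center
print(stds)  # spread decreases as alpha increases
```

So the alphas control both where the mass sits (via their relative sizes) and how tightly it concentrates (via their overall magnitude).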

This can be seen in the plots below, which show trivariate Dirichlet distributions (unfortunately we can produce reasonable plots only up to three dimensions), parameterized by (a) $\alpha_1 = \alpha_2 = \alpha_3 = 1$, (b) $\alpha_1 = \alpha_2 = \alpha_3 = 10$, (c) $\alpha_1 = 1, \alpha_2 = 10, \alpha_3 = 5$, (d) $\alpha_1 = \alpha_2 = \alpha_3 = 0.2$.

Four different samples from Dirichlet distributions: in (a) the values are scattered "uniformly" over the whole space; in (b) they cluster around the center; in (c) they cluster around one side ($\alpha_2$) and are slightly shifted toward another ($\alpha_3$); in (d) they drift away from the center, toward the borders.

The Dirichlet distribution is sometimes called a "distribution over distributions", since it can be thought of as a distribution of probabilities themselves. Notice that since each $x_i \in (0,1)$ and $\sum_{i=1}^k x_i = 1$, the $x_i$'s are consistent with the first and second axioms of probability. So you can use the Dirichlet distribution as a distribution over the probabilities of discrete events described by distributions such as the categorical or multinomial. It is not true that it is a distribution over arbitrary distributions: for example, it is not related to probabilities of continuous random variables, or even of some discrete ones (e.g. a Poisson-distributed random variable takes values over all the natural numbers, so to use a Dirichlet distribution over their probabilities you would need the number of categories $k$ to be infinite).
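To make the "distribution over distributions" idea concrete, here is a small sketch (the alphas and category labels are made up) in which a single Dirichlet draw is used directly as the probability vector of a categorical distribution:

```python
import numpy as np

rng = np.random.default_rng(7)

# One draw from a Dirichlet is itself a valid probability vector:
# positive entries that sum to 1
p = rng.dirichlet([2.0, 3.0, 5.0])

# ...so it can parameterize a categorical distribution over k = 3 events
events = rng.choice(["A", "B", "C"], size=1000, p=p)
```

Each new Dirichlet draw gives a different categorical distribution, which is exactly the sense in which the Dirichlet acts as a prior over the multinomial's parameters.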
