Probability – What Happens When Merging Random Variables in Dirichlet Distribution?

dirichlet distribution, distributions, probability

Imagine that

$$ X_1,\dots,X_k \sim \mathrm{Dirichlet}(\alpha_1,\dots,\alpha_k) $$

Since $x_i \in (0,1)$ for all $x_i$ and $\sum_{i=1}^k x_i = 1$, then $x_i$'s follow the first two axioms of probability and Dirichlet can be (and is) used as "distribution over distributions". Intuitively it should follow that

$$ X_1,\dots,X_{k-2},X_{k-1}+X_k \sim \mathrm{Dirichlet}(\alpha_1,\dots,\alpha_{k-2}, \alpha_{k-1}+\alpha_k) $$

since the properties of $x_i$'s would not change and the total "mass" of $\alpha_i$'s would not change.

But its probability density function is

$$ f(x_1,\dots,x_k) \propto \prod_{i=1}^k x_i^{\alpha_i - 1}$$

and

$$
x_{k-1}^{\alpha_{k-1} - 1} \times x_k^{\alpha_k - 1} \ne
(x_{k-1} + x_k)^{\alpha_{k-1} + \alpha_k - 1}
$$

So merging random variables in a Dirichlet distribution does not seem to lead to a Dirichlet distribution over $k-1$ variables. What does it lead to?

Best Answer

It is a Dirichlet distribution having the expected parameters.

To see this, note that the vector-valued random variable $\mathbf{X}=(X_1, X_2, \ldots, X_k)$ has the same distribution as the variable

$$\frac{1}{\sum_i^k Y_i}\left(Y_1, Y_2, \ldots, Y_k\right)$$

where $Y_i \sim \Gamma(\alpha_i)$ are independently Gamma distributed. Write $Y_i^\prime=Y_i$ for $i=1, 2, \ldots, k-2$ and $Y_{k-1}^\prime = Y_{k-1}+Y_k$. The sum of all the $Y_i$ equals the sum of all the $Y_i^\prime$ and the distribution of $Y_{k-1}^\prime=Y_{k-1}+Y_k$ is $\Gamma(\alpha_{k-1}+ \alpha_k)$. Thus

$$X_{k-1} + X_k = \frac{1}{\sum_i^k Y_i} Y_{k-1} + \frac{1}{\sum_i^k Y_i} Y_{k} = \frac{1}{\sum_i^{k-1} Y_i^\prime} Y_{k-1}^\prime$$

and, for $i < k-1$,

$$X_i = \frac{1}{\sum_{j=1}^k Y_j} Y_i = \frac{1}{\sum_{j=1}^{k-1} Y_j^\prime} Y_i^\prime.$$

Therefore $\mathbf{X}^\prime=(X_1, X_2, \ldots, X_{k-2}, X_{k-1}+X_k)$ has the same distribution as

$$\frac{1}{\sum_i^{k-1} Y_i^\prime}\left(Y_1^\prime, Y_2^\prime, \ldots, Y_{k-1}^\prime\right).$$

This demonstrates that $\mathbf{X}^\prime$ has a Dirichlet$(\alpha_1, \alpha_2, \ldots, \alpha_{k-2}, \alpha_{k-1}+\alpha_k)$ distribution, QED.
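The Gamma construction above lends itself to a quick simulation check. The following is a minimal sketch, not a proof: it draws Dirichlet vectors by normalizing independent Gamma variates, merges the last two components, and compares the sample mean and variance of the merged component with the values a Dirichlet$(\alpha_1,\ldots,\alpha_{k-2},\alpha_{k-1}+\alpha_k)$ distribution would give. The function names, the parameter values, and the number of draws are illustrative assumptions, and only marginal moments are checked, not the full joint distribution.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw one Dirichlet(alphas) vector by normalizing independent Gamma(alpha_i, 1) draws."""
    ys = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(ys)
    return [y / total for y in ys]

def merged_last_two(alphas, n_draws=50_000, seed=0):
    """Sample mean and variance of X_{k-1} + X_k under Dirichlet(alphas)."""
    rng = random.Random(seed)
    sums = []
    for _ in range(n_draws):
        x = sample_dirichlet(alphas, rng)
        sums.append(x[-2] + x[-1])
    mean = sum(sums) / n_draws
    var = sum((s - mean) ** 2 for s in sums) / (n_draws - 1)
    return mean, var

def dirichlet_marginal_moments(a, a0):
    """Mean and variance of a Dirichlet component with parameter a when all parameters sum to a0."""
    return a / a0, a * (a0 - a) / (a0 ** 2 * (a0 + 1))

alphas = [2.0, 3.0, 1.5, 2.5]  # illustrative parameters, not from the question
emp_mean, emp_var = merged_last_two(alphas)
th_mean, th_var = dirichlet_marginal_moments(alphas[-2] + alphas[-1], sum(alphas))
print("mean:", emp_mean, "vs", th_mean)
print("variance:", emp_var, "vs", th_var)
```

The empirical and theoretical values should agree to within Monte Carlo error, which is what the merged-parameter claim predicts.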

The fault in the argument in the question lies in confusing the arithmetic sum of values $x_{k-1}+x_k$ with the sum of random variables $X_{k-1}+X_k$. The latter is performed with a convolution, of course.
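To make the role of the convolution concrete, here is a sketch of the density calculation, using the shorthand $a=\alpha_{k-1}$, $b=\alpha_k$, and $s = 1 - x_1 - \cdots - x_{k-2}$ (notation introduced here only for illustration). Given $x_1,\ldots,x_{k-2}$, the merged value $x_{k-1}+x_k$ is fixed at $s$, so the density of the merged vector is found by integrating the original density over every way of splitting $s$ into $x_{k-1}$ and $x_k = s - x_{k-1}$. Substituting $x_{k-1} = st$,

$$\int_0^s x_{k-1}^{a-1}\,(s-x_{k-1})^{b-1}\,dx_{k-1} = s^{a+b-1}\int_0^1 t^{a-1}(1-t)^{b-1}\,dt = s^{a+b-1}\,B(a,b).$$

The factor $s^{\alpha_{k-1}+\alpha_k-1}$ is exactly the kernel the merged component needs, and the constant $B(a,b)$ is absorbed into the normalization. The pointwise product in the question never had to equal a power of the sum; only its integral over the splits does.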

Related Solutions

Solved – Construction of Dirichlet distribution with Gamma distribution

Background

Over a century ago, mathematicians developed the theory of differential algebra to work with the "higher order derivatives" that occur in multi-dimensional geometry. The determinant is a special case of the basic objects manipulated by such algebras, which typically are alternating multilinear forms. The beauty of this lies in how simple the calculations can become.

Here's all you need to know.

1. A differential is an expression of the form "$dx_i$". It is the concatenation of "$d$" with any variable name.
2. A one-form is a linear combination of differentials, such as $dx_1+dx_2$ or even $x_2 dx_1 - \exp(x_2) dx_2$. That is, the coefficients are functions of the variables.
3. Forms can be "multiplied" using a wedge product, written $\wedge$. This product is anti-commutative (also called alternating): for any two one-forms $\omega$ and $\eta$,

$$\omega \wedge \eta = -\eta \wedge \omega.$$

This multiplication is linear and associative: in other words, it works in the familiar fashion. An immediate consequence is that $\omega \wedge \omega = -\omega \wedge \omega$, implying the square of any one-form is always zero. That makes multiplication extremely easy!

4. For the purposes of manipulating the integrands that appear in probability calculations, an expression like $dx_1 dx_2 \cdots dx_{k+1}$ can be understood as $|dx_1\wedge dx_2 \wedge \cdots \wedge dx_{k+1}|$.
5. When $y = g(x_1, \ldots, x_n)$ is a function, its differential is given by differentiation:

$$dy = dg(x_1, \ldots, x_n) = \frac{\partial g}{\partial x_1}(x_1, \ldots, x_n) dx_1 + \cdots + \frac{\partial g}{\partial x_n}(x_1, \ldots, x_n) dx_n.$$

The connection with Jacobians is this: the Jacobian of a transformation $(y_1, \ldots, y_n) = F(x_1, \ldots, x_n) = (f_1(x_1, \ldots, x_n), \ldots, f_n(x_1, \ldots, x_n))$ is, up to sign, simply the coefficient of $dx_1\wedge \dots \wedge dx_n$ that appears in computing

$$dy_1 \wedge \cdots \wedge dy_n = df_1(x_1,\ldots, x_n)\wedge \cdots \wedge df_n(x_1, \ldots, x_n)$$

after expanding each of the $df_i$ as a linear combination of the $dx_j$ in rule (5).

Example

The simplicity of this definition of a Jacobian is appealing. Not yet convinced it's worthwhile? Consider the well-known problem of converting two-dimensional integrals from Cartesian coordinates $(x, y)$ to polar coordinates $(r,\theta)$, where $(x,y) = (r\cos(\theta), r\sin(\theta))$. The following is an utterly mechanical application of the preceding rules, where "$(*)$" is used to abbreviate expressions that will obviously disappear by virtue of rule (3), which implies $dr\wedge dr = d\theta\wedge d\theta = 0$.

$$\eqalign{ dx dy &= |dx\wedge dy| = |d(r\cos(\theta)) \wedge d(r\sin(\theta))| \\ &= |(\cos(\theta)dr - r\sin(\theta)d\theta) \wedge (\sin(\theta)dr + r\cos(\theta)d\theta)| \\ &= |(*)dr\wedge dr + (*) d\theta\wedge d\theta - r\sin(\theta)d\theta\wedge \sin(\theta)dr + \cos(\theta)dr \wedge r\cos(\theta) d\theta| \\ &= |0 + 0 + r\sin^2(\theta) dr\wedge d\theta + r\cos^2(\theta) dr\wedge d\theta| \\ &= |r(\sin^2(\theta) + \cos^2(\theta)) dr\wedge d\theta| \\ &= r\ dr d\theta. }$$

The point of this is the ease with which such calculations can be performed, without messing about with matrices, determinants, or other such multi-indicial objects. You just multiply things out, remembering that wedges are anti-commutative. It's easier than what is taught in high school algebra.
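The same multiplication can be mechanized. The sketch below, under the assumption that a one-form in two variables is stored as a pair of coefficients (of $dr$ and of $d\theta$), computes the coefficient of $dr\wedge d\theta$ in a wedge product and applies it to the polar-coordinate example. The names `wedge` and `polar_area_factor` are hypothetical helpers made up for this illustration.

```python
import math

def wedge(omega, eta):
    """Coefficient of dr^dtheta in omega ^ eta.

    Each one-form is a pair (coefficient of dr, coefficient of dtheta).
    The dr^dr and dtheta^dtheta terms vanish, and dtheta^dr = -dr^dtheta.
    """
    a, b = omega
    c, d = eta
    return a * d - b * c

def polar_area_factor(r, theta):
    """|dx ^ dy| expressed as a multiple of dr dtheta, for x = r cos(theta), y = r sin(theta)."""
    dx = (math.cos(theta), -r * math.sin(theta))  # dx = cos(theta) dr - r sin(theta) dtheta
    dy = (math.sin(theta), r * math.cos(theta))   # dy = sin(theta) dr + r cos(theta) dtheta
    return abs(wedge(dx, dy))

print(polar_area_factor(2.0, 0.7))  # equals r = 2.0 up to floating-point rounding
```

For two variables the wedge coefficient is just a $2\times 2$ determinant, which is why the result agrees with the usual Jacobian.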

Preliminaries

Let's see this differential algebra in action. In this problem, the $X_i$ are independent with $X_i \sim \Gamma(\alpha_i)$ and $Y_i = X_i/(X_1+\cdots+X_{k+1})$, so the PDF of the joint distribution of $(X_1, X_2, \ldots, X_{k+1})$ is the product of the individual PDFs. In order to handle the change to the variables $Y_i$ we must be explicit about the differential elements that will be integrated. These form the term $dx_1 dx_2 \cdots dx_{k+1}$. Including the PDF gives the probability element

$$\eqalign{ f_\mathbf{X}(\mathbf{x},\mathbf{\alpha})dx_1 \cdots dx_{k+1} &\propto \left(x_1^{\alpha_1-1}\exp\left(-x_1\right)\right)\cdots \left(x_{k+1}^{\alpha_{k+1}-1}\exp\left(-x_{k+1}\right) \right)dx_1 \cdots dx_{k+1} \\ &= x_1^{\alpha_1-1}\cdots x_{k+1}^{\alpha_{k+1}-1}\exp\left(-\left(x_1+\cdots+x_{k+1}\right)\right)dx_1 \cdots dx_{k+1}. }$$

(The normalizing constant has been ignored; it will be recovered at the end.)

Staring at the definitions of the $Y_i$ a few seconds ought to reveal the utility of introducing the new variable

$$Z = X_1 + X_2 + \cdots + X_{k+1},$$

giving the relationships

$$X_i = Y_i Z.$$

This suggests making the change of variables $x_i \to y_i z$ in the probability element. The intention is to retain the first $k$ variables $y_1, \ldots, y_k$ along with $z$ and then integrate out $z$. To do so, we have to re-express all the $dx_i$ in terms of the new variables. This is the heart of the problem. It's where the differential algebra takes place. To begin with,

$$dx_i = d(y_i z) = y_i dz + z dy_i.$$

Note that since $Y_1+Y_2+\cdots+Y_{k+1}=1$, then

$$0 = d(1) = d(y_1 + y_2 + \cdots + y_{k+1}) = dy_1 + dy_2 + \cdots + dy_{k+1}.$$

Consider the one-form

$$\omega = dx_1 + \cdots + dx_k = z(dy_1 + \cdots + dy_k) + (y_1+\cdots + y_k) dz.$$

It appears in the differential of the last variable:

$$\eqalign{ dx_{k+1} &= z dy_{k+1} + y_{k+1}dz \\ &= -z(dy_1 + \cdots + dy_k) + (1-y_1-\cdots -y_k)dz \\ &= dz - \omega. }$$

The value of this lies in the observation that

$$dx_1 \wedge \cdots \wedge dx_k \wedge \omega = 0$$

because, when you expand this product, there is one term containing $dx_1 \wedge dx_1 = 0$ as a factor, another containing $dx_2 \wedge dx_2 = 0$, and so on: they all disappear. Consequently,

$$\eqalign{ dx_1 \wedge \cdots \wedge dx_k \wedge dx_{k+1} &= dx_1 \wedge \cdots \wedge dx_k \wedge dz - dx_1 \wedge \cdots \wedge dx_k \wedge \omega \\ &= dx_1 \wedge \cdots \wedge dx_k \wedge dz. }$$

Whence (because all products $dz\wedge dz$ disappear),

$$\eqalign{ dx_1 \wedge \cdots \wedge dx_{k+1} &= (z dy_1 + y_1 dz) \wedge \cdots \wedge (z dy_k + y_k dz) \wedge dz \\ &= z^k dy_1 \wedge \cdots \wedge dy_k \wedge dz. }$$

The Jacobian is simply $|z^k| = z^k$, the coefficient of the differential product on the right hand side.
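For readers who prefer to see the determinant, the following sketch checks the result $z^k$ numerically. It builds the matrix of partial derivatives of $x_i = y_i z$ (for $i \le k$) and $x_{k+1} = z(1-y_1-\cdots-y_k)$ with respect to $(y_1,\ldots,y_k,z)$ and evaluates its determinant by Gaussian elimination. The helper names and the test point are assumptions chosen for illustration.

```python
def jacobian_matrix(ys, z):
    """Partial derivatives of (x_1, ..., x_{k+1}) with respect to (y_1, ..., y_k, z)."""
    k = len(ys)
    rows = []
    for i in range(k):
        # x_i = y_i z: derivative z with respect to y_i, and y_i with respect to z
        rows.append([z if j == i else 0.0 for j in range(k)] + [ys[i]])
    # x_{k+1} = z (1 - y_1 - ... - y_k)
    rows.append([-z] * k + [1.0 - sum(ys)])
    return rows

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n = len(m)
    det = 1.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        if m[pivot][c] == 0.0:
            return 0.0
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            det = -det  # a row swap flips the sign
        det *= m[c][c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for j in range(c, n):
                m[r][j] -= factor * m[c][j]
    return det

ys, z = [0.2, 0.3, 0.1], 1.7  # illustrative point with k = 3
print(determinant(jacobian_matrix(ys, z)), z ** len(ys))
```

The two printed numbers should agree up to rounding, matching the coefficient found with the wedge product.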

Solution

The transformation $(x_1, \ldots, x_k, x_{k+1})\to (y_1, \ldots, y_k, z)$ is one-to-one: its inverse is given by $x_i = y_i z$ for $1\le i\le k$ and $x_{k+1} = z(1-y_1-\cdots-y_k)$. Therefore we don't have to fuss any more about the new probability element; it simply is

$$\eqalign{ &(z y_1)^{\alpha_1-1}\cdots (z y_k)^{\alpha_k-1}\left(z(1-y_1-\cdots-y_k)\right)^{\alpha_{k+1}-1}\exp\left(-z\right)|z^k dy_1 \wedge \cdots \wedge dy_k \wedge dz| \\ &= \left(z^{\alpha_1+\cdots+\alpha_{k+1}-1}\exp\left(-z\right) dz\right)\left( y_1^{\alpha_1-1} \cdots y_k^{\alpha_k-1}\left(1-y_1-\cdots-y_k\right)^{\alpha_{k+1}-1}dy_1 \cdots dy_k\right). }$$

That is manifestly a product of a Gamma$(\alpha_1+\cdots+\alpha_{k+1})$ distribution (for $Z$) and a Dirichlet$(\mathbf\alpha)$ distribution (for $(Y_1,\ldots, Y_k)$). In fact, since the original normalizing constant must have been $1/\left(\Gamma(\alpha_1)\cdots\Gamma(\alpha_{k+1})\right)$ and the Gamma factor for $Z$ requires the constant $1/\Gamma(\alpha_1+\cdots+\alpha_{k+1})$, we deduce immediately that the constant left for the Dirichlet factor is their quotient, enabling the PDF to be written

$$f_\mathbf{Y}(\mathbf{y},\mathbf{\alpha}) = \frac{\Gamma(\alpha_1+\cdots+\alpha_{k+1})}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_{k+1})}\left( y_1^{\alpha_1-1} \cdots y_k^{\alpha_k-1}\left(1-y_1-\cdots-y_k\right)^{\alpha_{k+1}-1}\right).$$
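As a sanity check on the normalizing constant, the sketch below evaluates this density with `math.lgamma` and integrates it numerically in the simplest case of two components ($k=1$), where it reduces to a Beta density. The function names, the midpoint rule, and the parameter values are assumptions made for this illustration.

```python
import math

def dirichlet_pdf(ys, alphas):
    """Density at (y_1, ..., y_k), with y_{k+1} = 1 - sum(ys); alphas has k + 1 entries."""
    last = 1.0 - sum(ys)
    if last <= 0.0 or any(y <= 0.0 for y in ys):
        return 0.0  # outside the open simplex
    log_norm = math.lgamma(sum(alphas)) - sum(math.lgamma(a) for a in alphas)
    log_kernel = sum((a - 1.0) * math.log(y) for a, y in zip(alphas, list(ys) + [last]))
    return math.exp(log_norm + log_kernel)

def integrate_two_component(alphas, n=20_000):
    """Midpoint-rule integral of the density over y_1 in (0, 1) when there are two components."""
    h = 1.0 / n
    return sum(dirichlet_pdf([(i + 0.5) * h], alphas) for i in range(n)) * h

print(integrate_two_component([2.0, 3.0]))  # should be close to 1
```

Working on the log scale avoids overflow in the Gamma functions when the parameters are large.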

Solved – Deriving the MAP estimate for Multinomial-Dirichlet

I see two mistakes in your steps.

First of all, taking the derivative with respect to a single $\theta_m$ cancels out all the other terms where $i \ne m$ in the sum. That is, $$ \frac{\partial}{\partial \theta_m} \sum_i (x_i + \alpha_i - 1) \log(\theta_i) = \frac{x_m + \alpha_m - 1}{\theta_m}, $$ as opposed to what you said. You now have $k$ equations to solve, one for each $m.$

Secondly, it's not enough to simply maximize the expression $\sum_i (x_i + \alpha_i - 1) \log(\theta_i).$ The solution is obviously $\theta_i \rightarrow \infty,$ because $\log$ monotonically increases. As you well know, we have the constraint that $\sum_i \theta_i = 1.$ You have to encode that constraint into your optimization via a Lagrange multiplier so that the correct solution isn't obviously infinity. That is, we must find the stationary point of

$$ L(\theta, \lambda) = \sum_{i=1}^k (x_i + \alpha_i - 1) \log(\theta_i) - \lambda \left[ \sum_{i=1}^k \theta_i - 1 \right]. $$

Taking the derivative of the above expression with respect to each $\theta_i$ and with respect to the Lagrange multiplier $\lambda,$ and setting them to zero, you have to simultaneously solve

$$ \frac{x_m + \alpha_m - 1}{\theta_m} = \lambda $$ for each $m,$ and $$ \sum_i \theta_i = 1. $$

That's not very hard to do. Rearrange the first equation above to get an expression of $\theta_m$ in terms of $\lambda.$ Then, you can sensibly find the value of $\lambda$ that correctly imposes the constraint.
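Carrying those two steps out gives $\theta_m = (x_m+\alpha_m-1)/\lambda$ and, after summing over $m$ and using the constraint, $\lambda = \sum_i (x_i+\alpha_i-1)$. A minimal sketch of the resulting estimate follows. The function name is hypothetical, and the code assumes every $x_m+\alpha_m-1$ is strictly positive so that the stationary point lies inside the simplex; otherwise the maximum sits on the boundary and this formula does not apply.

```python
def dirichlet_multinomial_map(counts, alphas):
    """MAP estimate of multinomial probabilities under a Dirichlet(alphas) prior.

    Assumes counts[m] + alphas[m] - 1 > 0 for every m (interior stationary point).
    """
    numerators = [x + a - 1.0 for x, a in zip(counts, alphas)]
    if any(n <= 0.0 for n in numerators):
        raise ValueError("stationary point is not in the interior of the simplex")
    lam = sum(numerators)  # the Lagrange multiplier that enforces sum(theta) = 1
    return [n / lam for n in numerators]

theta = dirichlet_multinomial_map([3, 1, 0], [2.0, 2.0, 2.0])  # illustrative data
print(theta)  # components are 4/7, 2/7, 1/7
```

With all $\alpha_i = 1$ the prior is flat and the estimate reduces to the relative frequencies $x_m/\sum_i x_i$.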
