A multinomial distribution can be given as
$ M(m_1,\dots,m_K|N,P) = {N \choose m_1\dots m_K}\prod_k p_k^{m_k} $
The expected value of $m_k$ is $Np_k$.
How can I prove it?
Suppose you roll a 6-sided die $N$ times.
The outcome of roll $i$, $i=1,\ldots,N$, is represented by the random variable $X_i$. The tuple $\mathbf{X}=\left(X_1,\ldots,X_N\right)$ contains the outcome of each roll.
We can obtain category-level count information from $\mathbf{X}$ by taking $N_j=\sum_{i=1}^{N}\delta\left(X_i=j\right)$, $j=1,\ldots,6$. The tuple $\mathbf{N}=\left(N_1,\ldots,N_6\right)$ contains the counts for each category.
What's the difference between having $\mathbf{X}$ and $\mathbf{N}$? They both arise from $N$ trials of a multinomial distribution with six possible outcomes, each with equal probability of occurring. However, when we discuss probability with respect to $\mathbf{X}$ we are talking about the probability of a specific sequence of outcomes, while with respect to $\mathbf{N}$ we are talking about the probability of a specific set of counts. The counts need the multinomial coefficient as a normalizing factor, because many sequences can produce the same counts; the corresponding factor for the trial-level information is just $1$, because there is only one way to get any specific sequence of outcomes.
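As a minimal R sketch of this distinction (my illustration, not part of the original argument), assume a fair die with $p_j = 1/6$:

set.seed(1)
N <- 10
X <- sample(1:6, N, replace = TRUE)   # trial-level outcomes X_1, ..., X_N
Nj <- tabulate(X, nbins = 6)          # category-level counts N_1, ..., N_6

prob_sequence <- (1/6)^N                          # probability of this exact sequence
prob_counts <- dmultinom(Nj, prob = rep(1/6, 6))  # probability of these counts

# the ratio recovers the multinomial coefficient: the number of
# distinct sequences that produce this same set of counts
prob_counts / prob_sequence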
EDIT The second section of the paper actually discusses when to use counts and when to use samples.
If I understand you correctly, you want to sample values $x_1,\dots,x_k$ from a multinomial distribution with probabilities $p_1,\dots,p_k$ such that $\sum_i x_i = n$; however, you want the distribution truncated so that $a_i \le x_i \le b_i$ for all $i$.
I see three solutions (none as elegant as in the non-truncated case). The first is rejection sampling: draw from the untruncated multinomial and keep only the draws that fall within the truncation bounds.
# rejection sampling: draw R untruncated samples, keep only the valid rows
rtrmnomReject <- function(R, n, p, a, b) {
  x <- t(rmultinom(R, n, p))
  x[apply(a <= x & x <= b, 1, all) & rowSums(x) == n, ]
}
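A quick usage sketch (the parameter values here are made up, chosen so that most draws survive the truncation):

# rows violating the bounds are discarded, so fewer than R rows come back
x <- rtrmnomReject(1e4, n = 500, p = rep(1/5, 5), a = 80, b = 120)
nrow(x)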
The second is a direct, sequential sampler that draws one case at a time, never exceeding the upper bounds, and rejects only when a lower bound is missed:

# single draw from truncated multinomial with a, b truncation points
rtrmnomDirect <- function(n, p, a, b) {
  k <- length(p)
  repeat {
    pp <- p          # reset pp
    x <- numeric(k)  # reset x
    repeat {
      if (sum(x<b) == 1) {           # if only a single category is left
        x[x<b] <- x[x<b] + n-sum(x)  # fill this category with remainder
        break
      }
      i <- sample.int(k, 1, prob = pp)  # sample x[i]
      x[i] <- x[i] + 1
      if (x[i] == b[i]) pp[i] <- 0      # if x[i] is filled do
                                        # not sample from it
      if (sum(x) == n) break            # if we picked n, stop
    }
    if (all(x >= a)) break  # if all x >= a the sample is valid
                            # otherwise reject
  }
  return(x)
}
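A small sanity check one could run (with hypothetical parameters): each direct draw should respect the bounds and sum to $n$.

set.seed(42)
x <- t(replicate(5, rtrmnomDirect(500, rep(1/5, 5), rep(80, 5), rep(120, 5))))
stopifnot(all(x >= 80), all(x <= 120), all(rowSums(x) == 500))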
The third is the Metropolis algorithm: the proposal takes from one randomly chosen category some number of cases (at most step) and moves them to another category.

# draw R values
# 'step' parameter defines magnitude of jumps
# for the Metropolis algorithm
# 'init' is a vector of values to start with
rtrmnomMetrop <- function(R, n, p, a, b,
                          step = 1,
                          init = rtrmnomDirect(n, p, a, b)) {
  k <- length(p)
  if (length(a)==1) a <- rep(a, k)  # recycle scalar bounds
  if (length(b)==1) b <- rep(b, k)
  # approximate target log-density (up to a normalizing constant)
  lp <- log(p)
  lf <- function(x) {
    if (any(x < a) || any(x > b) || sum(x) != n)
      return(-Inf)
    sum(lp*x - lfactorial(x))
  }
  step <- max(2, step+1)
  # proposal function: move u cases from one category to another,
  # so the total sum n is preserved
  q <- function(x) {
    idx <- sample.int(k, 2)
    u <- sample.int(step, 1)-1
    x[idx] <- x[idx] + c(-u, u)
    x
  }
  tmp <- init
  x <- matrix(nrow = R, ncol = k)
  ar <- 0  # acceptance counter
  for (i in 1:R) {
    proposal <- q(tmp)
    prob <- exp(lf(proposal) - lf(tmp))
    if (runif(1) < prob) {  # accept with Metropolis probability
      tmp <- proposal
      ar <- ar + 1
    }
    x[i,] <- tmp
  }
  structure(x, acceptance.rate = ar/R, step = step-1)
}
The algorithm starts at the initial value $X_1$ and then wanders around different regions of the distribution. It is obviously faster than the previous ones, but you need to remember that if you use it to sample a small number of cases, you could end up with draws that are close to each other. Another problem is that you need to decide on the step size, i.e. how big the algorithm's jumps should be: too small and it moves slowly, too big and it makes too many invalid proposals that get rejected. You can see an example of its usage below. On the plots you can see: marginal densities in the first row, traceplots in the second row, and plots showing subsequent jumps for pairs of variables in the remaining rows.
n <- 500
a <- 50
b <- 125
p <- c(1,5,2,4,3)/15
k <- length(p)
x <- rtrmnomMetrop(1e4, n, p, a, b, step = 15)
cmb <- combn(1:k, 2)

par.def <- par(mfrow=c(4,5), mar = c(2,2,2,2))
for (i in 1:k)
  hist(x[,i], main = paste0("X",i))
for (i in 1:k)
  plot(x[,i], main = paste0("X",i), type = "l", col = "lightblue")
for (i in 1:ncol(cmb))
  plot(jitter(x[,cmb[1,i]]), jitter(x[,cmb[2,i]]),
       type = "l", main = paste(paste0("X", cmb[,i]), collapse = ":"),
       col = "gray")
par(par.def)
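Since rtrmnomMetrop stores the realized acceptance rate as an attribute, you can inspect it when tuning step:

attr(x, "acceptance.rate")  # very low: shrink step; near 1: consider enlarging it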
The problem is that this distribution describes a very inefficient sampling strategy in general. Imagine that $p_1 \ne \dots \ne p_k$ while $a_1 = \dots = a_k$ and $b_1 = \dots = b_k$, with the $a_i$'s close to the $b_i$'s: in such a case you sample the categories with different probabilities, yet expect similar frequencies in the end. In the extreme case, imagine a two-category distribution where $p_1 \gg p_2$ but $a_1 \ll a_2$ and $b_1 \ll b_2$: here you are waiting for a very rare event to happen (a real-life example of such a distribution would be a researcher who repeats sampling until he finds a sample consistent with his hypothesis, so it has more to do with cheating than with random sampling).
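To put a rough number on that extreme case (values made up for the sketch): with $n = 100$ and $p_1 = 0.9$, the marginal distribution of $x_1$ is Binomial(100, 0.9); if the bounds force $x_1 \le 40$, the probability that an untruncated draw is valid is

pbinom(40, size = 100, prob = 0.9)  # on the order of 1e-33

so rejection-style approaches would essentially never terminate.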
The distribution is much less problematic if you define it as in Rukhin (2007, 2008), where you sample $np_i$ cases for each category, i.e. sample proportionally to the $p_i$'s.
Rukhin, A. L. (2007). Normal order statistics and sums of geometric random variables in treatment allocation problems. Statistics & Probability Letters, 77(12), 1312-1321.
Rukhin, A. L. (2008). Stopping rules in balanced allocation problems: Exact and asymptotic distributions. Sequential Analysis, 27(3), 277-292.
Best Answer
A demonstration using "equations" was requested in a comment. Here is a short, simple one that is practically painless.
Notation and definitions
Let the random $K$-vector $X$ have a multinomial distribution with parameters $\mathbb p = (p_1, p_2, \ldots, p_K)$. This means that $p_1 + p_2 + \cdots + p_K=1$, $0 \le p_i$ for $i=1, 2, \ldots, K$, and the probability that $X = (m_1, m_2, \ldots, m_K) = \mathbb m$ is given by
$$\Pr(X=\mathbb m) =\binom{N}{\mathbb m}\mathbb p^\mathbb m$$
In this shorthand notation $\binom{N}{\mathbb m} = N!/(m_1! m_2! \ldots m_K!)$ is a multinomial coefficient (which is nonzero only when all the $m_i$ are natural numbers and sum to $N \ge 1$) and $\mathbb p ^ \mathbb m = p_1^{m_1}p_2^{m_2}\cdots p_K^{m_K}.$
By definition, the expectation of $X$ is the vector
$$\mathbb E[X] = \sum_{\mathbb m} \Pr(X = \mathbb m)\mathbb m =\sum_{\mathbb m} \binom{N}{\mathbb m}\mathbb p^\mathbb m \mathbb m$$
where the sum extends over the (finite number of) values of $\mathbb m$ for which the probability is nonzero.
Solution
By expanding the sum using the definition of the multinomial coefficients, notice that
$$1 = 1^N = (p_1 + p_2 + \cdots + p_K)^N = \sum_{\mathbb m}\binom{N}{\mathbb m}\mathbb p^\mathbb m.$$
Viewing the $p_i$ as variables, we can recognize the terms $\binom{N}{\mathbb m}\mathbb p^\mathbb m m_i$ in the $i$th component of the expectation as the result of applying the differential operator $p_i\frac{\partial}{\partial p_i}$ to the right-hand side, because $p_i\frac{\partial}{\partial p_i} \left(p_i^{m_i}\right) = m_i p_i^{m_i}.$ Another way to compute the same thing is to use the Chain Rule to differentiate the penultimate expression in the preceding multinomial expansion:
$$p_i\frac{\partial}{\partial p_i}(p_1 + p_2 + \cdots + p_K)^N = p_iN(p_1 + p_2 + \cdots + p_K)^{N-1}\frac{\partial p_i}{\partial p_i} = Np_i(1)^{N-1} = Np_i.$$
Therefore
$$\mathbb E[X] = (Np_1, Np_2, \ldots, Np_K),$$
QED.
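As a quick numerical sanity check (a simulation sketch, not part of the proof), the result can be verified in R:

set.seed(123)
N <- 10
p <- c(0.2, 0.3, 0.5)
x <- rmultinom(1e5, N, p)   # each column is one draw of the K-vector X
rowMeans(x)                 # close to N * p = (2, 3, 5)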