Solved – Understanding multinomial distribution

Tags: distributions, mathematical-statistics, multinomial-distribution, probability

There are $K$ categories. Let $x$ be a discrete random variable taking values in $1, 2, \ldots, K$.
Suppose we are given data $\textbf{X}= \{x_1, x_2, \ldots, x_N\}$, and let $\textbf{p}$ be a vector of probabilities with $p(x = k) = p_k$.

Normally, we see the following:

  1. Let $N_k = \sum_i I(x_i = k)$, i.e. the number of times outcome $k$ occurs in $N$ trials. We then model the set of counts $N_k$ with a multinomial distribution.

$$p(N_1, \ldots, N_K | \textbf{p}) = \frac{N!}{N_1! \cdots N_K!} \prod^K_{k=1} {p_k}^{N_k}$$

This is what we see normally.

However, I came across something interesting on the web, and my question is: can the following also be a multinomial distribution, and why?

What if we model $\textbf{X}$ itself with a multinomial distribution?
$$p(\textbf{X} | \textbf{p}) = \prod^{K}_{k=1} {p_k}^{N_k}$$

Is it still a valid pmf for a multinomial distribution? At first, I thought it was impossible because it lacks the multinomial coefficient (the number of permutations), which is the normalizing constant that makes the pmf sum to 1. But the research paper indicates that both of these are multinomial distributions, although they model different things.

Best Answer

Suppose you roll a 6-sided die $N$ times.

The outcome of roll $i$, $i=1,\ldots,N$, is represented by the random variable $X_i$. The tuple $\mathbf{X}=\left(X_1,\ldots,X_N\right)$ contains the outcome of each roll.

We can obtain category-level count information from $\mathbf{X}$ by taking $N_j=\sum_{i=1}^{N}\delta\left(X_i=j\right)$, $j=1,\ldots,6$. The tuple $\mathbf{N}=\left(N_1,\ldots,N_6\right)$ contains the counts for each category.

What's the difference between having $\mathbf{X}$ and $\mathbf{N}$? They both arise from the same $N$ independent multinomial trials with six possible outcomes, each with equal probability of occurring. However, when we discuss probability with respect to $\mathbf{X}$, we are talking about the probability of a specific sequence of outcomes. When we discuss probability with respect to $\mathbf{N}$, we are talking about the probability of a specific set of counts. There is a normalizing factor with the trial-level information too, but it's just $1$, because there is only one way to get any specific sequence of outcomes, whereas $\frac{N!}{N_1! \cdots N_6!}$ different sequences produce the same set of counts.
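To make the distinction concrete, here is a minimal brute-force sketch. The probabilities `p`, the number of trials `N`, and the helper names `sequence_prob` and `counts_prob` are all hypothetical choices for illustration, not something taken from the paper. It enumerates every ordered sequence for a small case and checks that the sequence-level probabilities already sum to 1 without any coefficient, and that adding up the sequences sharing a count vector reproduces the multinomial pmf for the counts.

```python
from collections import Counter
from itertools import product
from math import factorial, isclose, prod


def sequence_prob(seq, p):
    """Probability of one specific ordered sequence: prod_k p_k^{N_k}."""
    return prod(p[x] for x in seq)


def counts_prob(counts, p):
    """Multinomial pmf of a count vector: N!/(N_1!...N_K!) * prod_k p_k^{N_k}."""
    coef = factorial(sum(counts))
    for c in counts:
        coef //= factorial(c)  # stays an integer at every step
    return coef * prod(pk ** c for pk, c in zip(p, counts))


# Hypothetical example values (assumptions, not from the source):
p = [0.5, 0.3, 0.2]  # K = 3 categories, indexed 0..K-1 here
N = 4                # number of trials
K = len(p)

# Sum the sequence-level probabilities over all K^N ordered sequences.
total_seq = sum(sequence_prob(s, p) for s in product(range(K), repeat=N))

# Group the sequences by their count vector and add up their probabilities.
by_counts = Counter()
for s in product(range(K), repeat=N):
    counts = tuple(s.count(k) for k in range(K))
    by_counts[counts] += sequence_prob(s, p)

# Every grouped total should match the multinomial pmf for that count vector.
matches = all(isclose(v, counts_prob(c, p)) for c, v in by_counts.items())
total_counts = sum(by_counts.values())
```

Under these assumptions, `total_seq` and `total_counts` both come out as 1 up to floating-point error, and `matches` is `True`: the two formulas are pmfs over different sample spaces (ordered sequences versus count vectors), so each is normalized on its own.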

EDIT: The second section of the paper actually discusses when to use counts and when to use samples.
