Finding a two-dimensional sufficient statistic for support set $\{1,2,3\}$.

Tags: probability-distributions, statistical-inference, statistics

Let $X_1, \ldots, X_n$ be an i.i.d. random sample from an arbitrary discrete distribution $p$ on $\{1,2,3\}$. Find a two-dimensional sufficient statistic.

My try: For each random variable $X_i$ we can write the following
\begin{equation}
\mathbb{P}(X_i=k)=
\begin{cases}
p_1 \quad \quad \quad \quad \quad k=1\\
p_2 \quad \quad \quad \quad \quad k=2\\
1-p_1-p_2 \quad k=3
\end{cases}
\end{equation}

Now define the likelihood $$L(X, p_1, p_2)= \prod_{i=1}^n p_1^{I(x_i=1)}p_2^{I(x_i=2)}(1-p_1-p_2)^{I(x_i=3)}$$
where $X=(X_1, \ldots, X_n)$ and $I$ is the indicator function.
I do not know how to simplify the likelihood and use it to get sufficient statistic.

Best Answer

Step back a bit and think about what this means in practice.

Note that the distribution $p$ is just a discrete categorical distribution; i.e., $$\Pr[X = i] = p_i, \quad i \in \{1, 2, 3\},$$ for some probabilities $p_i$ satisfying $0 \le p_i \le 1$ and $p_1 + p_2 + p_3 = 1$. This is essentially what you have written.

Now, say I know the values $p_1, p_2, p_3$, and I use these to generate a random sample $X_1, \ldots, X_n$. If I tell you that $n = 7$ and the sample is $$(1, 2, 1, 1, 3, 1, 2),$$ does it matter in which positions each $1$, $2$, and $3$ is observed? If I instead told you simply how many of each number is present in the sample (in this case, it would be $4$ ones, $2$ twos, and $1$ three), does this discard any information about the $p_i$ that was present in the original sample?

So, thinking in this way, it should become clear that we can achieve data reduction by simply keeping track of the number of occurrences of each value in the sample. In fact, we don't even need to keep track of all three counts: once we know that there were $4$ ones and $2$ twos, and the sample size $n = 7$ is given, the number of threes is uniquely determined. This is analogous to the fact that such a distribution has only two free parameters, because of the constraint $p_1 + p_2 + p_3 = 1$.
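As a quick illustration of this counting idea (a hypothetical Python sketch, using the $n = 7$ sample from above):

```python
from collections import Counter

# The n = 7 sample from above; the positions of the values carry no
# information about the p_i, only the counts do.
sample = (1, 2, 1, 1, 3, 1, 2)
counts = Counter(sample)
n = len(sample)

# Keep only the counts of ones and twos; the count of threes is
# determined as n - T_1 - T_2.
T = (counts[1], counts[2])
print(T)                 # (4, 2)
print(n - T[0] - T[1])   # 1, the number of threes
```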

Now, what we need to do is use notation to express the intuition above. We write the PMF for $X$ as $$\Pr[X = x] = \prod_{j=1}^3 p_j^{\mathbb 1 (x = j)}.$$ This is equivalent to the way we wrote it originally, and it is consistent with your computation. The joint likelihood for the sample is then $$\mathcal L(p_1, p_2, p_3 \mid \boldsymbol x) = \prod_{i=1}^n \Pr[X_i = x_i] = \prod_{i=1}^n \prod_{j=1}^3 p_j^{\mathbb 1 (x_i = j)} = \prod_{j=1}^3 p_j^{\sum_{i=1}^n \mathbb 1 (x_i = j)}.$$

Of course we can write this out and explicitly eliminate $p_3$ and the statistic $\sum_{i=1}^n \mathbb 1(x_i = 3)$: $$\mathcal L(p_1, p_2 \mid \boldsymbol x) = p_1^{\sum_{i=1}^n \mathbb 1(x_i = 1)} p_2^{\sum_{i=1}^n \mathbb 1(x_i = 2)} (1 - p_1 - p_2)^{n - \sum_{i=1}^n \mathbb 1(x_i = 1) - \sum_{i=1}^n \mathbb 1(x_i = 2)}.$$

Now we can apply the factorization theorem: $\boldsymbol T(\boldsymbol x)$ is sufficient for $\boldsymbol \theta = (p_1, p_2)$ if $$\mathcal L(p_1, p_2 \mid \boldsymbol x) = h(\boldsymbol x)\, g(\boldsymbol T(\boldsymbol x) \mid p_1, p_2).$$ We choose $$\boldsymbol T(\boldsymbol x) = \left( \sum_{i=1}^n \mathbb 1(x_i = 1), \sum_{i=1}^n \mathbb 1(x_i = 2)\right);$$ that is to say, the two-dimensional sufficient statistic is simply the ordered pair of the numbers of ones and twos in the sample, respectively. The remaining choices are $$\begin{align*} h(\boldsymbol x) &= 1, \\ g(\boldsymbol T \mid p_1, p_2) &= p_1^{T_1} p_2^{T_2} (1-p_1 - p_2)^{n - T_1 - T_2}. \end{align*}$$
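The factorization can also be checked numerically. The following hypothetical Python sketch (the function names `likelihood` and `g` are made up for illustration) verifies that the likelihood computed point by point from the full sample equals $g(\boldsymbol T \mid p_1, p_2)$ evaluated at the counts, and that a reordered sample with the same counts gives the same value:

```python
import math

def likelihood(sample, p1, p2):
    """Joint likelihood computed point by point from the full sample."""
    p = {1: p1, 2: p2, 3: 1 - p1 - p2}
    return math.prod(p[x] for x in sample)

def g(T1, T2, n, p1, p2):
    """g(T | p1, p2) from the factorization; here h(x) = 1."""
    return p1**T1 * p2**T2 * (1 - p1 - p2)**(n - T1 - T2)

sample = (1, 2, 1, 1, 3, 1, 2)   # counts: 4 ones, 2 twos, 1 three
p1, p2 = 0.5, 0.3

L_full = likelihood(sample, p1, p2)
L_fact = g(4, 2, len(sample), p1, p2)
print(math.isclose(L_full, L_fact))   # True

# A reordered sample with the same counts has the same likelihood,
# which is exactly what sufficiency of T says.
shuffled = (3, 1, 1, 2, 1, 2, 1)
print(math.isclose(likelihood(shuffled, p1, p2), L_full))   # True
```

This is only a spot check for one parameter value, not a proof; the factorization theorem is what establishes sufficiency.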
