Sufficient statistic of a discrete distribution that can take $n$ values.

statistical-inference, statistics

Use the factorization criterion to determine a sufficient statistic based on a sample of size $N$, where each observation comes from a family of distributions of a random variable that can take a finite list of values $x_1,\cdots,x_n$ with probabilities $p_1,\cdots,p_n$, respectively.

My attempt

The factorization theorem establishes that, for a random sample $\vec{x}=(x_1,\cdots,x_n)$, a statistic $T$ is sufficient exactly when it is possible to write:
$$f(\vec{x}|\theta)=g(T(\vec{x})|\theta)h(\vec{x}),\ \ \ \ \ \ \ \ \ \ \ \ \ (1)$$

where $f(\vec{x}|\theta)$ is the sample density, $g(T(\vec{x})|\theta)$ is a function that depends on $\theta$ and on the sample only through $T(\vec{x})$, and $h(\vec{x})$ does not depend on $\theta.$

One way to write the discrete density of the given random variable is via indicator functions, as follows:

$$f(X=y_1|p_1,\cdots,p_n)=p_1{\bf 1}_{x_1}(y_1)+\cdots+p_n{\bf 1}_{x_n}(y_1).$$

So I'm trying to write the joint distribution of an arbitrary sample $y_1,y_2,\cdots,y_N$:

\begin{eqnarray}
f(y_1,\cdots,y_N|p_1,\cdots,p_n)&=&\displaystyle\prod_{i=1}^N f(y_i|p_1,\cdots,p_n)\\
&=&\displaystyle\prod_{i=1}^N\left(\sum_{j=1}^n p_j{\bf{1}}_{x_j}(y_i)\right).
\end{eqnarray}

Is it possible to write the previous equation in the form of (1)?

Best Answer

It is not convenient to write the density as a sum. Since exactly one of the indicators equals $1$ for any given observation, the density can be rewritten as a product: $$ f(X=y_1|p_1,\cdots,p_n)=p_1^{{\bf 1}_{x_1}(y_1)}\cdots p_n^{{\bf 1}_{x_n}(y_1)}. $$ Then \begin{eqnarray} f(y_1,\cdots,y_N|p_1,\cdots,p_n)&=&\prod_{i=1}^Nf(y_i|p_1,\cdots,p_n)\\ &=&\prod_{i=1}^N\prod_{j=1}^n p_j^{{\bf{1}}_{x_j}(y_i)} \\ &=& \prod_{j=1}^n p_j^{\sum_{i=1}^N{\bf{1}}_{x_j}(y_i)}. \tag{2} \end{eqnarray}

Expression (2) depends on $p_1,\ldots,p_n$ and on the sample only through the sufficient statistic $$ T(y_1,\ldots,y_N) = \left(\sum_{i=1}^N{\bf{1}}_{x_1}(y_i),\ldots,\sum_{i=1}^N{\bf{1}}_{x_{n-1}}(y_i)\right). $$ Note that there are really only $n-1$ unknown parameters, since $p_n=1-p_1-\ldots-p_{n-1}$. Likewise, $T$ is $(n-1)$-dimensional, since the omitted last sum of indicators can be recovered from the others: $$ \sum_{i=1}^N{\bf{1}}_{x_n}(y_i) = N-\sum_{j=1}^{n-1}\sum_{i=1}^N{\bf{1}}_{x_j}(y_i). $$ Expression (2) can therefore be rewritten as $$ f(y_1,\cdots,y_N|p_1,\cdots,p_n) = \prod_{j=1}^{n-1} p_j^{\sum_{i=1}^N{\bf{1}}_{x_j}(y_i)}\times (1-p_1-\ldots-p_{n-1})^{N-\sum_{j=1}^{n-1}\sum_{i=1}^N{\bf{1}}_{x_j}(y_i)} $$ $$ =g(T,p_1,\ldots,p_{n-1})\cdot \underbrace{h(y_1,\ldots,y_N)}_{1}. $$
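As a quick numerical sanity check of the factorization, here is a minimal Python sketch. The function names, the values `"a", "b", "c"`, and the probabilities are all made up for illustration and are not part of the original problem. For simplicity it keeps all $n$ counts rather than dropping the last one; the last count is redundant because the counts sum to $N$.

```python
from collections import Counter
from math import isclose, prod


def count_statistic(sample, values):
    """Return the vector of counts: how many times each value occurs in the sample."""
    counts = Counter(sample)
    return tuple(counts[v] for v in values)


def likelihood_from_sample(sample, values, probs):
    """Joint probability as a product over the N observations."""
    p = dict(zip(values, probs))
    return prod(p[y] for y in sample)


def likelihood_from_counts(counts, probs):
    """Joint probability as a product over the n values: prod_j p_j ** count_j."""
    return prod(pj ** c for pj, c in zip(probs, counts))


# Hypothetical example: n = 3 values, N = 4 observations.
values = ("a", "b", "c")
probs = (0.2, 0.5, 0.3)
sample_1 = ["b", "a", "b", "c"]
sample_2 = ["c", "b", "b", "a"]  # same counts, different order

t1 = count_statistic(sample_1, values)
t2 = count_statistic(sample_2, values)

# Both samples share the same statistic, so they have the same likelihood,
# and that likelihood can be computed from the counts alone.
l1 = likelihood_from_sample(sample_1, values, probs)
l2 = likelihood_from_sample(sample_2, values, probs)
lc = likelihood_from_counts(t1, probs)
print(t1, t2, isclose(l1, l2), isclose(l1, lc))
```

Two samples with the same counts give the same likelihood for any choice of the probabilities, which is what $h\equiv 1$ in the factorization expresses.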
