Solved – Puzzled by definition of sufficient statistics

factorisation-theorem, mathematical-statistics, sufficient-statistics

I am learning about sufficient statistics from Mood, Graybill, and Boes's Introduction to the Theory of Statistics. I am slightly confused by the book's definition of a sufficient statistic for continuous random variables.

According to the book, for a random sample $X_1, X_2, X_3, …, X_n$ from a distribution $f(\cdot\,;\theta)$, a statistic $T = g(X_1, X_2, …, X_n)$ is sufficient if the conditional distribution of $(X_1, X_2, …, X_n)$ given $\{T = t\}$ does not depend on $\theta$. This is easy to understand if $X$ is discrete.

However, if $X$ is continuous, the book gives two alternative interpretations:

  1. $T$ is a sufficient statistic if $P(X_1 \leq x_1, X_2 \leq x_2, …, X_n \leq x_n | t-h \leq T \leq t+h)$ does not depend on $\theta$

  2. I find the second interpretation a little weird: $T$ is sufficient if we can bijectively transform $X_1, X_2, X_3, …, X_n$ to $T, Y_2, Y_3, Y_4, …, Y_n$ and show that the conditional distribution of $Y_2, Y_3, Y_4, …, Y_n$ given $T$ does not depend on $\theta$.

My question is, don't we already have a definition for the conditional distribution of $U$ given $V$ for continuous random variables? It is:

$ \frac{f_{U,V}(u,v)}{f_{V}(v)} $

So why not just define sufficient statistics for continuous distributions as a statistic such that $ \frac{f_{\textbf{X},T}(\textbf{x},t)}{f_{T}(t)} $ does not depend on $\theta$? I have seen some lecture notes define sufficient statistics for continuous variables this way.

According to theorem 6.1.1 of Casella and Berger, I think we can really just define $T$ to be sufficient if $ \frac{f_{\textbf{X},T}(\textbf{x},t)}{f_{T}(t)} $ does not depend on $\theta$.
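For a concrete check (a worked example of my own, not one from either book): take an i.i.d. sample from an exponential distribution with density $f(x;\theta) = \theta e^{-\theta x}$ and let $T = \sum_i X_i$, so that $T \sim \text{Gamma}(n, \theta)$. Then $$ \frac{f_{\textbf{X}}(\textbf{x};\theta)}{f_{T}(t;\theta)} = \frac{\theta^n e^{-\theta \sum_i x_i}}{\theta^n t^{n-1} e^{-\theta t}/(n-1)!} = \frac{(n-1)!}{t^{n-1}}, \qquad t = \sum_i x_i, $$ which does not involve $\theta$, so $T$ comes out as sufficient under that definition.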

Best Answer

All those interpretations seem to be variations on expressing the same thing:

Given the statistic $T$, the distribution of the sample $X$ does not depend on the true parameter $\theta$.

This means that the sample $X$, conditional on $T$, does not give any more information about the parameter $\theta$ (beyond the information already contained in the statistic $T$): in terms of the frequency/probability distribution of the possible observed samples $X$, there is no difference left that could point out anything about $\theta$.


About sufficient statistics

You might be interested in reading two works by R.A. Fisher which I believe are very good for didactic purposes (and also good for getting to know the classics):

  • A mathematical examination of the methods of determining the accuracy of an observation by the mean error, and the mean square error. (1920)

    • Here Fisher compares different statistics to estimate the $\sigma$ parameter of the normal distribution.
    • He expresses the relative sampling variance (relative standard error) for estimates of $\sigma$ based on different forms of deviation: the mean error $\sigma_1 = \sqrt{\frac{\pi}{2}}\frac{1}{n}\sum |x-\bar{x}|$, the mean square error $\sigma_2 = \sqrt{\frac{1}{n}\sum (x-\bar{x})^2}$, and variants based on sums of any other power of the deviations, $\sigma_p$.
    • He finds out that the mean square error has the lowest relative standard error, and he explores further the special properties of the mean squared error.
    • He then expresses the distribution/frequency of one statistic conditional on the other and observes that the mean square error, $\sigma_2$, is special because the distribution of $\sigma_1$, or of any other $\sigma_p$, conditional on $\sigma_2$ does not depend on $\sigma$. That means that no other statistic $\sigma_p$ can tell anything more about the parameter $\sigma$ than what $\sigma_2$ already tells about $\sigma$.
    • He mentions that the iso-surfaces of the statistic correspond to the iso-surfaces of the likelihood function, and he derives for the mean error statistic, $\sigma_1$ (whose iso-surfaces are not n-spheres but multidimensional polytopes), that this correspondence holds for the Laplace distribution (a little bit analogous to how Gauss derived the normal distribution based on the root mean square statistic).
  • Theory of statistical estimation (1925)

    Here Fisher explains several concepts such as consistency and efficiency. In relation to the concept of sufficiency he explains

    • the 'factorization theorem'
    • and the fact that a sufficient statistic, if it exists, will be a solution of the equations to obtain the maximum likelihood.

    The explanation of sufficiency is particularly clear through the use of the Poisson distribution as an example. The probability distribution function for a single observation $x$ is $$f(x) = e^{-\lambda} \frac{\lambda^x}{x!}$$ and the joint distribution of $n$ independent observations $\lbrace x_1, x_2,...,x_n \rbrace$ is $$f(x_1,...,x_n) = e^{-n\lambda} \frac{\lambda^{n\bar{x}}}{x_1!x_2!...x_n!} $$ which can be factorized into $$f(\bar{x}) \cdot f(x_1,...,x_n |\bar{x}) = e^{-n\lambda} \frac{(n\lambda)^{n\bar{x}}}{\left( n\bar{x} \right)!} \cdot \frac{\left( n\bar{x} \right) !}{n^{n\bar{x}}x_1!x_2!...x_n!} $$ which is the product of (1) the distribution function of the statistic, $f(\bar{x})$, and (2) the distribution function of the partitioning of the total $n\bar{x}$ over $x_1,...,x_n$, which you can intuitively see as a conditional distribution density $f(x_1,...,x_n |\bar{x})$. Note that the latter term does not depend on $\lambda$.
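As a small numerical check of that factorization (a sketch of my own, not something in Fisher's paper), the ratio of the joint Poisson probability to the Poisson($n\lambda$) probability of the total equals the multinomial probability of distributing the total over the $n$ observations with equal cell probabilities $1/n$, whatever the value of $\lambda$:

```python
import numpy as np
from scipy.stats import poisson, multinomial

# Sketch: for a Poisson sample, the conditional probability of (x_1, ..., x_n)
# given the total n*xbar is multinomial with equal cell probabilities 1/n,
# and hence free of lambda.
x = np.array([2, 0, 3, 1])             # an arbitrary sample of size n = 4
n, s = len(x), x.sum()                 # s = n * xbar = 6

conditional = multinomial.pmf(x, n=s, p=np.full(n, 1 / n))

for lam in (0.5, 2.0, 7.0):
    joint = poisson.pmf(x, lam).prod() # f(x_1, ..., x_n | lambda)
    total = poisson.pmf(s, n * lam)    # f(n*xbar | lambda): Poisson(n*lambda)
    print(lam, joint / total, conditional)   # the ratio equals `conditional` for every lambda
```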


Related to your two interpretations

  1. If the PDF $f(x_1,...,x_n |\bar{x})$ is independent of $\theta$, then so is the integrated probability (the CDF): $$\begin{multline}P(a_1<X_1<b_1,...,a_n<X_n<b_n |\bar{x}) = \\ = \int_{x_1 = a_1}^{x_1 = b_1} ... \int_{x_n =a_n}^{x_n = b_n} f(x_1,...,x_n |\bar{x}) d x_1 d x_2 ... d x_n\end{multline}$$

  2. You suggest simply using $\frac{f_{X,T}(x,t)}{f_T(t)}$, but it might not always be easy to construct such an expression. The factorization already works if you can split the likelihood into: $$f(x_1,...,x_n|\theta) = h(x_1,...,x_n) \cdot g(T(x)|\theta) $$ where only the factor $g(T(x)|\theta)$ depends on the parameter(s) $\theta$, and it depends on the data only through the statistic $T(x)$. Now note that it doesn't really matter how you express $h(x)$: you can just as well express this function in terms of other coordinates $y$ that relate to $x$, as long as that part is independent of $\theta$.

    For instance, the factorization with the Poisson distribution could already have been finished by writing: $$f(x_1,...,x_n) = \underbrace{e^{-n\lambda} \lambda^{n\bar{x}} \vphantom{\frac{1}{x_1!x_2!...x_n!}}}_{g(T(x)|\theta)} \cdot \underbrace{\frac{1}{x_1!x_2!...x_n!}}_{h_x(x_1,...,x_n)}$$ where the first term depends only on $\bar{x}$ and $\lambda$ and the second term does not depend on $\lambda$. So there is no need to look further for $\frac{f_{X,T}(x,t)}{f_T(t)}$.

    In the book's second interpretation there is also a drop of one variable: you do not have $Y_1,...,Y_n$ but one fewer, $Y_2,...,Y_n$. One example where this is useful is when you use $T = \max \lbrace X_i \rbrace$ as the statistic for a sample from the uniform distribution, $X_i \sim U(0,\theta)$. If you let $Y_i$ denote the $i$-th largest of the $X_i$, then it is very easy to express the conditional probability distribution $P(Y_i \vert T)$, but expressing $P(X_i \vert T)$ is a bit more difficult (see Conditional distribution of $(X_1,\cdots,X_n)\mid X_{(n)}$ where $X_i$'s are i.i.d $\mathcal U(0,\theta)$ variables).
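To make the uniform example above concrete, here is a small simulation sketch (my own illustration, assuming $X_i \sim U(0,\theta)$ and $T = \max_i X_i$): the non-maximal observations rescaled by the observed maximum have the same distribution whatever $\theta$ is, which is the sufficiency of $T$ in action.

```python
import numpy as np
from scipy.stats import ks_2samp

# Sketch: for X_i ~ U(0, theta), given T = max(X) the remaining observations are
# i.i.d. U(0, T), so the rescaled values X_i / T of the non-maximal X_i have a
# distribution that does not depend on theta.
rng = np.random.default_rng(1)

def scaled_rest(theta, n=5, reps=20_000):
    x = rng.uniform(0, theta, size=(reps, n))
    t = x.max(axis=1, keepdims=True)
    ratios = (x / t).ravel()
    return ratios[ratios < 1.0]        # drop the maxima themselves (ratio == 1)

a = scaled_rest(theta=1.0)
b = scaled_rest(theta=50.0)
print(ks_2samp(a, b))                  # large p-value: the two samples look alike
```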


What your textbook says

Note that your textbook already explains why it is giving these alternative interpretations.

In the case of sampling from a probability density function, the meaning of the term "the conditional distribution of $X_1, ... , X_n$ given $S=s $" that appears in Definition 15 may not be obvious since then $P[S=s]=0$

and the alternative interpretations relate not so much to 'the concept of sufficiency' as to 'the concept of a probability density function not directly expressing probabilities'.

  1. The expression in terms of the cumulative distribution function (which does relate to a probability) is one way to circumvent it.

  2. The expression in terms of the transformation is a particular way to express the factorization theorem. Note that $f(X_1,...,X_n)$ depends on $\theta$, but $f(Y_2,...,Y_n)$, where the $T$ term is separated off, is independent of $\theta$ (e.g. the example in the book shows that for normally distributed variables with unknown mean $\mu$ and known variance 1, the differences $Y_i = X_i-X_1$ are distributed as $Y_i \sim N(0,2)$, thus independently of $\mu$; a small simulation check of this example appears after this list).

  3. A variation of the second interpretation (which dealt with the trouble that $f(X_1,...,X_n)$ is not independent of $\theta$) could be to show that $f(X_1,...,X_n)$, when constrained to the iso-surface where the sufficient statistic is constant, no longer depends on $\theta$.

    This is sort of the geometrical interpretation that Fisher had. I am not sure why they use the more confusing interpretation. Possibly one may not see this interpretation, a sort of conditional probability density function that is analogous to a conditional probability, as theoretically clean.
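As mentioned in point 2, here is a quick simulation check of the book's normal example (a sketch of my own under the setup $X_i \sim N(\mu, 1)$): the differences $Y_i = X_i - X_1$ come out with mean $0$ and variance $2$ no matter what $\mu$ is.

```python
import numpy as np

# Sketch: for X_i ~ N(mu, 1), each difference Y_i = X_i - X_1 is N(0, 2),
# so its distribution does not depend on mu.
rng = np.random.default_rng(2)

for mu in (0.0, 10.0, -3.5):
    x = rng.normal(mu, 1.0, size=(100_000, 4))
    y = x[:, 1:] - x[:, [0]]           # Y_2, Y_3, Y_4
    print(f"mu={mu}: mean={y.mean():.3f}, var={y.var():.3f}")   # about 0 and 2 for every mu
```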


About the expression $\frac{f_{X,T}(\mathbf{x},t)}{f_T(t)}$

Note that $f_{X,T}(\mathbf{x},t)$ is not easy to express, since $T$ depends on $\mathbf{X}$ and not every combination of $\mathbf{x}$ and $t$ is possible (so you are dealing with a function that is non-zero only on the surface in the space of $(\mathbf{X},T)$ where $t$ and $\mathbf{x}$ are correctly related).

If you drop one of the variables in the vector $\mathbf{x}$, then it does become more suitable, and this is very close to the transformation to the coordinates $y$, where you also have one variable fewer.

However, this division is not too strange. The sufficient statistic is the one for which the dependence of the density $f_\mathbf{X}(\mathbf{x})$ on $\theta$ is the same for every $\mathbf{x}$ with the same value of $T$, so you should be able to divide that $\theta$-dependent factor out (and the same works with any other function $g(t,\theta)$; it does not necessarily need to be the probability density $f_T(t,\theta)$).
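As a concrete instance of this last point (a worked example of my own, not from the question or the book): for a uniform sample $X_i \sim U(0,\theta)$ with $T = \max_i X_i$ the joint density is $$ f_{\mathbf{X}}(\mathbf{x};\theta) = \theta^{-n}\,\mathbf{1}\lbrace \max_i x_i \le \theta \rbrace = \theta^{-n}\,\mathbf{1}\lbrace t \le \theta \rbrace, \qquad t = \max_i x_i, $$ which depends on $\mathbf{x}$ only through $t$. Dividing by $f_T(t,\theta) = n t^{n-1}\theta^{-n}\,\mathbf{1}\lbrace t \le \theta \rbrace$ leaves $\frac{1}{n t^{n-1}}$, free of $\theta$; dividing instead by the simpler $g(t,\theta) = \theta^{-n}\,\mathbf{1}\lbrace t \le \theta \rbrace$ would leave just $1$.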
