Sufficient Statistics – What Does It Mean for a Statistic $T(X)$ to Be Sufficient for a Parameter

Tags: intuition, mathematical-statistics, probability, sufficient-statistics

I am having a hard time understanding what a sufficient statistic actually helps us do.

The definition says that

Given $X_1, X_2, …, X_n$ from some distribution, a statistic $T(X)$ is sufficient for a parameter $\theta$ if

$P(X_1, X_2, …, X_n|T(X), \theta) = P(X_1, X_2, …, X_n|T(X))$.

Meaning, if we know $T(X)$, then we cannot gain any more information about the parameter $\theta$ by considering other functions of the data $X_1, X_2, …, X_n$.

I have two questions:

  1. It seems to me that the purpose of $T(X)$ is to let us calculate the pdf of a distribution more easily. If calculating the pdf yields a probability measure, then why is it said that we cannot "gain any more information about the parameter $\theta$"? In other words, why are we focused on $T(X)$ telling us something about $\theta$ when the pdf spits out a probability measure, which isn't $\theta$?

  2. When it says we "cannot gain any more information about the parameter $\theta$ by considering other functions of the data $X_1, X_2, …, X_n$", what other functions is it talking about? Is this akin to saying that if I randomly draw $n$ samples and compute $T(X)$, then any other set of $n$ samples I draw would give the same $T(X)$?

Best Answer

I think the best way to understand sufficiency is to consider familiar examples. Suppose we flip a (not necessarily fair) coin, where the probability of obtaining heads is some unknown parameter $p$. Then the individual trials are IID ${\rm Bernoulli}(p)$ random variables, and we can think of the outcome of $n$ trials as a vector $\boldsymbol X = (X_1, X_2, \ldots, X_n)$. Our intuition tells us that for a large number of trials, a "good" estimate of the parameter $p$ is the statistic $$\bar X = \frac{1}{n} \sum_{i=1}^n X_i.$$ Now think about a situation where I perform such an experiment. Could you estimate $p$ just as well if I told you only $\bar X$, rather than the full sample $\boldsymbol X$? Sure. This is what sufficiency does for us: the statistic $T(\boldsymbol X) = \bar X$ is sufficient for $p$ because it preserves all the information we can get about $p$ from the original sample $\boldsymbol X$. (Proving this claim, however, requires a bit more explanation.)
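(A sketch of that extra explanation, using the Factorization Theorem rather than the conditional-probability definition; stated informally, the theorem says $T$ is sufficient for $p$ exactly when the joint pmf factors as $g(T(\boldsymbol x), p)\,h(\boldsymbol x)$ for some functions $g$ and $h$. Here $$P(\boldsymbol X = \boldsymbol x \mid p) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = \underbrace{p^{\sum_i x_i}(1-p)^{\,n - \sum_i x_i}}_{g\left(\sum_i x_i,\ p\right)} \cdot \underbrace{1}_{h(\boldsymbol x)},$$ so $\sum_{i=1}^n X_i$ is sufficient for $p$, and hence so is $\bar X$, which is just that sum divided by the known constant $n$.)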

Here is a less trivial example. Suppose I have $n$ IID observations taken from a ${\rm Uniform}(0,\theta)$ distribution, where $\theta$ is the unknown parameter. What is a sufficient statistic for $\theta$? For instance, suppose I take $n = 5$ samples and I obtain $\boldsymbol X = (3, 1, 4, 5, 4)$. Your estimate for $\theta$ clearly must be at least $5$, since you were able to observe such a value. But that is the most knowledge you can extract from knowing the actual sample $\boldsymbol X$. The other observations convey no additional information about $\theta$ once you have observed $X_4 = 5$. So, we would intuitively expect that the statistic $$T(\boldsymbol X) = X_{(n)} = \max \boldsymbol X$$ is sufficient for $\theta$. Indeed, to prove this, we would write the joint density for $\boldsymbol X$ conditioned on $\theta$, and use the Factorization Theorem (but I will omit this in the interest of keeping the discussion informal).
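(For completeness, a sketch of that omitted step: writing $x_{(1)}$ and $x_{(n)}$ for the sample minimum and maximum, the joint density is $$f(\boldsymbol x \mid \theta) = \prod_{i=1}^n \frac{1}{\theta}\,\mathbf{1}\{0 \le x_i \le \theta\} = \underbrace{\theta^{-n}\,\mathbf{1}\{x_{(n)} \le \theta\}}_{g(x_{(n)},\ \theta)} \cdot \underbrace{\mathbf{1}\{x_{(1)} \ge 0\}}_{h(\boldsymbol x)},$$ which is exactly the form required by the Factorization Theorem, so $X_{(n)}$ is sufficient for $\theta$.)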

Note that a sufficient statistic is not necessarily scalar-valued: it may not be possible to reduce the complete sample to a single scalar. This commonly arises when we want sufficiency for multiple parameters (which we can equivalently regard as a single vector-valued parameter). For example, a sufficient statistic for a Normal distribution with unknown mean $\mu$ and standard deviation $\sigma$ is $$\boldsymbol T(\boldsymbol X) = \left( \frac{1}{n} \sum_{i=1}^n X_i, \sqrt{\frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)^2} \right),$$ i.e., the sample mean $\bar X$ and the sample standard deviation $S$. (The first component is an unbiased estimator of $\mu$, and $S^2$ is an unbiased estimator of $\sigma^2$, although $S$ itself is slightly biased for $\sigma$.) We can show that this is the maximum data reduction that can be achieved.

Note also that a sufficient statistic is not unique. In the coin-toss example, if I give you $\bar X$, that will let you estimate $p$. But if I instead give you $\sum_{i=1}^n X_i$, you can still estimate $p$. In fact, any one-to-one function $g$ of a sufficient statistic $T(\boldsymbol X)$ is also sufficient, since you can invert $g$ to recover $T$. So for the Normal example with unknown mean and standard deviation, I could also have claimed that $\left( \sum_{i=1}^n X_i, \sum_{i=1}^n X_i^2 \right)$, i.e., the sum and the sum of squared observations, is sufficient for $(\mu, \sigma)$. Indeed, the non-uniqueness of sufficiency is even more obvious once you notice that $\boldsymbol T(\boldsymbol X) = \boldsymbol X$ is always sufficient for any parameter(s): the original sample always contains as much information about the parameter as we can gather.
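(The same informal factorization argument works here: $$f(\boldsymbol x \mid \mu, \sigma) = (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right) = (2\pi\sigma^2)^{-n/2} \exp\!\left(-\frac{\sum_i x_i^2 - 2\mu \sum_i x_i + n\mu^2}{2\sigma^2}\right),$$ which depends on the data only through $\left(\sum_i x_i, \sum_i x_i^2\right)$, so that pair is sufficient (take $h(\boldsymbol x) = 1$), and $(\bar X, S)$ is sufficient because it is a one-to-one function of the pair.)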

In summary, sufficiency is a desirable property of a statistic because it allows us to formally show that a statistic achieves some kind of data reduction. A sufficient statistic that achieves the maximum amount of data reduction is called a minimal sufficient statistic.
