[Math] Sufficient statistics vs. Bayesian sufficient statistics

probability, statistics

Given sample data $x_1, \ldots, x_n$ generated from a probability distribution $f(x|\theta)$ ($\theta$ being an unknown parameter), a statistic $T(x_1, \ldots, x_n)$ of the sample data is called sufficient if $f(x|\theta, t) = f(x|t)$.
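(To make that definition concrete for myself, here is a minimal numerical sketch, not part of the definition itself, for an iid Bernoulli($p$) sample with $T(x) = \sum_i x_i$; the function name below is just illustrative. Direct computation gives $f(x|\theta, t) = 1/\binom{n}{t}$, which does not involve $p$.)

```python
from math import comb

def cond_prob_given_T(x, p):
    """P(X = x | p, T = t) for an iid Bernoulli(p) sample with T(x) = sum(x).
    This works out to 1 / C(n, t), so it should not depend on p."""
    n, t = len(x), sum(x)
    joint = p**t * (1 - p)**(n - t)             # P(X = x | p)
    p_T = comb(n, t) * p**t * (1 - p)**(n - t)  # P(T = t | p)
    return joint / p_T                          # = 1 / C(n, t), free of p

x = [1, 0, 1, 1, 0]
print([round(cond_prob_given_T(x, p), 6) for p in (0.2, 0.5, 0.9)])
# same value 1/C(5,3) = 0.1 for every p, illustrating f(x|theta,t) = f(x|t)
```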

However, I'm always kinda confused by this definition, since I think of a sufficient statistic as a function that gives just as much information about $\theta$ as the original data itself (which seems a little different from the definition above).

The definition of Bayesian sufficiency, on the other hand, does mesh with my intuition: $T$ is a Bayesian sufficient statistic if $f(\theta|t) = f(\theta|x)$.

So why is the first definition of sufficiency important? What does it capture that Bayesian sufficiency doesn't, and how should I think about it?

[Note: I believe that every sufficient statistic is also Bayesian sufficient, but not conversely (the reverse implication doesn't hold in the infinite-dimensional case, according to Wikipedia).]
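[For what it's worth, here is a quick sketch (mine, not from Wikipedia) of the forward implication, using the Fisher–Neyman factorization $f(x|\theta) = g(t,\theta)\,h(x)$: for any prior $\pi$, $$f(\theta|x) = \frac{f(x|\theta)\,\pi(\theta)}{\int f(x|\theta')\,\pi(\theta')\,d\theta'} = \frac{g(t,\theta)\,h(x)\,\pi(\theta)}{\int g(t,\theta')\,h(x)\,\pi(\theta')\,d\theta'} = \frac{g(t,\theta)\,\pi(\theta)}{\int g(t,\theta')\,\pi(\theta')\,d\theta'},$$ which depends on the data only through $t$, so it must equal $f(\theta|t)$.]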

Best Answer

This response addresses part of grautur's question: how to think about the property of being "sufficient".

R. A. Fisher suggested the terminology "sufficient" for a statistic $t$ that satisfies the [heuristic] requirement $$f(x|\theta, t) = f(x|t). \tag{1}$$

[There is nothing heuristic about this way of writing the requirement if $x$ is discrete. But if $x$ is continuous, the pdf notation $f(x|t)$ is usually heuristic. An example of the latter is when $x = (x_1,\dots, x_n)$, where the $\{x_i\}$ are iid $N(\theta,1)$. Here the sample mean $\bar x$ is a sufficient statistic, and $(1)$ requires considering the conditional pdf $$f(x_1,\dots, x_n|\bar x) \equiv f(x_1 - \bar x,\dots, x_n - \bar x \,|\, \bar x) = f(x_1 - \bar x,\dots, x_n - \bar x). \tag{2}$$ The last equality in $(2)$ holds because the sample deviations are independent of the sample mean in the normal case. (This fact gives an instant proof that the sample mean and variance are independent in the normal case.) Here the "pdf" on the RHS of $(2)$ does not exist as an $n$-dimensional pdf, since the joint distribution of the sample deviations is singular: they sum to zero. Replacing "$f$" in $(1)$ by "dist" removes the heuristic nature of the requirement.]
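The independence claim behind the last equality in $(2)$ is easy to probe numerically. Here is a minimal Monte Carlo sketch (my own, using numpy; the sample size, mean, and number of replications are arbitrary): since $\bar x$ and the deviations $x_i - \bar x$ are jointly normal, near-zero empirical correlations are what independence looks like here.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 2.0, 5, 200_000            # true mean, sample size, Monte Carlo reps

x = rng.normal(theta, 1.0, size=(reps, n))  # reps iid N(theta, 1) samples of size n
xbar = x.mean(axis=1)                       # sample means
dev = x - xbar[:, None]                     # sample deviations x_i - xbar

# Empirical correlation between the sample mean and each deviation.
# For jointly normal quantities, zero correlation is equivalent to independence,
# so these should all be ~0 up to Monte Carlo noise.
corrs = [np.corrcoef(xbar, dev[:, i])[0, 1] for i in range(n)]
print(np.round(corrs, 3))
```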

At any rate, the idea of $(1)$ is that, since the conditional distribution of $x|t$ does not depend on $\theta$, one can [in principle] generate a new $x$, call it $x^*$, from the known conditional distribution of $x|t$, so that $x^* \sim x$ for all $\theta$.

As an illustration, consider the normal example above. Here, obtaining an $x^*$ is particularly easy, since the deviations are independent of $\bar x$: generate $n$ iid $N(0,1)$ variables $z_1,\dots, z_n$ and let $x^* = (x_1^*,\dots, x_n^*)$, where $x_i^* = z_i - \bar z + \bar x$ for $1\le i\le n$. Clearly $x \sim x^*$ for all values of $\theta$, so $x^*$ is just as "good" a sample from the population as the original $x$ for learning about $\theta$.

Clearly no one would actually want to use the sample deviations of $x^*$ in addition to its sufficient statistic $\bar x^* = \bar x$, as those deviations, the $\{z_i - \bar z\}$, were obtained by a completely extraneous random experiment having nothing to do with the actual data process. One should then agree that the original sample deviations $\{x_i - \bar x\}$ should also not be used, since they can be regarded as generated in the same way, by an extraneous random experiment, where now nature, rather than the statistician, did the generating.
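To see the construction in action, here is a small simulation sketch of the $x^*$ idea (my own code, using numpy and scipy; the sample size, mean, and seed are arbitrary): fresh standard normal deviations are recentred and attached to the observed $\bar x$, and the result is compared with the original sample. The two-sample Kolmogorov–Smirnov comparisons should typically come out non-significant, consistent with $x \sim x^*$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
theta, n, reps = 2.0, 5, 100_000

x = rng.normal(theta, 1.0, size=(reps, n))   # the "real" data
xbar = x.mean(axis=1, keepdims=True)

# Extraneous experiment: fresh N(0,1) draws, recentred to have zero mean,
# then shifted by the observed sufficient statistic xbar.
z = rng.normal(0.0, 1.0, size=(reps, n))
x_star = z - z.mean(axis=1, keepdims=True) + xbar

# x and x_star should have the same distribution for every theta.
# Compare a marginal (first coordinate) and a joint summary (sample variance).
print(stats.ks_2samp(x[:, 0], x_star[:, 0]))
print(stats.ks_2samp(x.var(axis=1, ddof=1), x_star.var(axis=1, ddof=1)))
```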
