Your statement of the Pitman-Koopman-Darmois theorem is off: there is an additional assumption that the support $\mathcal X$ of $X_1$ does not change as $\theta$ varies, where $\theta$ parameterizes the family. A quick counterexample to the theorem as stated in the OP is the family $\{\mbox{Uniform}(0, \theta): \theta > 0\}$, for which $\max\{X_j : 1 \le j \le n\}$ is sufficient and its dimension does not vary with the sample size $n$.
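A quick numerical illustration of that counterexample (a sketch, with an assumed sample size and seed): for $\mbox{Uniform}(0, \theta)$ the likelihood is $\theta^{-n}\,\mathbb 1\{\max_i x_i \le \theta\}$, so two samples of the same size with the same maximum have identical likelihood functions no matter what the other observations are.

```python
import math
import random

def uniform_loglik(theta, xs):
    """Log-likelihood of Uniform(0, theta): -n*log(theta) if max(xs) <= theta."""
    if max(xs) > theta:
        return float("-inf")
    return -len(xs) * math.log(theta)

random.seed(0)
theta_true = 2.0
xs = [random.uniform(0, theta_true) for _ in range(1000)]

# A second sample of the same size sharing the same maximum:
ys = [random.uniform(0, max(xs)) for _ in range(999)] + [max(xs)]

# The likelihood functions coincide for every theta -- the maximum,
# a one-dimensional statistic for every n, carries all the information.
for theta in (1.9, 2.0, 2.5):
    assert uniform_loglik(theta, xs) == uniform_loglik(theta, ys)
print("likelihoods agree for all theta tested")
```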

More in the spirit of your question, the answer is yes: there do exist distributions whose sufficient statistics are "lossy" even when the conditions of the PKD theorem are satisfied. Consider $X_1, X_2, \ldots$ iid from a Gamma distribution with known shape parameter $\alpha$ and mean $\mu$, and $Z_1, Z_2, \ldots$ iid Bernoulli with success probability $p$, also known. Then take $Y_i = Z_i X_i - (1 - Z_i)$, so our sample becomes $Y_1, Y_2, \ldots$. We only get information about $\mu$ when $Y_i \ne -1$, so the sufficient statistic is $\sum_{i: Y_i \ne -1} Y_i$, which is built from about $pn$ informative observations on average.
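A hedged simulation of this construction (the specific values of $\alpha$, $\mu$, $p$, $n$, and the seed are mine, chosen for illustration): $Y_i = X_i$ when $Z_i = 1$ and $Y_i = -1$ otherwise, and the fraction of informative observations concentrates around $p$.

```python
import random

random.seed(1)
alpha, mu, p, n = 2.0, 3.0, 0.3, 100_000
scale = mu / alpha  # Gamma mean = shape * scale

# Y_i = X_i if Z_i = 1, else -1 (the uninformative value).
ys = []
for _ in range(n):
    z = random.random() < p
    ys.append(random.gammavariate(alpha, scale) if z else -1.0)

informative = [y for y in ys if y != -1.0]
print(len(informative) / n)                  # close to p = 0.3
print(sum(informative) / len(informative))   # close to mu = 3.0
```

The effective sample size for $\mu$ is thus roughly $pn$ rather than $n$, which is the sense in which the statistic is "lossy" relative to the raw sample size.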

Sublinear growth is also possible along the same train of thought, i.e. using mixtures of distributions, and this is getting at something that is useful in practice. Take $(X_1, Z_1), (X_2, Z_2), \ldots$ to be iid according to an infinite mixture of normals $f(x, z \mid \pi, \mu) = \prod_{k = 1}^{\infty} [\pi_k N(x \mid \mu_k, 1)]^{I(z = k)}$, with $Z_i$ an indicator of which cluster $X_i$ is in (I'm not sure the density representation I wrote is valid, but you should get the general idea). The dimension of the sufficient statistic should increase only when new clusters are discovered, and the rate at which new clusters appear can be controlled by taking $\{\pi_k\}_{k = 1}^{\infty}$ to be known and choosing them carefully. My hunch is that it should be easy to make the dimension grow at a rate of $\log(1 + Cn)$, since I think this is how fast the number of clusters grows under the Dirichlet process.
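To check that hunch, here is a sketch of the cluster-growth mechanism under a Chinese-restaurant-process prior (the concentration parameter $c$ and sample size are assumptions of mine): observation $n$ opens a new cluster with probability $c/(c + n - 1)$, which makes the expected number of clusters grow like $c \log(1 + n/c)$.

```python
import math
import random

random.seed(2)
c = 1.0            # concentration parameter (assumed)
counts = []        # sizes of the clusters opened so far

for n in range(1, 10_001):
    if random.random() < c / (c + n - 1):
        counts.append(1)               # open a new cluster
    else:
        # join an existing cluster with probability proportional to its size
        r = random.uniform(0, n - 1)
        acc = 0.0
        for i, sz in enumerate(counts):
            acc += sz
            if r < acc:
                counts[i] += 1
                break
        else:
            counts[-1] += 1            # guard against float edge cases

# Number of clusters vs log n -- they should be of the same order.
print(len(counts), math.log(10_000))
```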

**First question**

Inspecting the definition of the exponential family
$$
f_x(x;\theta) = c(\theta) g(x) e^{ \sum_{j=1}^l G_j(\theta) T_j(x) },
$$
one can say the following:

$T$ is a sufficient statistic: conditional on $T$, the distribution of the sample is proportional to $g(x)$, which does not depend on the parameter $\theta$. This is the definition of sufficiency. In fact, for the exponential family this conditional density is, up to normalization, the same regardless of the observed value of $T$.

The term $e^{ \sum_{j=1}^l G_j(\theta) T_j(x) }$ determines the marginal distribution of $T$, via the choice of $G_j$'s.

$c(\theta)$ is a normalization constant so the density integrates to $1$.
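As a concrete instance (the Poisson family is my choice of example here, not taken from the text): writing the pmf as $e^{-\lambda}\,\frac{1}{x!}\,e^{x \log \lambda}$ identifies $c(\lambda) = e^{-\lambda}$, $g(x) = 1/x!$, $G(\lambda) = \log \lambda$, and $T(x) = x$. A quick check that the factorization reproduces the pmf:

```python
import math

def poisson_pmf(x, lam):
    """Standard Poisson pmf."""
    return math.exp(-lam) * lam**x / math.factorial(x)

def expfam_form(x, lam):
    """c(theta) * g(x) * exp(G(theta) * T(x)) with the Poisson choices."""
    c = math.exp(-lam)           # normalization c(theta)
    g = 1.0 / math.factorial(x)  # base measure g(x)
    return c * g * math.exp(math.log(lam) * x)  # G(lam) = log(lam), T(x) = x

for lam in (0.5, 2.0, 7.3):
    for x in range(10):
        assert abs(poisson_pmf(x, lam) - expfam_form(x, lam)) < 1e-12
print("factorization matches the pmf")
```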

**Second question**

As the $G_j$'s are arbitrary, subject to measurability requirements etc., there is no general formula for computing moments. For the Poisson distribution, for example, the first moment is simply
$$
e^{-\lambda} \sum_{k = 0}^{\infty} k \frac{\lambda^k}{k!}
= \left( e^{-\lambda} \sum_{k = 1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!} \right) \cdot \lambda = \lambda.
$$

The second moment is computed similarly, giving $E[X^2] = \lambda + \lambda^2$.
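These two moments can be checked numerically by truncating the Poisson series at a large cutoff (the value of $\lambda$ and the cutoff are illustrative assumptions):

```python
import math

lam, K = 3.0, 100  # truncate the series at K; the tail is negligible for lam = 3

pmf = [math.exp(-lam) * lam**k / math.factorial(k) for k in range(K)]
m1 = sum(k * p for k, p in zip(range(K), pmf))       # E[X] = lambda
m2 = sum(k * k * p for k, p in zip(range(K), pmf))   # E[X^2] = lambda + lambda^2

print(m1, m2)  # ~3.0 and ~12.0
```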

## Best Answer

Sufficient statistics don't need to estimate anything. They merely pertain to data reduction; i.e., no information about the parameter that can be inferred from the sample is discarded. Every sample has, for example, the trivial sufficient statistic that is the sample itself: no data reduction is accomplished, but no information is lost, either.

The notion of sufficiency is relevant to estimation because sufficiency is often a desirable property of an estimator, but it is not the only one. Unbiasedness is also desirable, for example, yet an unbiased statistic need not be sufficient: suppose $X_1, \ldots, X_n$ are iid random variables drawn from a parametric distribution with finite mean $\mu$ and variance $\sigma^2$. The sample mean $\bar X$ is an estimator of $\mu$, but so is $X_1$ alone. Both are unbiased for $\mu$, but the latter is not sufficient for $\mu$.

Consistency is also a desirable property, but here again we can easily construct consistent but insufficient statistics; e.g., $(X_1 + \cdots + X_{n-1})/(n-1)$, the mean of the sample that omits the last observation, is consistent and unbiased but not sufficient because it has discarded information about $\mu$ that was present in the original sample.
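A quick sanity simulation of these examples (I assume a Normal$(\mu, 1)$ family, a sample size, and a seed for concreteness): $X_1$, the full-sample mean, and the leave-one-out mean are all unbiased for $\mu$, but the information discarded by the latter two-of-three shows up as larger variance.

```python
import random

random.seed(3)
mu, n, reps = 5.0, 20, 20_000
est_x1, est_full, est_loo = [], [], []

for _ in range(reps):
    xs = [random.gauss(mu, 1.0) for _ in range(n)]
    est_x1.append(xs[0])                    # single observation
    est_full.append(sum(xs) / n)            # full-sample mean
    est_loo.append(sum(xs[:-1]) / (n - 1))  # mean omitting the last observation

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return sum((x - m) ** 2 for x in v) / len(v)

print([round(mean(v), 2) for v in (est_x1, est_full, est_loo)])  # all ~5.0
print([round(var(v), 3) for v in (est_x1, est_full, est_loo)])   # ~1, ~1/20, ~1/19
```

All three are (approximately) centered at $\mu$, but the variance ordering reflects exactly how much of the sample each estimator uses.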

So the question of what we mean by "good" when we say "good estimators" is in fact not only relevant to the question; in a sense, it is the crux of the question you are asking. If by "good" we mean that an estimator must not needlessly discard data, then that is exactly how we motivate and ultimately define the notion of a sufficient statistic in the first place. If by "good" we mean some other notion, then, as you can see, it is not difficult to construct estimators that fail to be sufficient yet exhibit a number of other desirable properties, simply by omitting one observation from the sample.