This response addresses part of grautur's question: how to think about the property of being "sufficient".
R. A. Fisher suggested the terminology "sufficient" for a statistic $t$ that satisfies the [heuristic] requirement $$f(x|\theta, t) = f(x|t).\tag{1}$$
[There is nothing heuristic about this way of writing the requirement if $x$ is discrete. But if $x$ is continuous, the pdf notation $f(x|t)$ is usually heuristic. An example of the latter is when $x = (x_1,\dots, x_n)$, where the $\{x_i\}$ are iid $N(\theta,1)$. Here the sample mean $\bar x$ is a sufficient statistic and $(1)$ requires considering the conditional pdf $$f(x_1,\dots, x_n|\bar x) \equiv f(x_1 - \bar x,\dots, x_n - \bar x | \bar x) = f(x_1 - \bar x,\dots, x_n - \bar x).\tag{2}$$ The last equality in $(2)$ holds since the sample deviations are independent of the sample mean in the normal case. (This fact gives an instant proof that the sample mean and variance are independent in the normal case.) Here the "pdf" on the RHS of $(2)$ does not exist as an $n$-dimensional pdf, since the joint distribution of the sample deviations is singular: they sum to zero. Replacing "$f$" in $(1)$ by "dist" removes its heuristic nature.]
At any rate, the idea of $(1)$ is that, since the conditional distribution of $x|t$ does not depend on $\theta$, one can [in principle] generate a new $x$ (call it $x^*$, say) from the known conditional distribution of $x|t$, so that $x^* \sim x$ for all $\theta$.
As an illustration, consider the normal example above. Here, obtaining an $x^*$ is particularly easy since the deviations are independent of $\bar x$. Thus one can generate $n$ iid $N(0,1)$ variables $z_1,\dots, z_n$ and let $x^* = (x_1^*,\dots, x_n^*)$, where $x_i^* = z_i - \bar z + \bar x$ for $1\le i\le n$.
Clearly $x \sim x^*$ for all values of $\theta$, so that $x^*$ is just as "good" a sample from the population as the original $x$ for learning about $\theta$. Clearly no one would actually want to use the sample deviations of $x^*$ in addition to its sufficient statistic $\bar x^* = \bar x$, as those deviations, the $\{z_i - \bar z\}$, were obtained by a completely extraneous random experiment having nothing to do with the actual data process. One then should agree that the original sample deviations $\{x_i - \bar x\}$ should also not be used, since they can be considered as being generated in the same way, by an extraneous random experiment, where now nature, rather than the statistician, did the generating.
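A minimal simulation sketch of this construction (assuming NumPy; the function name `resample_given_mean` is just illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_given_mean(x, rng):
    """Given a sample x from N(theta, 1), build x* by attaching fresh,
    extraneous standard-normal deviations to the sufficient statistic x-bar."""
    z = rng.standard_normal(len(x))   # extraneous N(0,1) draws
    return z - z.mean() + x.mean()    # x*_i = z_i - z-bar + x-bar

theta = 3.0
x = rng.normal(theta, 1.0, size=10)
x_star = resample_given_mean(x, rng)

# x and x* share the same sample mean; over repeated draws of x,
# x* has the same N(theta, 1) sampling distribution as x.
print(x.mean(), x_star.mean())
```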
The joint probability mass function (pmf) of the sample is
$$\prod_{i=1}^n (1-\pi)^{x_i} \pi = \pi^n (1-\pi)^{\sum_{i=1}^n x_i}.$$
Let $T(x) = \sum_{i=1}^n x_i$. We see that the pmf can be expressed as a function that depends on the sample only through $T(x)$. Thus $T(x)$ is a sufficient statistic for $\pi$, by the Neyman-Fisher factorization theorem.
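As a quick numerical check (a sketch using NumPy; the two helper functions are just for illustration), the log-likelihood computed from the full sample matches the one computed from $T(x)=\sum_i x_i$ and $n$ alone:

```python
import numpy as np

def loglik_from_sample(x, pi):
    # log-likelihood term by term: sum_i [ x_i*log(1-pi) + log(pi) ]
    return np.sum(x * np.log(1 - pi) + np.log(pi))

def loglik_from_T(T, n, pi):
    # the same quantity written as a function of (T, n) only
    return n * np.log(pi) + T * np.log(1 - pi)

rng = np.random.default_rng(1)
pi = 0.3
# NumPy's geometric counts the trial of the first success; subtract 1
# to get the "number of failures" form with pmf (1 - pi)^x * pi.
x = rng.geometric(pi, size=20) - 1

print(np.isclose(loglik_from_sample(x, pi),
                 loglik_from_T(x.sum(), len(x), pi)))   # True
```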
The Neyman-Fisher factorization theorem is a very useful tool for finding sufficient statistics. The Wikipedia page has several examples.
(This answer assumes one form of the geometric distribution. The form of $T(x)$ would be the same for the other.)
Best Answer
Your definition of sufficiency is correct.
Sufficiency pertains to data reduction, not merely estimation. A sufficient statistic need not estimate anything. For example, if $X_1, \ldots, X_n$ are iid samples drawn from an exponential distribution with unknown mean $\theta$, then $\bar X$ is sufficient for $\theta$, but so is $(X_1 + \cdots + X_{n-1}, X_n)$. The former achieves greater data reduction; the latter achieves less, since it consists of two numbers. The former is itself an estimator of $\theta$; the latter does not estimate $\theta$ directly, and you would need to transform it somehow. You could, for example, take $X_n$ from this sufficient statistic as your estimator, but $X_n$ on its own is neither sufficient nor a particularly "good" estimator.
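For concreteness, the sufficiency of $\bar X$ here follows from the same factorization idea as in the geometric example: the joint density of an iid exponential sample with mean $\theta$ is $$f(x_1,\dots,x_n|\theta) = \prod_{i=1}^n \frac{1}{\theta}\, e^{-x_i/\theta} = \theta^{-n} e^{-n\bar x/\theta},$$ which depends on the data only through $\bar x$ (equivalently, only through $\sum_i x_i$).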
The point of sufficiency is that a statistic satisfying this property discards no information about the parameter, and as such, estimators based on a sufficient statistic are in a sense "good" ones to choose.
In regard to your second question, let's go back to the exponential example. A non-sufficient statistic that was mentioned was $X_n$. This statistic simply discards all the previous observations and keeps only the last. And yes, it does estimate $\theta$: note $\operatorname{E}[X_n] = \theta$ by definition, so it is even an unbiased estimator. But does it perform very well? No: its variance is $\theta^2$ regardless of the sample size, meaning that no matter how large a sample you choose, this estimator never gets any closer to the true value of $\theta$ on average. And of course, this makes intuitive sense: you've discarded all the previous observations.
A better estimator would be the mean of all the odd-numbered observations, e.g., $(X_1 + X_3 + \cdots + X_{2m-1})/m$, where $m$ is the number of odd-indexed observations; and yes, this too is an unbiased estimator of $\theta$. Still, you can see why it's not as good as the mean of all the observations. It does achieve data reduction, but since it is not a sufficient statistic, it "wastes" too much. That's what being able to show sufficiency gets you; if an estimator is sufficient, it isn't "wasteful."
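A small simulation sketch (NumPy; the variable names are just illustrative) makes the comparison concrete: with exponential data of mean $\theta$, the three estimators above are all roughly unbiased, but their variances come out near $\theta^2/n$, $2\theta^2/n$, and $\theta^2$ respectively.

```python
import numpy as np

rng = np.random.default_rng(2)
theta, n, reps = 2.0, 100, 20_000

x = rng.exponential(theta, size=(reps, n))

full_mean = x.mean(axis=1)           # sufficient: uses all the data
odd_mean  = x[:, ::2].mean(axis=1)   # unbiased but wasteful: odd-indexed observations only
last_obs  = x[:, -1]                 # unbiased, but never improves with n

for name, est in [("full mean", full_mean),
                  ("odd-indexed mean", odd_mean),
                  ("last observation", last_obs)]:
    print(f"{name:18s} mean ~ {est.mean():.3f}   var ~ {est.var():.4f}")
# Roughly: theta^2/n = 0.04, 2*theta^2/n = 0.08, theta^2 = 4.0
```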