You'd probably benefit from reading about sufficiency in any textbook on theoretical statistics, where most of these questions will be covered in detail. Briefly ...
Not necessarily. Those are special cases: among distributions whose support (the range of values the data can take) doesn't depend on the unknown parameter(s), only those in the exponential family have a sufficient statistic of the same dimensionality as the number of parameters. So for estimating the shape & scale of a Weibull distribution or the location & scale of a logistic distribution from independent observations, the order statistic (the whole set of observations disregarding their sequence) is minimal sufficient—you can't reduce it further without losing information about the parameters. Where the support does depend on the unknown parameter(s) it varies: for a uniform distribution on $(0,\theta)$, the sample maximum is sufficient for $\theta$; for a uniform distribution on $(\theta-1,\theta+1)$ the sample minimum and maximum are together sufficient.
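As a quick check of the first uniform case via the factorization criterion (writing $x_{(1)}$, $x_{(n)}$ for the sample minimum and maximum):
$$L(\theta; x_1,\ldots,x_n) = \prod_{i=1}^n \theta^{-1}\,\mathbf{1}(0 < x_i < \theta) = \theta^{-n}\,\mathbf{1}\big(x_{(n)} < \theta\big)\cdot\mathbf{1}\big(x_{(1)} > 0\big),$$
which depends on $\theta$ only through $x_{(n)}$.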
I don't know what you mean by "direct correspondence"; the alternative you give seems a fair way to describe sufficient statistics.
Yes: trivially the data as a whole are sufficient. (If you hear someone say there's no sufficient statistic they mean there's no low-dimensional one.)
Yes, that's the idea. (What's left—the distribution of the data conditional on the sufficient statistic—can be used for checking the distributional assumption independently of the unknown parameter(s).)
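For an illustration, take the Poisson example discussed below: conditional on the sufficient statistic, i.i.d. Poisson counts are multinomial with equal cell probabilities,
$$\Pr\!\left(X_1 = x_1, \ldots, X_n = x_n \,\Big|\, \textstyle\sum_i X_i = t\right) = \binom{t}{x_1,\ldots,x_n} n^{-t},$$
free of $\lambda$, so departures from that pattern speak to the fit of the Poisson assumption rather than to the value of $\lambda$.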
Apparently not, though I gather the counter-examples are not distributions you're likely to want to use in practice. [It'd be nice if anyone could explain this without getting too heavily into measure theory.]
In response to the further questions ...
The first factor, $ \mathrm{e}^{-n\lambda}\cdot\lambda^{\sum{x_i}}$, depends on $\lambda$ only through $\sum x_i$. So any one-to-one function of $\sum x_i$ is sufficient: $\sum x_i$, $\sum x_i/n$, $(\sum x_i)^2$†, & so on.
The second factor, $\tfrac{1}{x_1! x_2! \ldots x_n!}$, doesn't depend on $\lambda$ & so won't affect the value of $\lambda$ at which $f(x;\lambda)$ is a maximum. Derive the MLE & see for yourself.
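If you want the routine calculation spelled out: the log-likelihood is
$$\ell(\lambda) = -n\lambda + \left(\sum_i x_i\right)\log\lambda - \sum_i \log x_i!,$$
where the last term is constant in $\lambda$; setting $\ell'(\lambda) = -n + \sum_i x_i/\lambda = 0$ gives $\hat\lambda = \sum_i x_i/n = \bar{x}$, a function of the sufficient statistic alone.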
The sample size $n$ is a known constant rather than a realized value of a random variable‡, so isn't considered part of the sufficient statistic; the same goes for known parameters other than the ones you want to infer things about.
† In this case squaring is one-to-one because $\sum x_i$, being a sum of counts, is never negative.
‡ When $n$ is the realized value of a random variable $N$, it will be part of the sufficient statistic, $(\sum x_i,n)$. Say you choose a sample size of 10 or 100 by tossing a coin: $n$ tells you nothing about the value of $\lambda$ but does affect how precisely you can estimate it; in this case it's called an ancillary complement to $\sum x_i$ & inference can proceed by conditioning on its realized value—in effect ignoring that it might have come out differently.
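To put a number on the precision point: with Poisson data, $\operatorname{Var}(\bar X \mid N = n) = \lambda/n$, so conditioning on the realized sample size uses $\lambda/10$ or $\lambda/100$ as the relevant variance, not some average of the two over the coin toss.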
The symbols $\boldsymbol{x,y}$ refer to data. Associated with each possible value $v$ of the statistic $S$ is a collection of possible data values $S^{-1}(v)$ for which $S$ has the value $v$. Since $T$ has the same value (call it $w$) on every such dataset, we may define $H(v) = w$.
The case $S(\boldsymbol{x}) \ne S(\boldsymbol{y})$ is irrelevant to the theorem.
Let's interpret the theorem. Say that two datasets $\boldsymbol{x,y}$ are equivalent if the relative likelihood function
$$\theta \to \frac{L(\theta,\boldsymbol{x})}{L(\theta,\boldsymbol{y})}$$
is constant. This means that any analysis based on comparing likelihoods (for different values of $\theta$) will not make any distinction between any two equivalent datasets. The theorem informs us that a minimal sufficient statistic will not ever distinguish between two equivalent datasets (that is, it must have the same value on each).
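For a concrete instance, take i.i.d. Poisson($\lambda$) data with $\boldsymbol{x} = (1,3)$ and $\boldsymbol{y} = (2,2)$:
$$\frac{L(\lambda,\boldsymbol{x})}{L(\lambda,\boldsymbol{y})} = \frac{\mathrm{e}^{-2\lambda}\lambda^{4}/(1!\,3!)}{\mathrm{e}^{-2\lambda}\lambda^{4}/(2!\,2!)} = \frac{2}{3},$$
constant in $\lambda$, so the two datasets are equivalent, and any minimal sufficient statistic (here $\sum x_i = 4$) takes the same value on both.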
The proof of the theorem proceeds by noting that any two datasets having the same value of $S$ must be equivalent (provided that $S$ is sufficient) and therefore $T$ will have the same value on those datasets.
We might picture this by supposing that this equivalence relation among datasets partitions $\Omega$ into separate, non-overlapping components, each being a collection of equivalent datasets. Sufficient statistics have different values on different components: this guarantees that they can discriminate among inequivalent datasets. However, their values within any given component might vary (thereby discriminating among some equivalent datasets, too). Any minimal sufficient statistic, though, will be constant on each component: it will not discriminate between two equivalent datasets.
The following is a formal mathematical demonstration that $T$ is a function of $S$.
Let the set of all possible such data be $\Omega$. A statistic, such as $S$ or $T$, assigns some kind of mathematical object to each dataset $\boldsymbol{x}\in\Omega$--such as a number or vector--that we can calculate with. The details don't matter, so suppose $S$ assigns objects in a set $V$ and $T$ assigns objects in a set $W$. If the function $H$ exists, then it is a map from $V$ to $W$ (depending on $S$, by the way):
$$\begin{array}{rrcll}
\; &&\Omega \\
\; &^S\swarrow & &\searrow^T \\
\; V & &\xrightarrow{H} &&W \\
\end{array}$$
Given $T$ and $S$ as in the question, what we know is that
For all $\boldsymbol{x,y}\in\Omega$, $S(\boldsymbol{x}) = S(\boldsymbol{y})$ implies $T(\boldsymbol{x}) = T(\boldsymbol{y})$.
From this we would like to deduce the existence of a function $H:V\to W$ such that $T = H\circ S$: that is, $T(\boldsymbol{x}) = H(S(\boldsymbol{x}))$ for all $\boldsymbol{x}\in\Omega$.
$H$ can be found with an explicit construction. One way begins with a function $H^{*}$ on $V$ whose values are subsets of $W$, given by
$$H^{*}(v) = T(S^{-1}(v)) = \{T(\boldsymbol{x})\,|\, S(\boldsymbol{x}) = v\}.$$
I claim that all elements of $H^{*}(v)$ are equal, no matter what $v\in V$ might be. To prove the claim let $u, w\in H^{*}(v)$. By definition, this means there are $\boldsymbol{x,y}\in\Omega$ such that $u=T(\boldsymbol{x})$ and $w=T(\boldsymbol{y})$ and $S(\boldsymbol{x}) = S(\boldsymbol{y}) = v$. The latter equality implies $u = T(\boldsymbol{x}) = T(\boldsymbol{y}) = w$, proving the claim.
This claim enables us to define $H(v)$ whenever $H^{*}(v)$ is nonempty: it's the unique element of $H^{*}(v)$. To complete the definition of $H$, pick an arbitrary $w_0\in W$ and set $H(v) = w_0$ when $H^{*}(v)$ is empty. Formally,
$$H(v) = \left\{ \begin{array}{ll}
w & \text{if } H^{*}(v) = \{w\}\\
w_0 & \text{if } H^{*}(v) = \emptyset.
\end{array} \right. $$
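To see the construction in action, here is a minimal computational sketch on a toy finite sample space (the particular statistics $S$ and $T$ below are made-up illustrations, not anything from the question):

```python
from itertools import product

# Toy finite sample space Omega: all 0/1 datasets of length 3.
omega = list(product([0, 1], repeat=3))

def S(x):                       # a sufficient-style statistic, finer than necessary
    return (sum(x), x[0])

def T(x):                       # a coarser statistic: a function of S
    return sum(x)

# H*(v) = T(S^{-1}(v)) = { T(x) : S(x) = v }
H_star = {}
for x in omega:
    H_star.setdefault(S(x), set()).add(T(x))

# The claim in the text: each nonempty H*(v) is a singleton ...
assert all(len(ts) == 1 for ts in H_star.values())

# ... so H is well defined, and T = H o S holds on all of Omega.
H = {v: next(iter(ts)) for v, ts in H_star.items()}
assert all(T(x) == H[S(x)] for x in omega)
```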
Let the sample space be $\mathcal{X}$. Then a sufficient statistic $T$ can be seen as indexing a partition of $\mathcal{X}$, that is, $T(x)=T(y)$ iff (if and only if) $x,y$ belong to the same element of the partition. A minimally sufficient statistic then gives a maximal reduction of the data. That is to say, if $T$ is minimally sufficient and we take the partition corresponding to $T$, take two distinct elements of that partition, and make a new partition by replacing the two with their union, the resulting statistic is no longer sufficient.

So, any other sufficient statistic, say $S$, which is not minimal, will have a partition which is a refinement of the partition of $T$, that is, every element of the partition of $T$ is a union of elements of the partition of $S$ (this becomes easier to understand if you make a drawing from my text!). So, when you know the value of $S$, you know in which element of the partition of $S$ the sample point belongs, and also in which element of the partition of $T$ it belongs — since that partition is coarser. That is what it means when it says that $T$ is a function of every other sufficient statistic — every other sufficient statistic gives more information (or the same information) about the sample than $T$ does.
Definition: A partition of $\mathcal{X}$ is a collection of subsets $\{\mathcal{X}_\alpha\}$ of $\mathcal{X}$ such that $\bigcup_{\alpha} \mathcal{X}_\alpha = \mathcal{X}$ and $\mathcal{X}_\alpha \cap \mathcal{X}_\beta = \emptyset$ unless the two elements of the partition are identical, that is, $\alpha=\beta$.
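To make the partition picture concrete, here's a small sketch on the toy sample space $\{0,1\}^3$ (assuming i.i.d. Bernoulli($p$) data, for which $\sum_i x_i$ is minimal sufficient; the non-minimal statistic is made up for illustration):

```python
from itertools import product

omega = list(product([0, 1], repeat=3))       # sample space X = {0,1}^3

def partition(stat):
    """Group sample points by the value of a statistic."""
    blocks = {}
    for x in omega:
        blocks.setdefault(stat(x), []).append(x)
    return blocks

T_blocks = partition(lambda x: sum(x))            # minimal sufficient for p
S_blocks = partition(lambda x: (sum(x), x[0]))    # sufficient but not minimal

# T's partition has 4 blocks; S's has 6 and refines it: every block of
# T is a union of blocks of S, so the value of S determines the value
# of T (T is a function of S), but not vice versa.
for value, block in T_blocks.items():
    print(value, block)
```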