The symbols $\boldsymbol{x,y}$ refer to data. Associated with each possible value $v$ of the statistic $S$ is a collection of possible data values $S^{-1}(v)$ for which $S$ has the value $v$. Since $T$ has the same value (call it $w$) on every such dataset, we may define $H(v) = w$.
The case $S(\boldsymbol{x}) \ne S(\boldsymbol{y})$ is irrelevant to the theorem.
Let's interpret the theorem. Say that two datasets $\boldsymbol{x,y}$ are equivalent if the relative likelihood function
$$\theta \mapsto \frac{L(\theta,\boldsymbol{x})}{L(\theta,\boldsymbol{y})}$$
is constant. This means that any analysis based on comparing likelihoods (for different values of $\theta$) will not make any distinction between any two equivalent datasets. The theorem informs us that a minimal sufficient statistic will not ever distinguish between two equivalent datasets (that is, it must have the same value on each).
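To make this concrete in the i.i.d. Bernoulli$(\theta)$ model (the example discussed at the end of this thread), the likelihood of a binary dataset $\boldsymbol{x} = (x_1, \ldots, x_n)$ is $L(\theta,\boldsymbol{x}) = \theta^{\sum x_i}(1-\theta)^{n-\sum x_i}$, whence

$$\frac{L(\theta,\boldsymbol{x})}{L(\theta,\boldsymbol{y})} = \theta^{\sum x_i - \sum y_i}\,(1-\theta)^{\sum y_i - \sum x_i}.$$

This is constant in $\theta$ exactly when $\sum x_i = \sum y_i$: two Bernoulli datasets are equivalent precisely when they share the same number of successes.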
The proof of the theorem proceeds by noting that any two datasets having the same value of $S$ must be equivalent (provided that $S$ is sufficient) and therefore $T$ will have the same value on those datasets.
We might picture this by supposing that this equivalence relation among datasets partitions $\Omega$ into separate, non-overlapping components, each being a collection of equivalent datasets. Sufficient statistics have different values on different components: this guarantees that they can discriminate among inequivalent datasets. However, their values within any given component might vary (thereby discriminating among some equivalent datasets, too). Any minimal sufficient statistic, though, will be constant on each component: it will not discriminate between two equivalent datasets.
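Here is a short Python sketch of that picture (all names are my own, invented for illustration): it groups the eight binary datasets of length $3$ into components by numerically testing whether the relative likelihood under the Bernoulli model is constant across a few values of $\theta$.

```python
from itertools import product

def likelihood(theta, x):
    """Bernoulli likelihood: theta^(#successes) * (1 - theta)^(#failures)."""
    s = sum(x)
    return theta**s * (1 - theta)**(len(x) - s)

def equivalent(x, y, thetas=(0.2, 0.5, 0.8)):
    """Check numerically whether L(theta, x) / L(theta, y) is constant in theta."""
    ratios = [likelihood(t, x) / likelihood(t, y) for t in thetas]
    return all(abs(r - ratios[0]) < 1e-12 for r in ratios)

# Partition Omega = {0,1}^3 into components of mutually equivalent datasets.
components = []
for x in product((0, 1), repeat=3):
    for comp in components:
        if equivalent(x, comp[0]):
            comp.append(x)          # joins an existing component
            break
    else:
        components.append([x])      # starts a new component

for comp in components:
    print(comp)
# Each component collects exactly the datasets with a common success count,
# so the sufficient statistic sum(x) is constant on each component.
```

Running it prints four components, one for each possible success count $0, 1, 2, 3$.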
The following is a formal mathematical demonstration that $T$ is a function of $S$.
Let the set of all possible such data be $\Omega$. A statistic, such as $S$ or $T$, assigns to each dataset $\boldsymbol{x}\in\Omega$ some kind of mathematical object that we can calculate with, such as a number or vector. The details don't matter, so suppose $S$ assigns objects in a set $V$ and $T$ assigns objects in a set $W$. If the function $H$ exists, then it is a map from $V$ to $W$ (depending on $S$, by the way):
$$\begin{array}{ccc}
 & \Omega & \\
{\scriptstyle S}\swarrow & & \searrow{\scriptstyle T} \\
V & \xrightarrow{\;\;H\;\;} & W
\end{array}$$
Given $T$ and $S$ as in the question, what we know is that

$$S(\boldsymbol{x}) = S(\boldsymbol{y}) \implies T(\boldsymbol{x}) = T(\boldsymbol{y}) \quad\text{for all } \boldsymbol{x,y}\in\Omega. \tag{$*$}$$
From this we would like to deduce the existence of a function $H:V\to W$ such that $T = H\circ S$: that is, $T(\boldsymbol{x}) = H(S(\boldsymbol{x}))$ for all $\boldsymbol{x}\in\Omega$.
$H$ can be found with an explicit construction. One way begins with a function $H^{*}$, defined on $V$, whose values are subsets of $W$:
$$H^{*}(v) = T(S^{-1}(v)) = \{T(\boldsymbol{x})\,|\, S(\boldsymbol{x}) = v\}.$$
I claim that all elements of $H^{*}(v)$ are equal, no matter what $v\in V$ might be. To prove the claim, let $u, w\in H^{*}(v)$. By definition, this means there are $\boldsymbol{x,y}\in\Omega$ such that $u=T(\boldsymbol{x})$, $w=T(\boldsymbol{y})$, and $S(\boldsymbol{x}) = S(\boldsymbol{y}) = v$. The latter equality, combined with $(*)$, implies $u = T(\boldsymbol{x}) = T(\boldsymbol{y}) = w$, proving the claim.
This claim enables us to define $H(v)$ whenever $H^{*}(v)$ is nonempty: it's the unique element of $H^{*}(v)$. To complete the definition of $H$, pick an arbitrary $w_0\in W$ and set $H(v) = w_0$ when $H^{*}(v)$ is empty. Formally,
$$H(v) = \left\{ \begin{array}{ll}
w & \text{if } H^{*}(v) = \{w\}\\
w_0 & \text{if } H^{*}(v) = \emptyset.
\end{array} \right. $$

For every $\boldsymbol{x}\in\Omega$, the value $T(\boldsymbol{x})$ belongs to $H^{*}(S(\boldsymbol{x}))$ by definition, so the first case applies and $H(S(\boldsymbol{x})) = T(\boldsymbol{x})$. That is, $T = H\circ S$, as desired.
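To watch the construction run, here is a minimal Python sketch (the names and the choice of statistics are mine, purely for illustration): it takes $S(\boldsymbol{x}) = \sum x_i$ and $T(\boldsymbol{x}) = \frac{1}{n}\sum x_i$ on binary samples, tabulates $H^{*}$, checks that each $H^{*}(v)$ is a singleton, and verifies $T = H\circ S$.

```python
from itertools import product
from fractions import Fraction

n = 4
omega = list(product((0, 1), repeat=n))    # all 2^n possible datasets

S = lambda x: sum(x)                       # a sufficient statistic (the sum)
T = lambda x: Fraction(sum(x), len(x))     # the sample proportion, a function of the sum

# H*(v) = T(S^{-1}(v)): the set of T-values over all datasets with S(x) = v.
H_star = {}
for x in omega:
    H_star.setdefault(S(x), set()).add(T(x))

# The claim: every H*(v) is a singleton, so H is well defined.
assert all(len(values) == 1 for values in H_star.values())
H = {v: next(iter(values)) for v, values in H_star.items()}

# Verify T = H o S pointwise on Omega.
assert all(H[S(x)] == T(x) for x in omega)
print(H)   # H maps each attainable sum v to the proportion v/n

# The fallback value w_0 is never needed here, because we only tabulate
# values v that are actually attained by S.
```

Here $T$ is a function of the sum by construction; the point of the sketch is that the generic recipe recovers that function automatically.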
Let the sample space be $\mathcal{X}$. Then a sufficient statistic $T$ can be seen as indexing a partition of $\mathcal{X}$; that is, $T(x)=T(y)$ if and only if $x$ and $y$ belong to the same element of the partition. A minimally sufficient statistic then gives a maximal reduction of the data: if $T$ is minimally sufficient and we take the partition corresponding to $T$, pick two distinct elements of it, and form a new partition by replacing those two with their union, the resulting statistic is no longer sufficient.

So any other sufficient statistic, say $S$, which is not minimal, will have a partition which is a refinement of the partition of $T$; that is, every element of the partition of $T$ is a union of elements of the partition of $S$ (this becomes easier to understand if you make a drawing from my text!). So when you know the value of $S$, you know in which element of the partition of $S$ the sample point lies, and therefore also in which element of the partition of $T$ it lies, since that partition is coarser. That is what it means when it says that $T$ is a function of every other sufficient statistic: every other sufficient statistic gives more information (or the same information) about the sample than $T$ does.
Definition: A partition of $\mathcal{X}$ is a collection $\{\mathcal{X}_\alpha\}$ of subsets of $\mathcal{X}$ such that $\bigcup_{\alpha} \mathcal{X}_\alpha = \mathcal{X}$ and $\mathcal{X}_\alpha \cap \mathcal{X}_\beta = \emptyset$ unless the two elements of the partition are identical, that is, $\alpha=\beta$.
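A short Python sketch of this refinement relationship (names invented for illustration): with $S$ the identity statistic and $T$ the sum on binary samples of length $3$, every block of $S$'s partition sits inside a block of $T$'s partition, so each block of $T$'s partition is a union of blocks of $S$'s.

```python
from itertools import product

def partition(statistic, sample_space):
    """Group sample points into blocks by the value of the statistic."""
    blocks = {}
    for x in sample_space:
        blocks.setdefault(statistic(x), set()).add(x)
    return list(blocks.values())

space = list(product((0, 1), repeat=3))
part_S = partition(lambda x: x, space)   # identity statistic: the finest partition
part_T = partition(sum, space)           # the sum: a coarser partition

# S's partition refines T's: every block of part_S lies inside some block of
# part_T, so each block of part_T is a union of blocks of part_S.
assert all(any(b <= c for c in part_T) for b in part_S)
```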
As you stated:
So, for example, consider the case where $X_1, \ldots, X_n$ are independent Bernoulli random variables. We can prove that $(x_1, \ldots, x_n)$ is not minimally sufficient by showing that it is not a function of $\sum x_i$. This is obvious, since such a function would have to map $1$ to both $(1,0,0,\ldots,0,0,0)$ and $(0,0,0,\ldots,0,0,1)$.
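In the notation of the explicit construction above, this says that $H^{*}(1) = T(S^{-1}(1))$ contains more than one element when $S = \sum x_i$ and $T$ is the identity, so no well-defined $H$ exists. A quick Python check (with an arbitrary illustrative choice of $n$):

```python
from itertools import product

n = 6
# H*(1) = T(S^{-1}(1)) with S = sum and T the identity:
# every dataset with exactly one success.
H_star_1 = {x for x in product((0, 1), repeat=n) if sum(x) == 1}
print(len(H_star_1))   # prints 6: multiple values, so no single H(1) can serve
```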