This response addresses part of grautur's question: how to think about the property of being "sufficient".
R. A. Fisher suggested the terminology "sufficient" for a statistic $t$ that satisfies the [heuristic] requirement $$f(x|\theta, t) = f(x|t).\tag{1}$$
[There is nothing heuristic about this way of writing the requirement if $x$ is discrete. But if $x$ is continuous, the pdf notation $f(x|t)$ is usually heuristic. An example of the latter is when $x = (x_1,\dots, x_n)$, where the $\{x_i\}$ are iid N($\theta$,1). Here the sample mean $\bar x$ is a sufficient statistic, and $(1)$ requires considering the conditional pdf $$f(x_1,\dots, x_n|\bar x) \equiv f(x_1 - \bar x,\dots, x_n - \bar x | \bar x) = f(x_1 - \bar x,\dots, x_n - \bar x).\tag{2}$$ The last equality in $(2)$ holds since the sample deviations are independent of the sample mean in the normal case. (This fact gives an instant proof that the sample mean and variance are independent in the normal case.) Here the "pdf" on the RHS of $(2)$ does not exist as an $n$-dimensional pdf, since the joint distribution of the sample deviations is singular: they sum to zero. Replacing "$f$" in $(1)$ by "dist" removes the heuristic nature of the requirement.]
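For completeness, here is the covariance calculation behind the independence claim (a standard fact, stated here for the unit-variance case used above): the vector $(\bar x,\, x_1 - \bar x, \dots, x_n - \bar x)$ is a linear function of a multivariate normal vector, hence jointly normal, and $$\operatorname{Cov}(\bar x,\, x_i - \bar x) = \operatorname{Cov}(\bar x, x_i) - \operatorname{Var}(\bar x) = \tfrac{1}{n} - \tfrac{1}{n} = 0,$$ so zero covariance implies the deviations are independent of $\bar x$.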
At any rate, the idea of $(1)$ is that, since the conditional distribution of $x|t$ does not depend on $\theta$, one can [in principle] generate a new $x$ (call it $x^*$) from the known conditional distribution of $x|t$, so that $x^* \sim x$ for all $\theta$.
As an illustration, consider the normal example above. Here, obtaining an $x^*$ is particularly easy since the deviations are independent of $\bar x$: one can generate $n$ iid N(0,1) variables $z_1,\dots, z_n$ and let $x^* = (x_1^*,\dots, x_n^*)$, where $x_i^* = z_i - \bar z + \bar x$ for $1\le i\le n$.
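This construction is easy to simulate. The sketch below is mine (the sample size, seed, and value of $\theta$ are arbitrary illustrative choices, not part of the original argument):

```python
import numpy as np

rng = np.random.default_rng(0)   # arbitrary seed for reproducibility

n, theta = 10, 2.5               # arbitrary sample size and true mean
x = rng.normal(theta, 1.0, n)    # the original sample: iid N(theta, 1)
x_bar = x.mean()                 # sufficient statistic

# Regenerate the deviations from a completely extraneous experiment:
# draw iid N(0, 1) noise, center it, and add back the observed mean.
z = rng.normal(0.0, 1.0, n)
x_star = z - z.mean() + x_bar    # x_i* = z_i - z_bar + x_bar

# x* carries the same sufficient statistic as x ...
print(x_star.mean(), x_bar)
# ... and, over repeated draws of z, x* has the same distribution as x.
```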
Clearly $x \sim x^*$ for all values of $\theta$, so $x^*$ is just as "good" a sample from the population as the original $x$ for learning about $\theta$. Clearly no one would actually want to use the sample deviations of $x^*$ in addition to its sufficient statistic $\bar x^* = \bar x$, as those deviations, the $\{z_i - \bar z\}$, were obtained by a completely extraneous random experiment having nothing to do with the actual data process. One should then agree that the original sample deviations $\{x_i - \bar x\}$ should also not be used, since they can be regarded as generated in the same way, by an extraneous random experiment, where now nature, rather than the statistician, did the generating.
This is a late answer, but given the ubiquity of the Casella and Berger text, the question seems worth addressing.
My copy of Casella and Berger reads the same as yours: monotone likelihood ratio is explicitly defined as either a non-increasing or non-decreasing ratio in Definition 8.3.16, but the statement of Karlin–Rubin in Theorem 8.3.17 is not consistent with this definition. This allows for precisely the sort of contradictory results you describe.
You'll note that the proof of the theorem given in the text actually depends on an increasing ratio, as seen in this line:
$$
T > t_0 \iff \frac{g(t|\theta')}{g(t|\theta_0)} > k'
$$
This line is corrected to the following in the errata, but this correction is unrelated to your question and still assumes a non-decreasing ratio:
$$
\left\{ t : \frac{g(t|\theta')}{g(t|\theta_0)} > k' \right \} \subset \{t : t > t_0\} \subset \left\{ t : \frac{g(t|\theta')}{g(t|\theta_0)} \geq k' \right \}
$$
I believe that the definition of MLR given there is wrong ...
I don't think the MLR definition is necessarily wrong. Looking around, I see that different references define MLR differently: some require monotonicity in a particular direction, and some allow either direction.
An example of a textbook that defines MLR similarly to Casella and Berger is Introduction to Mathematical Statistics by Hogg, McKean, and Craig, where the ratio is allowed to be non-decreasing or non-increasing. (The ratio they consider is the reciprocal of the one given in Casella and Berger, but both allow for monotonicity in either direction.) The construction of a UMP test is then described for the case when the ratio is decreasing, with a note stating that the inequalities flip if the ratio is increasing:
Assume that our likelihood function $L(\theta, \mathbf{x})$ has a monotone decreasing likelihood ratio in the statistic $y = u(\mathbf{x})$. Then the ratio in (2.2) is equal to $g(y)$, where $g$ is a decreasing function. The case where the likelihood function has a monotone increasing likelihood ratio (i.e., $g$ is an increasing function) follows similarly by changing the sense of the inequalities below.
In short, the direction of the inequality you need for your rejection region depends on the direction of the MLR. Unfortunately, the theorem in Casella and Berger doesn't state this directly.
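To make the direction issue concrete, here is a minimal sketch (the exponential family and the parameter values are my own illustrative choices, not taken from either text):

```python
import numpy as np
from scipy import stats

# Exponential distribution with rate parameter theta (illustrative choice).
theta0, theta1 = 1.0, 2.0          # theta1 > theta0
t = np.linspace(0.1, 5.0, 50)

ratio = stats.expon.pdf(t, scale=1/theta1) / stats.expon.pdf(t, scale=1/theta0)

# The ratio (theta1/theta0) * exp(-(theta1 - theta0) * t) is monotone
# DECREASING in t, so this family has MLR in the "non-increasing" sense.
assert np.all(np.diff(ratio) < 0)

# Consequently, a UMP test of H0: theta <= theta0 vs H1: theta > theta0
# rejects for SMALL values of t ("t < t0"): the inequality in the
# Karlin-Rubin statement must be flipped relative to the non-decreasing case.
```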
They are trying to clarify that the sufficiency principle and the likelihood principle are similarly structured. They state a general template for a principle as follows. (I will use $S$ instead of $T$ so as not to conflate it with the notation used when stating the sufficiency principle.)
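The template itself is not reproduced above; paraphrasing the formal-principle structure in Casella and Berger (the exact wording here is my reconstruction, not a quotation): for an experiment $E = (\mathbf{X}, \theta, \{f(\mathbf{x}|\theta)\})$ and a statistic $S(\mathbf{X})$, if $\mathbf{x}$ and $\mathbf{y}$ are two sample points satisfying $S(\mathbf{x}) = S(\mathbf{y})$, then the conclusions drawn about $\theta$ should be the same, $\mathrm{Ev}(E, \mathbf{x}) = \mathrm{Ev}(E, \mathbf{y})$.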
For different choices of $S$ you will get different principles. The authors explain how the sufficiency principle and the likelihood principle are special cases of the above general principle for particular choices of $S$.