Consider the second tentative statement by the OP, slightly modified,
$$\forall \theta\in \Theta, \epsilon>0, \delta>0, S_n, \exists n_0(\theta, \epsilon, \delta): \forall n \geq n_0,\;\\P_n\big[|\hat \theta(S_n) - \theta^*|\geq \epsilon \big] < \delta \tag{1}$$
We are examining the sequence of real numbers, bounded in $[0,1]$,
$$\big\{ P_n\big[|\hat\theta(S_n) - \theta^*|\geq \epsilon \big]\big\}$$
indexed by $n$. If this sequence has a limit as $n\rightarrow \infty$, call it simply $p$, then we will have
$$\forall \theta\in \Theta, \epsilon>0, \delta>0, S_n,\,\exists n_0(\theta, \epsilon, \delta): \forall n \geq n_0,\;\\\Big| P_n\big[|\hat\theta(S_n) - \theta^*|\geq \epsilon \big] -p\Big|< \delta \tag{2}$$
So if we assume (or require) $(1)$, we essentially assume (or require) that the limit as $n\rightarrow \infty$ exists and is equal to zero, $p=0$.
So $(1)$ reads "the limit of $P_n\big[|\hat\theta(S_n) - \theta^*|\geq \epsilon\big]$ as $n\rightarrow \infty$ is $0$", which is exactly the current definition of consistency (and yes, it covers "all possible samples").
So it appears that the OP essentially proposed an alternative expression for the exact same property of the estimator, not a different one.
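To make the sequence in question concrete, here is a small numerical sketch (my own choice of example, assuming SciPy is available): for the sample mean of an i.i.d. $N(\theta^*, 1)$ sample, the probability can be computed exactly as $2\big(1 - \Phi(\sqrt{n}\,\epsilon)\big)$, and one can watch it decrease towards $0$.

```python
# Numerical sketch (my own example, not from the original discussion):
# for theta_hat(S_n) = X_bar_n with X_i ~ N(theta*, 1),
# P_n[|theta_hat - theta*| >= eps] = 2 * (1 - Phi(sqrt(n) * eps)).
from scipy.stats import norm

eps = 0.1
for n in [10, 100, 1000, 10000]:
    p_n = 2 * (1 - norm.cdf(eps * n ** 0.5))
    print(f"n = {n:>6}:  P_n[|theta_hat - theta*| >= {eps}] = {p_n:.6f}")
```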
ADDENDUM (forgot the history part)
In his "Foundations of the Theory of Probability" (1933), Kolmogorov mentions in a footnote that (the concept of convergence in probability)
"...is due to Bernoulli;its completely general treatment was
introduced by E.E.Slutsky".
(in 1925). The work of Slutsky is in German, so there may even be an issue of how the German word was translated into English (or of the term used by Bernoulli). But don't try to read too much into a word.
In that paragraph the authors are giving an extreme example to show how being unbiased doesn't mean that a random variable is converging on anything.
The authors are taking a random sample $X_1,\dots, X_n \sim \mathcal N(\mu,\sigma^2)$ and want to estimate $\mu$. Noting that $E(X_1) = \mu$, we could produce an unbiased estimator of $\mu$ by just ignoring all of our data except the first point $X_1$. But that's clearly a terrible idea, so unbiasedness alone is not a good criterion for evaluating an estimator. Somehow, as we get more data, we want our estimator to vary less and less from $\mu$, and that's exactly what consistency says: for any distance $\varepsilon$, the probability that $\hat \theta_n$ is more than $\varepsilon$ away from the target $\theta$ heads to $0$ as $n \to \infty$. And this can happen even if $\hat \theta_n$ is biased for every finite $n$. An example of this is the variance estimator $\hat \sigma^2_n = \frac 1n \sum_{i=1}^n(X_i - \bar X_n)^2$ in a normal sample: its expectation is $\frac{n-1}{n}\sigma^2$ rather than $\sigma^2$, so it is biased, but it is consistent.
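If it helps to see this numerically, here is a rough simulation sketch (my own code, using NumPy; the sample sizes and $\varepsilon$ are arbitrary choices): the divide-by-$n$ variance estimator stays biased downwards by roughly $\sigma^2/n$, yet concentrates around $\sigma^2$ as $n$ grows.

```python
# Rough simulation sketch: the divide-by-n variance estimator is biased
# (E[sigma2_hat] = (n-1)/n * sigma^2) but consistent (it concentrates around sigma^2).
import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, reps, eps = 0.0, 4.0, 4000, 0.5

for n in [5, 50, 500, 2000]:
    x = rng.normal(mu, np.sqrt(sigma2), size=(reps, n))
    sigma2_hat = x.var(axis=1)                 # divide-by-n estimator (ddof=0)
    bias = sigma2_hat.mean() - sigma2          # roughly -sigma2 / n
    miss = np.mean(np.abs(sigma2_hat - sigma2) >= eps)
    print(f"n = {n:>4}:  bias ~ {bias:+.4f},  P(|error| >= {eps}) ~ {miss:.3f}")
```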
Intuitively, a statistic is unbiased if it exactly equals the target quantity when averaged over all possible samples. But we know that the average of a bunch of things doesn't have to be anywhere near the things being averaged; this is just a fancier version of how the average of $0$ and $1$ is $1/2$, although neither $0$ nor $1$ is particularly close to $1/2$ (depending on how you measure "close").
Here's another example (although this is almost just the same example in disguise). Let $X_1 \sim \text{Bern}(\theta)$ and let $X_2 = X_3 = \dots = X_1$. Our estimator of $\theta$ will be $\hat \theta(X) = \bar X_n$. Note that $E \bar X_n = \theta$, so we do indeed have an unbiased estimator. But $\bar X_n = X_1 \in \{0,1\}$, so this estimator definitely isn't converging on anything close to $\theta \in (0,1)$, and for every $n$ we actually still have $\bar X_n \sim \text{Bern}(\theta)$.
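And a quick sketch of this copied-coin-flip example (my own code, using NumPy): averaged over many simulated samples, $\bar X_n$ is close to $\theta$, yet each individual $\bar X_n$ is either $0$ or $1$, no matter how large $n$ is.

```python
# Sketch of the X_2 = X_3 = ... = X_1 example: the estimator is unbiased
# (its average over simulated samples is near theta), but each realisation of
# X_bar_n equals X_1 in {0, 1}, so it never gets close to theta itself.
import numpy as np

rng = np.random.default_rng(1)
theta, reps = 0.3, 100000

x1 = rng.binomial(1, theta, size=reps)    # X_1 for each simulated sample
xbar_n = x1.astype(float)                 # X_bar_n = X_1 for every n

print("mean of X_bar_n over samples ~", xbar_n.mean())                           # ~ theta
print("P(|X_bar_n - theta| >= 0.1)  =", np.mean(np.abs(xbar_n - theta) >= 0.1))  # stays 1
```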
Best Answer
Suppose $X_1, X_2, \cdots X_n \stackrel{iid}{\sim} N(\mu, 1)$ and our goal is to estimate $\mu$. Consider the estimator $$\hat\mu_n = \begin{cases} \bar X_n, & \text{with probability $\frac{n-1}{n}$} \\[1.2ex] n, & \text{with probability $\frac{1}{n}$} \end{cases}$$
By introducing $Y_n \sim \text{Bern}\left(\frac{n-1}{n}\right)$ with $Y_i \perp \!\!\! \perp Y_j$ (for all $i\neq j$) and $Y_i \perp \!\!\! \perp X_j$ (for all $i$ and $j$) we can rewrite this estimator as $$\hat\mu_n = \bar X_n Y_n + n(1-Y_n) .$$
This estimator is consistent (in probability).
To see this, consider \begin{align*} P(|\hat\mu_n - \mu| < \epsilon) &= P(|\bar X_n - \mu| < \epsilon)\frac{n-1}{n} + P(|n-\mu|<\epsilon)\frac{1}{n} \\[1.3ex] &=\frac{n-1}{n}\left\{\Phi\left(\sqrt n \epsilon \right)- \Phi\left(-\sqrt n \epsilon \right) \right\} + P(|n-\mu|<\epsilon)\frac{1}{n} \end{align*}
Taking the limit gives $$\lim_{n\rightarrow\infty} P(|\hat\mu_n - \mu| < \epsilon) = 1$$ (the first term tends to $1$, while $P(|n-\mu|<\epsilon) = 0$ as soon as $n \geq \mu + \epsilon$), and thus $\hat\mu_n \stackrel{p}{\rightarrow} \mu$.
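A quick simulation sketch of this estimator (my own code, using NumPy; since $\bar X_n \sim N(\mu, 1/n)$ exactly, it is drawn directly rather than built from the raw sample):

```python
# Simulation sketch: mu_hat_n equals X_bar_n with probability (n-1)/n and equals n
# with probability 1/n.  P(|mu_hat_n - mu| >= eps) should shrink to 0 as n grows.
import numpy as np

rng = np.random.default_rng(2)
mu, reps, eps = 1.0, 200000, 0.2

for n in [10, 100, 1000, 10000]:
    xbar = rng.normal(mu, 1 / np.sqrt(n), size=reps)    # X_bar_n ~ N(mu, 1/n)
    y = rng.binomial(1, (n - 1) / n, size=reps)          # Y_n ~ Bern((n-1)/n)
    mu_hat = xbar * y + n * (1 - y)
    miss = np.mean(np.abs(mu_hat - mu) >= eps)
    print(f"n = {n:>5}:  P(|mu_hat - mu| >= {eps}) ~ {miss:.4f}")
```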
This estimator is not MSE-consistent.
We start by finding the bias.
\begin{align*} E(\hat\mu_n) &= E(\bar X_n Y_n + n(1-Y_n)) \\[1.2ex] &= E(\bar X_n)E(Y_n) + nE(1-Y_n) \\[1.2ex] &= \mu\frac{n-1}{n} + n\frac{1}{n} \\[1.2ex] &= \frac{n-1}{n}\mu + 1 \end{align*}
Therefore $B_\mu(\hat\mu_n) = E(\hat\mu_n) - \mu = 1 - \mu/n$, which does not converge to $0$ as $n \rightarrow \infty$. Since $MSE(\hat\mu_n) = Var(\hat\mu_n) + B_\mu(\hat\mu_n)^2$ and $Var(\hat\mu_n)$ must be non-negative, this is enough to conclude that $MSE(\hat\mu_n) \not\rightarrow 0$, and therefore the estimator is not MSE-consistent.
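For completeness, a direct computation (my own addition), conditioning on $Y_n$, shows that the MSE in fact diverges rather than merely failing to vanish:
$$MSE(\hat\mu_n) = E\big[(\bar X_n - \mu)^2\big]\frac{n-1}{n} + (n-\mu)^2\frac{1}{n} = \frac{n-1}{n^2} + \frac{(n-\mu)^2}{n} \longrightarrow \infty .$$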