I wrote up a more detailed derivation of the soft-thresholding operator, following the source you mention and a few others. I hope it helps.
Soft-thresholding is just the proximal mapping of the $l_1$-norm. Let $f(x) = \lambda\|x\|_1$; then the proximal mapping of $f$ is defined as
\begin{equation}
\operatorname{prox}_f(x) = \operatorname{argmin}_z\{\frac{1}{2}\|x-z\|^2_2 + \lambda\|z\|_1 \}
\end{equation}
The optimality condition for the previous problem is
\begin{equation}
0 \in \nabla(\frac{1}{2}\|x-z\|^2_2) + \partial(\lambda\|z\|_1) \Leftrightarrow 0 \in z-x + \lambda\partial\|z\|_1
\end{equation}
The $l_1$-norm is separable, so we can consider each of its components separately. Let's first examine the case $z_i \neq 0$. There the subdifferential reduces to the ordinary derivative, $\partial |z_i| = \{\operatorname{sign}(z_i)\}$, and the optimum $z_i^*$ is obtained from
\begin{equation}
0 = z_i^*-x_i + \lambda \operatorname{sign}(z_i^*) \Leftrightarrow z_i^* = x_i - \lambda \operatorname{sign}(z_i^*)
\end{equation}
Note that if $z_i^* < 0$ then $x_i < -\lambda$, and similarly $z_i^* > 0$ implies $x_i > \lambda$. In both cases $|x_i| > \lambda$ and $\operatorname{sign}(z_i^*) = \operatorname{sign}(x_i)$. Substituting into the previous equation we get
\begin{equation}
z_i^* = x_i - \lambda \operatorname{sign}(x_i)
\end{equation}
In the case $z_i = 0$, the subdifferential of $|z_i|$ is the whole interval $[-1,1]$, and the optimality condition becomes
\begin{equation}
0 \in -x_i + \lambda[-1,1] \Leftrightarrow x_i \in [-\lambda,\lambda] \Leftrightarrow |x_i| \leq \lambda
\end{equation}
Putting it all together we get
\begin{equation}
[\operatorname{prox}_f(x)]_i = z_i^* = \left\{ \begin{array}{lr} 0 & \text{if } |x_i| \leq \lambda \\ x_i - \lambda \operatorname{sign}(x_i) & \text{if } |x_i| > \lambda \end{array}\right.
\end{equation}
The previous equation can also be written as
\begin{align*}
[\operatorname{prox}_f(x)]_i &= \operatorname{sign}(x_i)\max(|x_i|-\lambda, 0) \\
&= \operatorname{sign}(x_i)(|x_i|-\lambda)_+
\end{align*}
where $(\cdot)_+$ denotes the positive part.
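If it helps to see the closed form in action, here is a minimal NumPy sketch (the function name `soft_threshold` is my own) that implements the formula above and checks it against a brute-force grid minimisation of the prox objective:

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal mapping of f(z) = lam * ||z||_1, applied componentwise:
    sign(x_i) * (|x_i| - lam)_+ ."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

# Sanity check: for scalar x, the closed form should match a brute-force
# minimisation of 0.5*(x - z)^2 + lam*|z| over a fine grid of z values.
rng = np.random.default_rng(0)
lam = 0.7
for x in rng.normal(scale=2.0, size=5):
    z = np.linspace(-10.0, 10.0, 400001)
    objective = 0.5 * (x - z) ** 2 + lam * np.abs(z)
    z_brute = z[np.argmin(objective)]
    assert abs(soft_threshold(x, lam) - z_brute) < 1e-3
```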
Okay, the proof is actually simple from this point. Restrict the step size to $\lambda \in (0, \frac{2}{\beta})$; this range is chosen because at $\lambda = \frac{2}{\beta}$ the contraction factor derived below equals $1$, so the map is merely non-expansive (this is easy to verify).
Recall the bound we had earlier:
$$
\|Tx - Ty \|^{2} \leq \left(1 - 2\lambda\left(\frac{\alpha\beta}{\alpha + \beta}\right)\right)\|x - y\|^{2} + \left(\lambda^{2} - \frac{2\lambda}{\alpha + \beta}\right)\|\nabla f(x) - \nabla f(y)\|^{2}
$$
When $\lambda \leq \frac{2}{\alpha +\beta}$, the second term on the right-hand side is nonpositive and thus we can replace $\|\nabla f(x) - \nabla f(y)\|$ with a lower bound. We will use the lower bound that follows trivially from $\alpha$-strong convexity:
$$
\|\nabla f(x) - \nabla f(y)\| \geq \alpha \|x - y\|
$$
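(For completeness: $\alpha$-strong convexity gives $\langle \nabla f(x) - \nabla f(y),\, x - y\rangle \geq \alpha\|x - y\|^{2}$, and applying Cauchy–Schwarz to the left-hand side yields the displayed bound.)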
Plugging this in then gives:
$$
\|Tx - Ty \|^{2} \leq \left(1 - 2\lambda\left(\frac{\alpha\beta}{\alpha + \beta}\right)\right)\|x - y\|^{2} + \left(\lambda^{2}\alpha^{2} - \frac{2\alpha^{2}\lambda}{\alpha + \beta}\right)\|x - y\|^{2}
$$
Collecting the coefficients of $\|x - y\|^{2}$, and noting that $\frac{2\lambda\alpha\beta}{\alpha + \beta} + \frac{2\lambda\alpha^{2}}{\alpha + \beta} = 2\lambda\alpha$, this simplifies to:
$$
\|Tx - Ty\|^{2} \leq (\lambda^{2}\alpha^{2} - 2\lambda\alpha + 1)\|x-y\|^{2}
$$
Now factorising, and taking square roots (valid since $1 - \lambda\alpha \geq 0$ whenever $\lambda \leq \frac{2}{\alpha + \beta} \leq \frac{1}{\alpha}$), we have
$$
\|Tx - Ty\|^{2} \leq (1- \lambda\alpha)^{2}\|x-y\|^{2} \implies \|Tx - Ty\| \leq (1- \lambda\alpha)\|x - y\|
$$
Note that when $\lambda \leq \frac{2}{\alpha + \beta}$, we have $1 - \lambda\alpha = \max\{|1 - \lambda\alpha|, |1 - \lambda\beta|\}$:
- $1 - \lambda\alpha \geq 1- \lambda\beta$ follows from $\beta > \alpha$
- $1 - \lambda\alpha \geq \lambda\beta - 1 \iff 2 \geq \lambda(\alpha + \beta) \iff \frac{2}{\alpha + \beta} \geq \lambda$ which is true by assumption.
- $1 - \lambda\alpha \geq \alpha\lambda - 1$ follows from the above fact and the fact that $\beta > \alpha$.
Thus we have proved that gradient descent is a $\max \{|1 - \lambda\alpha|, |1 - \lambda\beta|\}$ contraction for all $\lambda \leq \frac{2}{\alpha+\beta}$. For $\lambda \geq \frac{2}{\alpha +\beta}$, the second term on the RHS of our inequality is nonnegative, so we can instead replace $\|\nabla f(x) - \nabla f(y)\|$ with an upper bound, using $\beta$-smoothness:
$$
\|\nabla f(x) - \nabla f(y)\| \leq \beta \|x - y\|
$$
Plugging this in yields:
$$
\|Tx - Ty \|^{2} \leq \left(1 - 2\lambda\left(\frac{\alpha\beta}{\alpha + \beta}\right)\right)\|x - y\|^{2} + \left(\lambda^{2}\beta^{2} - \frac{2\beta^{2}\lambda}{\alpha + \beta}\right)\|x - y\|^{2}
$$
Simplifying and factorising as before, and taking square roots (valid since $\lambda\beta - 1 \geq 0$ whenever $\lambda \geq \frac{2}{\alpha + \beta}$), we find that:
$$
\|Tx - Ty\|^{2} \leq (\lambda\beta - 1)^{2}\|x-y\|^{2} \implies \|Tx - Ty\| \leq (\lambda\beta - 1)\|x - y\|
$$
Similarly to before, when $\lambda \geq \frac{2}{\alpha + \beta}$, we see that $\lambda\beta - 1 = \max \{|1 - \lambda\alpha|, |1 - \lambda\beta|\}$. Thus we have shown that gradient descent is a $\max \{|1 - \lambda\alpha|, |1 - \lambda\beta|\}$ contraction for all $\lambda$ such that $\frac{2}{\beta} > \lambda \geq \frac{2}{\alpha + \beta}$.
Combining both cases ($\lambda \geq \frac{2}{\alpha+\beta}$, $\lambda \leq \frac{2}{\alpha +\beta}$) we see that gradient descent is a $\max \{|1 - \lambda\alpha|, |1 - \lambda\beta|\}$ contraction for all $\lambda \in \left(0, \frac{2}{\beta}\right)$, i.e.
$$
\|Tx - Ty\| \leq \max\{|1 - \lambda\alpha|, |1 - \lambda\beta|\}\|x - y\|
$$
It is now also immediately obvious why $\frac{2}{\beta}$ is an upper bound on the step size. For $\lambda \geq \frac{2}{\alpha + \beta}$ we have
$$
\lambda\beta - 1 = \max\{|1 - \lambda\alpha|, |1 - \lambda\beta|\}
$$
But
$$
\lambda\beta - 1 < 1 \iff \lambda < \frac{2}{\beta}
$$
So bounding $\lambda$ above by $\frac{2}{\beta}$ ensures $\max\{|1 - \lambda\alpha|, |1 - \lambda\beta|\}$ is less than 1.
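As a numerical sanity check on the final bound, here is a small NumPy sketch (all names are my own) using a quadratic $f(x) = \frac{1}{2}x^{\top}Ax$, which is $\alpha$-strongly convex and $\beta$-smooth with $\alpha, \beta$ the extreme eigenvalues of $A$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
M = rng.normal(size=(n, n))
A = M @ M.T + 0.5 * np.eye(n)            # symmetric positive definite
eigs = np.linalg.eigvalsh(A)
alpha, beta = eigs[0], eigs[-1]          # strong convexity / smoothness constants

# Gradient-descent map T(x) = x - lam * grad f(x) = (I - lam * A) x.
for lam in np.linspace(1e-3, 2.0 / beta - 1e-3, 20):
    rate = max(abs(1 - lam * alpha), abs(1 - lam * beta))
    assert rate < 1.0                    # strict contraction on (0, 2/beta)
    for _ in range(50):
        x, y = rng.normal(size=n), rng.normal(size=n)
        Tx, Ty = x - lam * (A @ x), y - lam * (A @ y)
        assert np.linalg.norm(Tx - Ty) <= rate * np.linalg.norm(x - y) + 1e-9
```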
You already found $$\left\|\frac{1}{n} {X^{(k)}}^T r_{(-k)} - \alpha \lambda [-1,1]\right\|_2 \leq (1-\alpha)\lambda, \qquad (1)$$ where you can choose any value in $[-1,1]$ to satisfy the inequality, for each element of the vector separately. Let us see how you would select this value for the $j^{th}$ component. The goal is to satisfy inequality (1), so you want the component to be close to $0$. The value you select depends on $$\frac{1}{n}\left({X^{(k)}}^T r_{(-k)}\right)_j. \qquad (2)$$ You select $1$ (from the interval $[-1,1]$) when (2) is larger than $\alpha\lambda$, $-1$ when (2) is smaller than $-\alpha\lambda$, and (2) divided by $\alpha\lambda$ when (2) is in $[-\alpha\lambda,\alpha\lambda]$. The last displayed formula in your question does not take the last case into account, while the $(\cdot)_+$ operator does.
When (2) is larger than $\alpha\lambda$, the $j^{th}$ component equals: $$\frac{1}{n}\left({X^{(k)}}^T r_{(-k)}\right)_j-\alpha\lambda.$$ When (2) is smaller than $-\alpha\lambda$, the $j^{th}$ component equals: $$\frac{1}{n}\left({X^{(k)}}^T r_{(-k)}\right)_j+\alpha\lambda.$$ When (2) is in $[-\alpha\lambda,\alpha\lambda]$, the $j^{th}$ component equals $0$. In each of these cases, it is easy to see that the component equals $$\operatorname{sign}\left(\frac{1}{n}\left({X^{(k)}}^T r_{(-k)}\right)_j\right)\left(\left|\frac{1}{n}\left({X^{(k)}}^T r_{(-k)}\right)_j\right|- \alpha \lambda\right)_+.$$ Note that the sign operator can be omitted, since it does not affect the value of the 2-norm in (1).
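In code, this componentwise rule is again just the soft-thresholding operator applied to the vector in (2); a minimal NumPy sketch (the shapes and the names `X_k`, `r_minus_k` are hypothetical, purely for illustration) would be:

```python
import numpy as np

def soft_threshold(v, t):
    """Componentwise sign(v_j) * (|v_j| - t)_+ ."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(2)
n, p = 50, 4
X_k = rng.normal(size=(n, p))       # hypothetical predictor block X^{(k)}
r_minus_k = rng.normal(size=n)      # hypothetical partial residual r_{(-k)}
alpha, lam = 0.5, 0.3

v = X_k.T @ r_minus_k / n           # the quantity (2), one value per component
update = soft_threshold(v, alpha * lam)
```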