These types of conditions (exchangeability and $O(1/n)$ covariance) do not even ensure that $S_n$ is approximately normal.
Consider only even $n$ (I believe similar examples can be constructed for odd $n$). Let $n=2m$, let each $X_{n,k} \in \lbrace -1,1 \rbrace$, and give each configuration with exactly $m$ positive signs probability ${2m \choose m}^{-1}$. Then the sum is always $0$. This satisfies conditions 1 and 2: $P(X_{n,1} = X_{n,2}) = {2m-2 \choose m}/{2m-1 \choose m} = \frac{m-1}{2m-1}$, so the covariance is $\frac{m-1}{2m-1} - \frac{m}{2m-1} = \frac{-1}{2m-1}$.
The resulting point mass at $0$ can be regarded as a degenerate normal distribution. However, the example can be modified slightly so that the sum has positive standard deviation. With probability $\frac{1}{4m^2}$ let all $X_{n,i}$ be equal (a $\frac{1}{8m^2}$ chance of all $+1$ and a $\frac{1}{8m^2}$ chance of all $-1$), and with probability $\frac{4m^2-1}{4m^2}$ let the positive indices be a uniformly random subset of size $m$. Then $\text{Cov}(X_{n,1},X_{n,2}) = \frac{-1}{n}$, while $P(S_n =0) = \frac{n^2-1}{n^2}$ and $P(S_n = \pm n) = \frac{1}{2n^2}$, so $\text{Var}(S_n) = 1$ and $\frac{S_n}{\sqrt{\text{Var}(S_n)}}$ is far from normal.
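As a quick numerical sanity check of this modified example, here is a minimal simulation sketch (not part of the original argument; the sample size, number of trials, and variable names are arbitrary choices). It verifies that $\text{Cov}(X_{n,1},X_{n,2}) \approx -1/n$, $\text{Var}(S_n) \approx 1$, and that $S_n$ is essentially a point mass at $0$ with rare excursions to $\pm n$:

```python
# Simulation sketch of the modified exchangeable example (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
n = 50                      # must be even, n = 2m
m = n // 2
trials = 200_000

X = np.empty((trials, n))
for t in range(trials):
    if rng.random() < 1.0 / n**2:
        # all signs equal: +1 or -1 with probability 1/2 each
        X[t] = 1.0 if rng.random() < 0.5 else -1.0
    else:
        # positive signs form a uniformly random subset of size m
        row = -np.ones(n)
        row[rng.choice(n, size=m, replace=False)] = 1.0
        X[t] = row

S = X.sum(axis=1)
print("Cov(X1, X2) ~", np.mean(X[:, 0] * X[:, 1]), " vs  -1/n =", -1.0 / n)
print("Var(S_n)    ~", S.var())
print("P(S_n = 0)  ~", np.mean(S == 0), " vs  (n^2-1)/n^2 =", (n**2 - 1) / n**2)
```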
For your first question, Rigollet has a set of lecture notes (with minor typos) that discusses the basics of this kind of tail bound. The result you mention in the "introduction" is actually the classic Hoeffding bound, which Rigollet covers in these notes.
High Dimensional Statistics, Philippe Rigollet (2015)
http://www-math.mit.edu/~rigollet/PDFs/RigNotes15.pdf
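To make the Hoeffding bound concrete, here is a minimal numerical sketch (not taken from Rigollet's notes; the sample size, threshold, and the choice of uniform variables are arbitrary assumptions). It compares the empirical tail probability of a sample mean of i.i.d. variables in $[0,1]$ with the bound $\exp(-2nt^2)$:

```python
# Hoeffding's inequality for i.i.d. X_i in [0,1]:
#   P( mean(X) - E[X] >= t ) <= exp(-2 n t^2)
import numpy as np

rng = np.random.default_rng(1)
n, trials, t = 100, 100_000, 0.1

samples = rng.uniform(0.0, 1.0, size=(trials, n))     # E[X] = 1/2
excess = samples.mean(axis=1) - 0.5
print("empirical P(mean - E[X] >= t):", np.mean(excess >= t))
print("Hoeffding bound exp(-2 n t^2):", np.exp(-2 * n * t**2))
```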
If you are concerned with order statistics, you usually want to restrict yourself to a narrower class of distributions; see, for example, the classic paper [2].
For a general family of distributions, it is almost impossible to obtain a tail bound on the order statistics, due to the concentration of measure phenomenon. Another keyword you may want to look into is "U-statistics", because the (full) order statistic is an example of a U-statistic. The following quote from [1] is the best description of the power of U-statistics:
"Theoretically, for these U-statistics we can study the whole spectrum of asymptotic problems which were investigated for independent variables. As a matter of fact, it is necessary to control the nature of dependence in order to obtain meaningful results. We restrict ourselves to exchangeable and weakly exchangeable variables, rank statistics, samplings from finite populations, weakly dependent random variables, bootstrap-variables, and to order statistics." [1, p. 15]
Concentration bounds for U-statistics are a research topic that involves many false claims and technically demanding arguments, so while it is not too hard to find such a bound, you can never be too careful when using one.
Update in response to Update 3.
Why is the maximal uniform spacing so small in magnitude compared to the maximal oscillation in the empirical distribution?
This is not surprising. You can always have very large spacings together with a small Kolmogorov-Smirnov norm: the KS norm is a distance on the space of measures $\mathcal{M}(\mathbb{R})$, while the spacing is measured on $\mathbb{R}$ itself. If you look into Luc's argument, you will see his comments on this point.
For the simplest example, consider the two data sets $\{0,\,0.1,\,\underbrace{0.9,\dots,0.9}_{n\text{ copies}},\,1\}$ and $\{0,\,\underbrace{0.1,\dots,0.1}_{n\text{ copies}},\,0.9,\,1\}$. Their empirical measures have completely different CDFs, yet both have the same maximal spacing of $0.8$. As the number $n$ of repeated observations increases, the KS norm between the two CDFs can be made arbitrarily close to $1$.
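A minimal numerical illustration of this example follows (the helper functions `max_spacing` and `ks_distance` are names introduced here for illustration, not from any cited source):

```python
# Two data sets with identical maximal spacing (0.8) but nearly maximal KS distance.
import numpy as np

def max_spacing(data):
    """Largest gap between consecutive distinct sorted values."""
    return np.max(np.diff(np.unique(data)))

def ks_distance(a, b):
    """Sup-norm distance between the empirical CDFs of a and b."""
    grid = np.union1d(a, b)
    F_a = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    F_b = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.max(np.abs(F_a - F_b))

n = 1000                                                       # number of repeats
data1 = np.concatenate(([0.0, 0.1], np.full(n, 0.9), [1.0]))   # 0.9 repeated n times
data2 = np.concatenate(([0.0], np.full(n, 0.1), [0.9, 1.0]))   # 0.1 repeated n times

print("max spacings:", max_spacing(data1), max_spacing(data2))  # both 0.8
print("KS distance :", ks_distance(data1, data2))               # -> 1 as n grows
```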
References
[1] Korolyuk, Vladimir S., and Yu. V. Borovskich. Theory of U-Statistics. Vol. 273. Springer Science & Business Media, 2013.
[2] Das Gupta, S., et al. "Inequalities on the probability content of convex regions for elliptically contoured distributions." Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability (Univ. California, Berkeley, Calif., 1970/1971), Vol. 2, 1972.
Best Answer
$\newcommand{\de}{\delta} \newcommand{\De}{\Delta} \newcommand{\ep}{\epsilon} \newcommand{\ga}{\gamma} \newcommand{\Ga}{\Gamma} \newcommand{\la}{\lambda} \newcommand{\Si}{\Sigma} \newcommand{\thh}{\theta} \newcommand{\R}{\mathbb{R}} \newcommand{\E}{\operatorname{\mathsf E}} \newcommand{\PP}{\operatorname{\mathsf P}}$
Let $V_k:=1/\sqrt{X_k}$, $b_n:=\sqrt{n\ln n}$, \begin{equation*} Z_n:=\frac{S_n-2n}{\sqrt{n\ln n}}=\frac1{b_n}\sum_1^n (V_k-\E V_1), \end{equation*} \begin{equation*} \de_n:=\sup_{x\in\R}|\De_n(x)|, \end{equation*} where \begin{equation*} \De_n:=F_n-G,\quad F_n(x):= P(Z_n<x),\quad G(x):=P(Z<x), \end{equation*} and $Z\sim N(0,1)$.
We shall show that for all large enough $n$ \begin{equation*} \frac{\sqrt{\ln\ln n}}{\ln n}\ll\de_n\ll\ep_n:=\frac{\ln\ln n}{\ln n}; \tag{1} \end{equation*} here and in what follows, the constants associated with $\ll$, $\gg$, $O(\cdot)$ are universal. Thus, we have a rather tight bracketing of $\de_n$. (Conjecture: $\de_n\asymp\frac{\ln\ln n}{\ln n}$.)
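Before the proof, a quick Monte Carlo sanity check of (1) may be helpful. This sketch assumes the $X_k$ are i.i.d. uniform on $(0,1)$, which is consistent with the density $2x^{-3}$ of $V_k=1/\sqrt{X_k}$ used below; all numerical choices are illustrative:

```python
# Monte Carlo estimate of delta_n = sup_x |P(Z_n < x) - Phi(x)|,
# where Z_n = (S_n - 2n) / sqrt(n ln n) and S_n = sum_k 1/sqrt(X_k),
# X_k assumed i.i.d. Uniform(0,1), so V_k = 1/sqrt(X_k) has density 2/x^3 on (1, inf).
import numpy as np
from math import sqrt, log, erf

rng = np.random.default_rng(2)
n, trials = 10_000, 20_000

Z = np.empty(trials)
for i in range(trials):
    V = 1.0 / np.sqrt(rng.uniform(size=n))
    Z[i] = (V.sum() - 2 * n) / sqrt(n * log(n))

Z.sort()
Phi = np.array([0.5 * (1.0 + erf(z / sqrt(2))) for z in Z])   # standard normal CDF at sample points
F_lo = np.arange(trials) / trials                             # empirical CDF just left of each point
F_hi = np.arange(1, trials + 1) / trials                      # empirical CDF at each point
delta_hat = max(np.max(np.abs(F_lo - Phi)), np.max(np.abs(F_hi - Phi)))

print("estimated delta_n  :", delta_hat)
print("ln(ln n) / ln n    :", log(log(n)) / log(n))
print("sqrt(ln ln n)/ln n :", sqrt(log(log(n))) / log(n))
```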
Let $c$ denote various complex-valued expressions (possibly different even within the same formula) such that $|c|\ll1$.
The pdf of $V_k$ is $x\mapsto\frac2{x^3}\,I\{x>1\}$, where $I$ denotes the indicator. Note that $|e^{iu}-1-iu|\le2|u|$ and $|e^{iu}-1-iu+u^2/2|\le|u|^3/6$ for real $u$. So, for the characteristic function (c.f.) $f_V$ of $V_k$ and $|t|\le1$ we have \begin{multline*} \frac12\,f_V(t)=\int_1^\infty\frac{e^{itx}}{x^3}\,dx =\int_1^\infty\frac{1+itx}{x^3}\,dx -\int_1^{1/|t|}\frac{t^2x^2/2}{x^3}\,dx \\ +\int_1^{1/|t|}\frac{e^{itx}-1-itx+t^2x^2/2}{x^3}\,dx +\int_{1/|t|}^\infty\frac{e^{itx}-1-itx}{x^3}\,dx \\ =\frac12+it-\frac{t^2}2\,\ln\frac1{|t|} +c\int_1^{1/|t|}\frac{|t|^3x^3}{x^3}\,dx +c\int_{1/|t|}^\infty\frac{|t|x}{x^3}\,dx \\ =\frac12+it-\frac{t^2}2\,\ln\frac1{|t|}+ct^2. \end{multline*} So, for $|t|\le1$ we have $\ln f_V(t)=2it-t^2\,\ln\frac1{|t|}+ct^2$ and hence for the characteristic function $f_n:=f_{Z_n}$ of $Z_n$ we have \begin{multline*} \ln f_n(t)=-i2nt/b_n+n\ln f_V(t/b_n) =-\frac{t^2}{\ln n}\,\ln\frac{\sqrt{n\ln n}}{|t|}+c\frac{t^2}{\ln n} \\ =-\frac{t^2}2-\frac{t^2}{\ln n}\,\Big(\frac12\,\ln\ln n-\ln|t|\Big)+c\frac{t^2}{\ln n} \\ =-\frac{t^2}2+\frac{t^2}{\ln n}\,\ln|t|+c\frac{t^2}{\ln n}\,\ln\ln n \tag{2} \end{multline*} for $|t|\le b_n=\sqrt{n\ln n}$. So, with $\ep_n$ as in (1), \begin{equation*} \ln f_n(t)= \begin{cases} -\frac{t^2}2+c\ep_n|t| & \text{ if }|t|\le1 \\ -\frac{t^2}2+c\ep_n t^2\ & \text{ if }1\le|t|\le\ln n, \end{cases} \end{equation*} whence \begin{multline*} \int_{|t|<\ln n}\frac{|f_n(t)-e^{-t^2/2}|}{|t|}\,dt \\ \le \int_{|t|<1}\frac{|e^{c\ep_n|t|}-1|}{|t|}\,dt +\int_{\R}\frac{|e^{-(1-2c\ep_n)t^2/2}-e^{-t^2/2}|}{|t|}\,dt \le c\ep_n; \end{multline*} the latter integral was bounded using the identity $\int_0^\infty\frac{e^{-at^2/2}-e^{-t^2/2}}t\,dt=\ln(1/\sqrt a)$ for $a>0$. By the Esseen smoothing inequality (see e.g. formula (6.4)), \begin{equation*} \de_n\le c\int_{|t|<\ln n}\frac{|f_n(t)-e^{-t^2/2}|}{|t|}\,dt +c/\ln n. \end{equation*} Now the second inequality in (1) immediately follows.
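The identity used for the latter integral, which is not spelled out above, follows by differentiation under the integral sign: with $I(a):=\int_0^\infty\frac{e^{-at^2/2}-e^{-t^2/2}}t\,dt$, \begin{equation*} I'(a)=-\int_0^\infty\frac t2\,e^{-at^2/2}\,dt=-\frac1{2a},\qquad I(1)=0, \end{equation*} so that $I(a)=-\tfrac12\ln a=\ln(1/\sqrt a)$.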
It remains to prove the first inequality in (1). For real $t$ and real $A>0$, \begin{multline*} \int_0^\infty e^{itx}d\De_n(x)=\int_0^A e^{itx}d\De_n(x)+c(1-F_n(A))+c(1-G(A)) \\ =c\De_n(A)+c\De_n(0)-it\int_0^A e^{itx}\De_n(x)dx+2c(1-G(A))+c\de_n \\ =c\de_n+c|t|A\de_n+ce^{-A^2/2} =c\de_n+c|t|\de_n\sqrt{\ln\frac1{\de_n}} \end{multline*} if $A=\sqrt{2\ln\frac1{\de_n}}$. Similarly estimating $\int_{-\infty}^0 e^{itx}d\De_n(x)$, we have \begin{equation*} f_n(t)-e^{-t^2/2}=\int_{-\infty}^\infty e^{itx}d\De_n(x)=c\de_n+c|t|\de_n\sqrt{\ln\frac1{\de_n}}. \end{equation*} Letting $t=1$ here and in (2), we see from the second line in (2) that $$\de_n\sqrt{\ln\frac1{\de_n}}\gg\frac{\ln\ln n}{\ln n},$$ whence the first inequality in (1) immediately follows.
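(For clarity: the second equality in the display above follows by integrating by parts over $[0,A]$ and using $1-F_n(A)\le 1-G(A)+\de_n$; the third uses $|\De_n|\le\de_n$, $\big|\int_0^A e^{itx}\De_n(x)\,dx\big|\le A\de_n$, and the standard Gaussian tail bound $1-G(A)\le e^{-A^2/2}$; finally, the choice $A=\sqrt{2\ln\frac1{\de_n}}$ makes $e^{-A^2/2}=\de_n$ and $A\ll\sqrt{\ln\frac1{\de_n}}$.)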
It appears that similar techniques should work for a somewhat wide class of distributions, like the ones referenced by the OP.