There's a short-cut in Roth's approach if one only cares to get $o(N)$. Adolf Hildebrand told me so, and here is my shortest writeup.
Notation: Let $r(N),\rho(N)$ be the largest cardinality and density of a subset of $[N]$ that is free of 3-term APs, and let $\rho=\lim \rho(N)$, which must exist by Fekete's subadditive lemma since $r(N+M)\leq r(N)+r(M)$. Let $A\subset [N]$ witness $r(N)$, and let $S(x) = \sum_{a\in A} e( a x)$ (where $e(ax)=\exp(2\pi i a x)$, naturally). Let $T(x)=\sum_{n=1}^N e(nx)$.
Lemmata: Now $r(N)=|A|=I:=\int_0^1 S(x)^2 S(-2x)dx$, since $A$ is 3-free. By Parseval, $|A|=\int_0^1 |S(x)|^2dx$. By expanding $S(x)^2 T(-2x)$ and exchanging integration and summation, $(|A|/2)^2 \leq I_0:=\int_0^1 S(x)^2 T(-2x)$.
Main Engine: As long as $0< M\leq N$, $$\sup_{x\in {\mathbb R}} |S(x)-\rho(M) T(x)| \leq N(\rho(M)-\rho(N)) + 10 M \sqrt{N}.$$ Proof is by circle method; set $E(x):=S(x)-\rho(M)T(x)$. Three steps are
- $|E(a/q)|\leq N(\rho(M)-\rho(N))+2Mq$, the hard one;
- $|E(a/q+\beta)| \leq N(\rho(M)-\rho(N))+2Mq +2\pi|\beta|NMq$;
- $|E(x)| \leq N(\rho(M)-\rho(N))+10M\sqrt{N}$.
Main Lemma: By Main Engine and Lemmata, $\Delta:=|I-\rho(M)I_0|$ is at most $(N(\rho(M)-\rho(N))+10M\sqrt{N})|A|$. By Lemmata, $\Delta$ is at least $|A|(\rho(M)\rho(N)N/4-1)$. Therefore, $$\rho(N)\rho(M)\leq 4(\rho(M)-\rho(N))+50M/\sqrt{N}.$$
Conclusion: Let $N\to\infty$ and then $M\to\infty$ to get $\rho^2\leq 0$.
The step labeled "the hard one" in the Main Engine is tricky, and uses the fact that we can restrict $A$ to arithmetic progressions and still get a 3-free set. The length of the progressions we restrict to is connected to $M$.
I suppose the point is that a write-up can be almost arbitrarily short or extremely long, depending on what the reader can be trusted to fill in.
The non-effectivity, as far as I understand, is already present in Thue's Theorem, thus to understand it, one can look a the proof of the latter. The issue is roughly that, to show that there are not many "close rational approximations" $p/q$, one starts with the assumption that there exists one very close one $p_0/q_0$, and show that this very good approximation "repulses" or excludes other similarly or better ones. This of course doesn't work if the first $p_0/q_0$ doesn't exist... but such an assumption also gives the result! The ineffectivity is that we have no way of knowing which of the two alternatives has led to the conclusion.
There is a well-known analogy with the Siegel (or Landau-Siegel) zero question in the theory of Dirichlet $L$-functions. Siegel -- and it is certainly not coincidental that this is the same Siegel as in Thue--Siegel--Roth, though Landau did also have crucial ideas in that case -- proved an upper bound for real-zeros of quadratic Dirichlet $L$-functions by (1) showing that if there is one such $L$-function with a zero very close to $1$, then this "repulses" the zeros of all other quadratic Dirichet $L$-functions (this phenomenon is fairly well-understood under the name of Deuring-Heilbronn phenomenon), thus obtaining the desired bound; (2) arguing that if the "bad" $L$-function of (1) did not exist, then one is done anyway.
Here the ineffectivity is clear as day: the "bad" character of (1) is almost certainly non-existent, because it would violate rather badly the Generalized Riemann Hypothesis. But as far as we know today, we have to take into account the possibility of the existence of these bad characters... a possibility which however does have positive consequences, like Siegel's Theorem...
(There's much more to this second story; an entertaining account appeared in an article in the Notices of the AMS one or two years ago, written by J. Friedlander and H. Iwaniec.)
Best Answer
Here is an extract from van der Poorten's "Obituary. Kurt Mahler (1903--1988)" (see p.353 in J. Austral. Math. Soc. Ser. A 51 (1991)):
This is to say that Roth's argument is more superior than Mahler's but it appeared some 20 years later...