I finally found a proof I understood; I took it from Billingsley, "Probability and Measure". In order to be thorough, I reproduce the full argument here.
Thm: If $X_n$ is a sequence of random variables such that:
the MGF of each $X_n$ is defined for $t \in [-r,r]$ (for some $r > 0$), and
the MGFs converge pointwise on $[-r,r]$ to the MGF of $X$,
then $X_n \rightarrow X$ weakly, and moreover every moment of $X_n$ converges to the corresponding moment of $X$.
Proof: First we prove that the sequence $X_n$ is tight. Since $E( \exp(-r X_n) + \exp(r X_n) )$ converges, it is bounded in $n$, and from this boundedness tightness of the sequence $X_n$ follows, as spelled out below.
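To make the tightness step explicit (this is the standard Markov-inequality device; the constant $C$ below is my notation, not from the original): for any $M > 0$,
$$
P(|X_n| \ge M) = P\big(e^{r|X_n|} \ge e^{rM}\big) \le e^{-rM}\, E\big(e^{rX_n} + e^{-rX_n}\big) \le C\, e^{-rM},
$$
where $C$ is a bound on the convergent sequence $E(e^{rX_n} + e^{-rX_n})$. The right-hand side does not depend on $n$ and tends to $0$ as $M \to \infty$, which is exactly tightness.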
Since the sequence is tight, we can extract a subsequence $X_{n_k}$ which converges weakly to some limit $\tilde X$. By continuity, $\tilde X$ has an MGF equal to that of $X$. Since the MGF characterizes the distribution of a random variable, $\tilde X = X$ (see the lemma below).
Every weakly convergent subsequence thus converges to $X$, and $X_n$ is tight, so $X_n \rightarrow X$ weakly (Billingsley, Theorem 25.10 and its corollary).
Lemma: The MGF characterizes the distribution of a random variable: if $X$ and $Y$ have the same MGF on $[-r,r]$ for some $r > 0$, then $X$ and $Y$ have the same distribution.
Proof: If the MGF is defined on $[-r,r]$, then it is analytic on $(-r,r)$. We can then extend it analytically to the strip $\{z \in \mathbb{C} : \operatorname{Re}(z) \in (-r,r)\}$, and this extension is unique; denote it by $\psi(z)$. Then $\phi(t) = \psi(it)$ is the characteristic function of $X$, which uniquely determines the distribution.
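As a concrete illustration (a standard example, not part of the original argument): the standard normal has MGF $M(t) = e^{t^2/2}$, defined for all real $t$; its unique analytic extension is $\psi(z) = e^{z^2/2}$, and
$$
\phi(t) = \psi(it) = e^{(it)^2/2} = e^{-t^2/2}
$$
is indeed the standard normal characteristic function.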
In a sense, an MGF is simply a way of encoding a set of moments into a convenient function in a way that you can do some useful things with the function.
The variable $t$ in no way relates to the random variable $X$. You could as readily write $M_X(s)$ or $M_X(u)$... it is, in essence, a kind of dummy variable. It doesn't stand for anything beyond being the argument of the mgf.
Herbert Wilf [1] calls a generating function:
a clothesline on which we hang up a sequence of numbers for display
It really wouldn't matter which exact clothesline you hung them on; another would do just as well.
Is there any way to derive the functions from anywhere?
There's more than one way to turn a set of moments into a generating function (e.g. a discrete distribution has a probability generating function, a moment generating function, a cumulant generating function and a characteristic function), and you can recover the moments (in some cases less directly than others) from any of them.
So there's not a unique way to encode a set of moments into a function; it's a matter of choice about how you set it up. While they're similar (and, naturally, related), some are more convenient for particular kinds of tasks.
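As a small illustration of that point (my own sketch, not from the answer; the choice of sympy and of a Poisson($\lambda$) example is arbitrary), here is one way to recover the same first two moments from both the mgf and the characteristic function:

```python
import sympy as sp

t = sp.symbols('t', real=True)
lam = sp.symbols('lambda', positive=True)

# Poisson(lambda): mgf M(t) = exp(lambda*(e^t - 1)), cf phi(t) = exp(lambda*(e^{it} - 1))
M = sp.exp(lam * (sp.exp(t) - 1))
phi = sp.exp(lam * (sp.exp(sp.I * t) - 1))

# k-th raw moment from the mgf: the k-th derivative evaluated at t = 0
m1_mgf = sp.simplify(sp.diff(M, t, 1).subs(t, 0))    # lambda
m2_mgf = sp.simplify(sp.diff(M, t, 2).subs(t, 0))    # lambda**2 + lambda

# k-th raw moment from the cf: the k-th derivative at t = 0, divided by i^k
m1_cf = sp.simplify(sp.diff(phi, t, 1).subs(t, 0) / sp.I)
m2_cf = sp.simplify(sp.diff(phi, t, 2).subs(t, 0) / sp.I**2)

print(m1_mgf, m2_mgf)   # lambda   lambda**2 + lambda
print(m1_cf, m2_cf)     # lambda   lambda**2 + lambda
```

Both routes give $E(X) = \lambda$ and $E(X^2) = \lambda^2 + \lambda$, just via slightly different manipulations of the respective functions.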
I see a certain analogy between the mgf and the Laplace transform, and between the cf and the Fourier transform.
Not merely an analogy, at least if we consider the bilateral Laplace transform (which I'll still denote $\mathcal{L}$ here). We see $M_X(t) = \mathcal{L}_X(-t)$ is (at least up to a change of sign) really a Laplace transform (indeed, $\mathcal{L}_X(-t) = \mathcal{L}_{-X}(t)$, so it's the bilateral Laplace transform of a flipped variate). One can convert readily from one to the other, and use results for Laplace transforms on mgfs quite happily (and, for that matter, tables of Laplace transforms, if we keep that sign issue in mind). Similarly, characteristic functions are not merely analogous to Fourier transforms, they are Fourier transforms (again, up to the sign of the argument, which is of no consequence beyond the obvious effect that swapping the sign of the argument has on a function).
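A quick numerical sanity check of that correspondence (my own sketch; the standard normal and the particular Fourier sign convention are just choices made for the example):

```python
import numpy as np
from scipy import integrate, stats

f = stats.norm.pdf   # standard normal density; mgf exp(t^2/2), cf exp(-t^2/2)

def bilateral_laplace(s):
    # L_f(s) = integral of exp(-s*x) f(x) dx over the whole real line
    val, _ = integrate.quad(lambda x: np.exp(-s * x) * f(x), -np.inf, np.inf)
    return val

def fourier(w):
    # hat{f}(w) = integral of exp(-i*w*x) f(x) dx (one common convention)
    re, _ = integrate.quad(lambda x: np.cos(w * x) * f(x), -np.inf, np.inf)
    im, _ = integrate.quad(lambda x: -np.sin(w * x) * f(x), -np.inf, np.inf)
    return re + 1j * im

t = 0.7
print(bilateral_laplace(-t), np.exp(t**2 / 2))   # M_X(t) = L_f(-t)     ~ 1.2776
print(fourier(-t), np.exp(-t**2 / 2))            # phi_X(t) = hat{f}(-t) ~ 0.7827
```

Up to numerical error, the Laplace transform of the density evaluated at $-t$ matches the mgf, and the Fourier transform evaluated at $-t$ matches the cf, which is the sign issue mentioned above.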
If Fourier transforms and Laplace transforms help give you intuition about what mgfs and cfs "are" you should certainly exploit those intuitions, but on the other hand, it's not always necessary to have intuition when manipulating these things.
In fact when playing with cfs, because they always exist and are unique, I often tend to think of them as just the distribution looked at through a different lens.
I can see that taking the derivative of the function and evaluating at t=0 gives the moment (if the integral is absolutely convergent), but why?
Because the particular generating function we chose to use (the mgf) was set up to work that way. In order to be able to extract the set of moments from the function again, you need something like that -- a way to eliminate all the lower moments (such as differentiation) and all the higher ones (such as setting the argument to 0) so that you can pick out the exact one you need. For that to happen you already need something that works rather like an mgf. At the same time, it's nice if it has some other properties you can exploit (as the various generating functions we use with random variables do), so that restricts our set of choices even further.
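To spell that out as a worked expansion (assuming the interchange of expectation and sum is justified, e.g. by absolute convergence on $[-r,r]$):
$$
M_X(t) = E\big(e^{tX}\big) = E\left(\sum_{k=0}^{\infty} \frac{t^k X^k}{k!}\right) = \sum_{k=0}^{\infty} \frac{E(X^k)}{k!}\, t^k .
$$
Differentiating $n$ times kills every term with $k < n$, and then setting $t = 0$ kills every term with $k > n$, leaving $M_X^{(n)}(0) = E(X^n)$.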
[1] Wilf, H. (1994). generatingfunctionology, 2nd ed. Academic Press, San Diego. https://www.math.upenn.edu/~wilf/DownldGF.html
Best Answer
As I recall, in this version the random variables are independent with finite variances, but the variances need not all be the same. The CLT result holds under a somewhat complicated condition called the Lindeberg condition (stated below for reference), and the traditional proofs use transform methods.
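For reference (stating the condition the answer alludes to; the notation $s_n^2 = \sum_{k=1}^{n} \sigma_k^2$ is mine): with independent $X_k$ having means $\mu_k$ and variances $\sigma_k^2$, the Lindeberg condition requires that for every $\varepsilon > 0$,
$$
\lim_{n \to \infty} \frac{1}{s_n^2} \sum_{k=1}^{n} E\Big[ (X_k - \mu_k)^2 \,\mathbf{1}\{ |X_k - \mu_k| > \varepsilon s_n \} \Big] = 0 .
$$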
But the proof we learned was probabilistic. It involved splitting the sum into two pieces. One piece converged to N(0,1) in distribution and the other converged to 0 in probability. This technique was used because it was much easier to show that the first sum satisfied the CLT; showing that the second sum was negligible was harder. The following link gives an interesting paper by Larry Goldstein that gives a probabilistic proof of the Lindeberg-Feller theorem that is very similar, or the same. It may also be of interest to the OP because it includes some history on the CLT. http://bcf.usc.edu/~larry/papers/pdf/lin.pdf