The general proof of this can be found in Feller (An Introduction to Probability Theory and Its Applications, Vol. 2). It is an inversion problem involving Laplace transform theory. Did you notice that the mgf bears a striking resemblance to the Laplace transform? For the use of the Laplace transform you can see Widder (Calculus, Vol. I).
Proof of a special case:
Suppose that $X$ and $Y$ are random variables, both taking values only in $\{0, 1, 2,\dots, n\}$.
Further, suppose that $X$ and $Y$ have the same mgf for all $t$:
$$\sum_{x=0}^ne^{tx}f_X(x)=\sum_{y=0}^ne^{ty}f_Y(y)$$
For simplicity, we will let $s = e^t$
and we will define $c_i = f_X(i) - f_Y(i)$ for $i = 0, 1,\dots,n$.
Now
$$\sum_{x=0}^ne^{tx}f_X(x)-\sum_{y=0}^ne^{ty}f_Y(y)=0$$
$$\Rightarrow \sum_{x=0}^ns^xf_X(x)-\sum_{y=0}^ns^yf_Y(y)=0$$
$$\Rightarrow \sum_{x=0}^ns^xf_X(x)-\sum_{x=0}^ns^xf_Y(x)=0$$
$$\Rightarrow\sum_{x=0}^ns^x[f_X(x)-f_Y(x)]=0$$
$$\Rightarrow \sum_{x=0}^ns^xc_x=0\quad\forall\, s>0$$
The above is simply a polynomial in $s$ with coefficients $c_0, c_1,\dots,c_n$. The only way it can be zero for all values of $s$ is if $c_0=c_1=\cdots= c_n=0$; a nonzero polynomial of degree at most $n$ has at most $n$ roots, so it cannot vanish on all of $(0,\infty)$. So we have that $0=c_i=f_X(i)-f_Y(i)$ for $i=0, 1,\dots,n$.
Therefore, $f_X(i)=f_Y(i)$ for $i=0,1,\dots,n$.
In other words, the probability functions for $X$ and $Y$ are exactly the same; that is, $X$ and $Y$ have the same distribution.
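Here is a minimal numerical sketch of the same idea (assuming Python with NumPy; the pmf values are made up purely for illustration): since, with $s=e^t$, the mgf is a polynomial in $s$, its values at a handful of points already pin down its coefficients, i.e. the pmf.

```python
import numpy as np

# A made-up pmf on {0, 1, 2, 3}, purely for illustration.
support = np.arange(4)
pmf = np.array([0.1, 0.4, 0.3, 0.2])

# The mgf: M(t) = sum_x e^{tx} f(x); with s = e^t it is a polynomial in s.
def mgf(t):
    return np.sum(np.exp(t * support) * pmf)

# Evaluate the mgf at four distinct t values and recover the pmf by solving
# the Vandermonde system in s = e^t: the polynomial's values pin down its
# coefficients, which is exactly the uniqueness argument above.
t_values = np.array([-1.0, 0.0, 0.5, 1.0])
s_values = np.exp(t_values)
V = np.vander(s_values, N=len(support), increasing=True)  # columns: s^0, s^1, s^2, s^3
recovered = np.linalg.solve(V, np.array([mgf(t) for t in t_values]))

print(recovered)  # agrees with pmf up to floating-point error
```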
In a sense, an MGF is simply a way of encoding a set of moments into a convenient function in a way that you can do some useful things with the function.
The variable $t$ in no way relates to the random variable $X$. You could as readily write $M_X(s)$ or $M_X(u)$... it is, in essence, a kind of dummy variable. It doesn't stand for anything beyond being the argument of the mgf.
Herbert Wilf [1] calls a generating function:
a clothesline on which we hang up a sequence of numbers for display
It really wouldn't matter which exact clothesline you hung them on; another would do just as well.
Is there any way to derive the functions from anywhere?
There's more than one way to turn a set of moments into a generating function (e.g., a discrete distribution has a probability generating function, a moment generating function, a cumulant generating function and a characteristic function), and you can recover the moments (in some cases less directly than others) from any of them.
So there's not a unique way to encode a set of moments into a function; it's a matter of choice about how you set it up. While they're similar (and, naturally, related), some are more convenient for particular kinds of tasks.
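As a rough illustration (a SymPy sketch using the standard textbook forms for a Poisson($\lambda$) variable; the symbol names are my own), each of these functions hands back the moments, just by slightly different routes:

```python
import sympy as sp

t, s = sp.symbols('t s')
lam = sp.symbols('lambda', positive=True)

# Standard textbook encodings for a Poisson(lambda) distribution.
mgf = sp.exp(lam * (sp.exp(t) - 1))   # moment generating function E[e^{tX}]
pgf = sp.exp(lam * (s - 1))           # probability generating function E[s^X]
cgf = sp.log(mgf)                     # cumulant generating function

# Raw moments from the mgf: derivatives at t = 0.
m1 = sp.diff(mgf, t).subs(t, 0)       # E[X]   = lambda
m2 = sp.diff(mgf, t, 2).subs(t, 0)    # E[X^2] = lambda + lambda^2

# Factorial moments from the pgf: E[X(X-1)] = pgf''(1), so E[X^2] = lambda^2 + lambda.
fact2 = sp.diff(pgf, s, 2).subs(s, 1)

# Cumulants from the cgf: mean and variance are both lambda for a Poisson.
k1 = sp.diff(cgf, t).subs(t, 0)
k2 = sp.diff(cgf, t, 2).subs(t, 0)

print(sp.simplify(m1), sp.simplify(m2), sp.simplify(fact2),
      sp.simplify(k1), sp.simplify(k2))
```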
I see a certain analogy between mgf and Laplace transform and cf and Fourier transform.
Not merely an analogy, at least if we consider the bilateral Laplace transform (which I'll still denote as $\mathcal{L}$ here). We see $M_X(t) = \mathcal{L}_X(-t)$ is (at least up to a change of sign) really a Laplace transform (indeed, consider $\mathcal{L}_X(-t) =\mathcal{L}_{-X}(t)$, so it's the bilateral Laplace transform of a flipped variate). One can convert readily from one to the other, and use results for Laplace transforms on mgfs quite happily (and, for that matter, tables of Laplace transforms, if we keep that sign issue in mind). Similarly, characteristic functions are not merely analogous to Fourier transforms, they are Fourier transforms (again, up to the sign of the argument which is of no consequence outside the obvious effect swapping the sign of the argument has on a function).
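If it helps, here's a small numerical check of that sign-flip relationship: a sketch assuming SciPy, using a standard exponential density as the example.

```python
import numpy as np
from scipy.integrate import quad

# Density of a standard exponential variate (chosen just as an easy example).
f = lambda x: np.exp(-x)

def bilateral_laplace(s):
    # Bilateral Laplace transform of f; f is zero on (-inf, 0), so the
    # integral effectively runs over [0, inf).
    return quad(lambda x: np.exp(-s * x) * f(x), 0, np.inf)[0]

def mgf(t):
    # Moment generating function E[e^{tX}]; finite only for t < 1 here.
    return quad(lambda x: np.exp(t * x) * f(x), 0, np.inf)[0]

for t in [-0.5, 0.0, 0.3, 0.7]:
    # M_X(t) should equal the bilateral Laplace transform evaluated at -t,
    # and both should match the closed form 1/(1 - t).
    print(t, mgf(t), bilateral_laplace(-t), 1.0 / (1.0 - t))
```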
If Fourier transforms and Laplace transforms help give you intuition about what mgfs and cfs "are" you should certainly exploit those intuitions, but on the other hand, it's not always necessary to have intuition when manipulating these things.
In fact when playing with cfs, because they always exist and are unique, I often tend to think of them as just the distribution looked at through a different lens.
I can see that taking the derivative of the function and evaluating at t=0 gives the moment (if the integral is absolutely convergent), but why?
Because the particular generating function we chose to use (the mgf) was set up to work that way. In order to be able to extract the set of moments from the function again, you need something like that -- a way to eliminate all the lower ones (such as differentiating) and all the higher ones (such as setting the argument to 0) so that you can pick out the exact one you need. For that to happen you already need something that works kind of like an mgf. At the same time, it's nice if it has some other properties you can exploit (as the various generating functions we use with random variables do), so that restricts our set of choices even further.
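Here is a short SymPy sketch of that mechanism, using the usual Normal($\mu,\sigma^2$) mgf as the example (the symbol names are mine): differentiating knocks out the lower-order terms, setting $t=0$ knocks out the higher ones, and the Taylor series displays the "clothesline" of moments directly.

```python
import sympy as sp

t = sp.symbols('t')
mu, sigma = sp.symbols('mu sigma', positive=True)

# mgf of a Normal(mu, sigma^2) variable (standard textbook form).
M = sp.exp(mu * t + sigma**2 * t**2 / 2)

# E[X^k] = M^{(k)}(0): differentiate, then set t = 0.
m1 = sp.diff(M, t).subs(t, 0)        # -> mu
m2 = sp.diff(M, t, 2).subs(t, 0)     # -> mu**2 + sigma**2

# Equivalently, the Taylor series of M around 0 displays the moments directly:
# M(t) = 1 + E[X] t + E[X^2] t^2/2! + ...
series = sp.series(M, t, 0, 3)

print(sp.simplify(m1), sp.expand(m2))
print(series)
```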
[1] Wilf, H. (1994), generatingfunctionology, 2nd ed., Academic Press, San Diego. https://www.math.upenn.edu/~wilf/DownldGF.html
Best Answer
Note that with a discrete distribution the contribution of an atom of probability at some particular $x$-value $x_0$ has a characteristic effect on the mgf.
For example, the probability $p_0$ at $X=0$ will have a term in the mgf of $p_0e^{0\cdot t}=p_0$.
A probability of $p_1$ at $X=1$ will have a term in the mgf of $p_1e^t$.
A probability of $p_2$ at $X=2$ will have a term in the mgf of $p_2e^{2t}$.
A probability of $p_i$ at $X=x_i$ will have a term in the mgf of $p_ie^{x_it}$.
This all follows directly from the definition of the mgf.
The overall mgf will be a sum of such terms.
Consequently, if you see in the mgf a term of the form $ae^{bt}$, that means you have probability $a$ at $X=b$. If you can, simply convert all the terms into this form.
Once you have it in the form of a sum of terms like that, you can just write down the probability at any specific value.
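For instance, here is a small SymPy sketch of that procedure (the Binomial($4, 3/10$) example is made up purely for illustration): expand the mgf, rewrite it as a polynomial in $s=e^t$, and read the probabilities straight off the coefficients.

```python
import sympy as sp

t, s = sp.symbols('t s')
p = sp.Rational(3, 10)  # a made-up success probability, purely for illustration
n = 4

# mgf of a Binomial(n, p) variable: (1 - p + p e^t)^n.
M = (1 - p + p * sp.exp(t))**n

# Substitute s = e^t; the mgf becomes a polynomial in s whose coefficient on
# s^k is exactly P(X = k).
poly = sp.Poly(sp.expand(M.subs(sp.exp(t), s)), s)
probs = poly.all_coeffs()[::-1]   # coefficients from s^0 upwards

for k, prob in enumerate(probs):
    # The two columns agree: coefficient on e^{kt} vs. the binomial pmf.
    print(k, prob, sp.binomial(n, k) * p**k * (1 - p)**(n - k))
```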