In very rough outline, Gentzen's proof is as follows. He defines primitive recursive (with respect to standard codings) functions $C$ from proofs in PA to ordinals $<\epsilon_0$ and $E$ from proofs to proofs, with the following properties, for all proofs $p$. (1) If $C(p)>0$ then $C(E(p))<C(p)$. (2) The conclusion of $E(p)$ is the same as that of $p$. (3) If the conclusion of $p$ is $0=1$ then $C(p)>0$. These three properties are proved by finitary reasoning formalizable in PRA.
Given this information plus the well-ordering property of $\epsilon_0$, he gets the consistency of PA as follows. Suppose there were a PA-proof of $0=1$. Then we can generate more proofs of $0=1$ by iterating $E$: $p_0=p$ and $p_{n+1}=E(p_n)$. All $p_n$ are proofs of $0=1$ by (2); $C(p_n)>0$ by (3); and $C(p_{n+1})<C(p_n)$ by (1). But an infinite decreasing sequence of ordinals $C(p_n)$ contradicts the well-ordering assumption. So there cannot be a PA-proof of $0=1$.
Note that well-ordering was used only in the weaker form "There is no infinite, decreasing, primitive recursive sequence in $\epsilon_0$."
A little more about $C$ and $E$: In the deductive systems used by Gentzen, a particular rule called "cut" is responsible for most of the proof-theoretic complexity. In particular, it is trivial that $0=1$ cannot be proved without using cut (property (3) above). $C(p)$ measures "how much" cut is involved in the proof $p$; this is not just a matter of the number of applications of the cut-rule in $p$ but also the complexity of the formulas to which this rule is applied. $E(p)$ is the result of one step in a process called cut-elimination. The whole process would, as its name says, get rid of the applications of the cut rule in a proof, while maintaining the correctness and the conclusion of the proof (property (2) above). One step of the process, i.e., $E$, only reduces the cut-rank, i.e., property (1) above.
This is not exactly an answer to your question, since it's more about the particular result that $\mathsf{PA} + \neg\mathsf{Con}(\mathsf{PA})$ is interpretable in $\mathsf{PA}$. Still, I hope it is of some help.
This result was proved by Feferman in his important paper "Arithmetization of Metamathematics in a General Setting" (it's theorem 6.5 in the paper). The notation in the paper is a bit heavy and old-fashioned, though, so it may take some time to get used to it. I'm not going to prove his result here, since the proof is rather laborious, but I do want to make some quick remarks about it.
(1) The first thing to note is that the result relies on the following fundamental point made in Feferman's paper, namely that some care must be taken when handling consistency statements (for an introductory treatment of the same point, cf. chapter 36 of Peter Smith's Gödels book). In particular, in that paper, Feferman constructs consistency statements relative to a particular formula coding the theory, and this may be relevant.
That is, if $T$ is a theory, then he roughly says that a formula $\alpha(x)$ of the language of arithmetic numerates the theory if for every sentence $\phi$ of the language of the theory, $\phi \in T$ iff $\mathsf{Q} \vdash \alpha(\ulcorner \phi \urcorner)$, where $\mathsf{Q}$ is Robinson's arithmetic. If, moreover, $\alpha$ is such that if $\phi \not \in T$ iff $\mathsf{Q} \vdash \neg \alpha(\ulcorner \phi \urcorner)$, then $\alpha$ is said to bi-numerate the theory. (These notions correspond to what is generally called weakly represent and strongly represent, respectively.)
Anyway, the point is that consistency statements are relative to such bi-numerations, so that they are better expressed by Feferman's notation $\mathsf{Con}_\alpha(T)$. Indeed, by exploiting some coding tricks, Feferman shows (Theorem 5.9) roughly that if $T$ is a recursive consistent extension of $\mathsf{PA}$, then there is a rather strange bi-numeration $\alpha^*$ of $T$ such that $T \vdash \mathsf{Con}_{\alpha^*}(T)$. In particular (Corollary 5.10), there is a bi-numeration $\pi^*$ of $\mathsf{PA}$ such that $\mathsf{PA} \vdash \mathsf{Con}_{\pi^*}(\mathsf{PA})$.
As Feferman notes, this does not contradicts Gödel's theorem because these bi-numerations do not "properly express membership" in the given theory: "Indeed, inspection of the proof of 5.9 reveals that it expresses membership in a certain subsystem of [$T$] which, independent of the consistency of [$T$], is always consistent" (p. 69).
In fact, using a similar technique, Feferman also shows that, letting $\alpha$ be a bi-numeration of $\mathsf{PA}$ and setting $T=\mathsf{PA} + \neg \mathsf{Con}_\alpha(\mathsf{PA})$, there is a bi-numeration $\beta^*$ of $T$ such that $\mathsf{PA} \vdash \mathsf{Con}_{\beta^*}(T)$! (This is theorem 5.11.)
Using these results, Feferman then shows, first, that if $T$ is a theory and $\alpha$ is a numeration of $T$, then $T$ is interpretable in $\mathsf{PA} + \mathsf{Con}_\alpha(T)$ (theorem 6.2), which he then uses to prove (a result that implies) that there is a numeration $\alpha$ of $\mathsf{PA}$ such that $\mathsf{PA} + \neg \mathsf{Con}_\alpha(\mathsf{PA})$ is interpretable in $\mathsf{PA}$. The idea is roughly this, again using $T = \mathsf{PA} + \neg\mathsf{Con}_\alpha(\mathsf{PA})$:
By 5.11 there is a bi-numeration $\beta^*$ such that $\mathsf{PA} \vdash \mathsf{Con}_{\beta^*}(T)$. By 6.2, $T$ is interpretable in $\mathsf{PA} + \mathsf{Con}_{\beta^*}(T)$. But we have just seen that $\mathsf{PA} + \mathsf{Con}_{\beta^*}(T)$ just is $\mathsf{PA}$. Hence, the result follows.
(2) As Feferman notes, this roughly means that "we can construct a 'non-standard model' of [$\mathsf{PA}$] within [$\mathsf{PA}$] which, moreover, we can verify, axiom by axiom, to be a model of [$\mathsf{PA} + \neg\mathsf{Con}_{\beta^*}(\mathsf{PA})$]" (p. 77). Moreover, as he also observes in a footnote appended to this text, this is not all that surprising, given Gödel's theorems. The second incompleteness theorems basically states that, if $\mathsf{PA}$ is consistent, then so is $\mathsf{PA}$ extended with the negation of its consistency statement. Given that we generally use interpretations to prove relative consistency, this is basically a translation of that idea to relative consistency proofs.
EDIT: Well, I'm far from an expert on these questions, but here are my two cents (take these with lots of salt!):
First, it seems to me that there are two separate issues in the background of your question: (i) whether interpretation preserves meaning and (ii) whether the incompleteness of $\mathsf{ZF} + \neg \mathsf{Inf}$ is due to the negation of the axiom of infinity or some other limitation (the issue about arbitrary sets). Let's tackle these in order.
(i) It is true that just having an interpretation between theories is generally not sufficient to preserve "meaning", whatever that is. Indeed, one can have two theories being mutually interpretable without these interpretations preserving nice properties such as decidability, etc. Still, when two theories are mutually interpretable, perhaps the interpretations are of such a nature that we might as well identify the two theories. Here is a reasonable test. Suppose two theories, $T$ and $T'$, are mutually interpretable with interpretations $i: T \rightarrow T'$ and $j: T' \rightarrow T$. Suppose moreover that $i \circ j$ is the identity on $T'$ and $j \circ i$ is the identity on $T$, i.e. when I translate a formula using $i$, and then translate back using $j$, I always end up with the formula which I originally started (and vice-versa). When this situation takes place, say that the theories are bi-interpretable. Now, bi-interpretation can be reasonably taken to imply sameness of "meaning", since it preserves most of the interesting properties of the theories (technically, this is usually called synonymy, but the difference here is not relevant---cf. this article by Friedman and Visser for more on the difference).
So, given this, what is the situation with $\mathsf{PA}$ and $\mathsf{ZF} + \neg\mathsf{Inf}$ (which I'll call in the sequel $\mathsf{ZF}_{\mathsf{Fin}}$)? These theories are mutually interpretable, but, disappoitingly, they are not bi-interpretable. In fact, they don't satisfy a weaker requirement of "sentential equivalence", as shown by Enayat, Schmerl, and Visser in "$\omega$-models of Finite Set Theory", theorem 5.1; the proof is not difficult, but it does use some facts about the model theory of $\mathsf{PA}$ (and of $\mathsf{ZF}_{\mathsf{Fin}}$).
On the other hand, there is a theory in the vicinity which is bi-interpretable with $\mathsf{PA}$, namely $\mathsf{ZF}_{\mathsf{Fin}} + \mathsf{TC}$, where $\mathsf{TC}$ is the axiom which states that every set is contained in a transitive set (cf. this article by Kaye and Wong). The issue is related to $\varepsilon$-induction: adding this axiom is essentially equivalent to adding $\varepsilon$-induction (see again the article by Kaye and Wong). So there is a very strong sense in which these two theories are the same.
(ii) On the other hand, there is the issue about whether the fact that $\mathsf{ZF}_{\mathsf{Fin}} + \mathsf{TC}$ cannot prove Goodstein's theorem is due to the absence of an axiom of infinity or something else. From what I understand, he's probably referring to the fact that first-order $\mathsf{ZFC}$ does not fully capture the idea of an arbitrary set (cf. this article by Ferreirós for an analysis of the notion). Now, I'm really out of my depth here, but I thought the issue arose only for infinite sets. Is there an hereditarily finite set that is not first-order definable? If not, then his complaint is mute. If, however, there are such sets, then he may be on to something.
Best Answer
This is far from optimal, but it is fun:
Note that PRA isn't literally in the same language as PA, so by "PRA" we really mean "an appropriate rephrasing of PRA in the language of PA." This isn't an issue in this context, but it's worth pointing out.
Let's shift attention. The theory $I\Sigma_1$ consists of basic arithmetic (the ordered semiring axioms) and induction for $\Sigma_1$-formulas. It's not hard to check that $I\Sigma_1$ contains PRA ($I\Sigma_1$ proves "every primitive recursive function is total").
So it's enough to show that "$I\Sigma_1+\epsilon_0$" doesn't contain PA. Why is $I\Sigma_1$ a better theory to use here? Well, it turns out that $I\Sigma_1$ is finitely axiomatizable. Hence $I\Sigma_1+\epsilon_0$ is also finitely axiomatizable.
... So what? Well, this is the neat bit: there is a beautiful model-theoretic proof of the incompleteness of PA (due to Kripke, written up by Putnam) which has as a corollary that no finitely axiomatizable extension of PA in the language of PA is consistent (the key aspect of PA being that it proves the consistency of each of its finite subtheories). Bam, we're done.