8/14: Substantially edited in response to comments: added to 1st part, added new 2nd and 4th parts
There's also some discussion underneath, and a link to a partial write-up of a case of cut-elimination involving the Ackermann function
If you haven't found the Gentzen-style proof illustrative, I recommend trying the infinitary proof. (Not everyone likes it, but it's at least an alternate perspective on it.) Pohlers' "Proof Theory: An Introduction" gives a nice presentation of the proof, and it includes the proof in PA that ordinals below $\epsilon_0$ are well-founded. (This includes the fact that you use $N$-quantifier induction to prove well-foundedness up to $N+1$ exponents, and provides some insight into why additional quantifiers have the effect they do.)
One nice feature of the infinitary proof is that the ordinals are outright bounds on the heights of proofs, so if you're comfortable visualizing infinitary proofs, the source of the increase in bounds is quite explicit.
It's worth noting that "infinitary" is a bit of a misnomer. The infinitary proof takes the perspective that a proof of $\forall x\phi(x)$ should be a computable function $f$ so that, for each $n$, $f(n)$ is a proof of $\phi(n)$. Since the functions are all computable, there's nothing genuinely infinitary about it. People usually ignore the "computable" part for expository purposes, since it makes sense without that requirement, at which point it does look infinitary, but this is just because many of the ideas are clearer without constantly rechecking that the operations we're describing really are computable.
$ $
A follow-up point about what ordinals mean in proof theory. As pointed out in the comments, $\omega$ doesn't really mean "infinite"; it means "an unspecified fixed integer". Similarly, $\omega+3$ really means "not only is this integer unspecified, but it will take three steps to figure out which integer it is". So in the infinitary proof of cut-elimination, when we say that a proof has height $\omega$, we mean that it is a proof of, say, $\forall x\phi(x)$, where the height of the proof of $\phi(n)$ is $g(n)$ with $g(n)\rightarrow\infty$.
A proof of height $\omega+3$ is a proof with three additional steps below a proof of height $\omega$. More interestingly, a proof of height $\omega+\omega$ might be a proof of $\forall x\phi(x)\vee\forall y\psi(y)$ where for each $n$, we have a proof of $\phi(n)\vee\forall y\psi(y)$ of height $\omega+n$. To find a quantifier-free instantiation of this proof, we'd have to first plug in an $n$ for $x$ and then, at the appropriate level, an $m$ for $y$ to get a proof of $\phi(n)\vee\psi(m)$ of height $g(n)+h(n,m)$.
$ $
Speaking very loosely, the intuitions you describe above are perfectly sensible, but they basically only correspond to what's happening with fairly simple (mostly one quantifier) formulas. Cut-elimination corresponds to extracting computable information, and when you cut formulas with multiple quantifiers together, information has to flow back and forth between the two sides.
Informally, I think of it like this: suppose I want to cut a proof of
$$(\forall x\exists y\phi)\vee\psi$$
with a proof of
$$(\exists x\forall y\neg\phi)\vee\psi'$$
(cut-elimination is usually easier to think of in a one-sided form). The second proof produces a first proposal for a value of $x$ witnessing $\forall y\neg\phi$. The first proof might refute it by exhibiting a $y$. The second proof can now use that value of $y$ to compute a second guess at what $x$ is, the first proof refutes it again, and so on back and forth. This could go back and forth a large number of times (on the order of the ordinal of the second proof---that is, not just a fixed number of times, but, if the ordinal is something like $\omega+1$, a number of times based on the value of the first $y$, or, if it's $\omega+\omega+1$, the first $y$ tells us how many steps we go before reaching a value of $y$ which in turn tells us how many further steps we go, etc.).
This description is literally true when viewed through the lens of the functional interpretation. When viewed in terms of infinitary cut-elimination, these "back and forths" correspond to interleaved cuts: we place many copies of the first proof (more precisely, proofs of $\exists y\phi(n,y)\vee\psi$ for various values of $n$) throughout the second proof. Each of these creates new cuts, which correspond to taking sections of the second proof and placing them inside the many copies of the first proof which we've just created.
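To make the dialogue concrete, here is a toy sketch in Python. The names (`interact`, `propose`, `refute`) and the example "formula" are made up purely for illustration; this is a cartoon of the interaction, not the actual functional interpretation:

```python
def interact(propose, refute, rounds):
    """Toy model of the cut as a dialogue: `propose` guesses an x,
    claiming forall y not-phi(x, y); `refute` tries to answer with
    a y witnessing phi(x, y); each refutation feeds the next guess."""
    history = []
    for _ in range(rounds):
        x = propose(history)
        y = refute(x)
        if y is None:            # no refutation found: the guess stands
            return x
        history.append((x, y))
    return None                  # ran out of the allotted rounds

# Hypothetical example: phi(x, y) says "x < y <= 10", so the claim
# "there is an x with no such y" first becomes true at x = 10.
propose = lambda history: history[-1][1] if history else 0
refute = lambda x: x + 1 if x + 1 <= 10 else None
```

Here `interact(propose, refute, 20)` settles on $x=10$ only after eleven guesses; the number of rounds depends on the refuting witnesses along the way, mirroring how the ordinal of the second proof bounds the length of the back and forth.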
$ $
I think your indexing on the relationship between the fast-growing hierarchy and proof-theoretic ordinals is off by an exponential. The proof-theoretic ordinal of $I\Sigma_1$ is $\omega^\omega$, but it only proves the primitive recursive functions total (i.e. $f_n$ with $n<\omega$ in the fast-growing hierarchy). I'm not finding a reference on the exact relationship, but I would guess that the gap is consistent---that in the way two-quantifier induction is like $\omega^{\omega^n}$, the Conway chain notation is like $\omega^{\omega^2}$. So, while two-quantifier induction suffices, dealing with Conway chain notation will look like a substantial use of two-quantifier induction (indeed, I think it involves three levels of nested induction, two of which are over two-quantifier formulas, which would line up well with an ordinal of $\omega^{\omega^2}$).
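To see the nested-induction flavor of Conway chain notation concretely, here is a sketch of an evaluator in Python, following the standard reduction rules; the recursion descends on the chain's length, its last entry, and its second-to-last entry, matching the nested inductions just mentioned. (Evaluating anything beyond tiny chains is hopeless, of course.)

```python
def conway(chain):
    """Evaluate a Conway chained-arrow expression given as a list of
    positive integers, by the standard reduction rules."""
    if len(chain) == 1:
        return chain[0]
    if len(chain) == 2:
        return chain[0] ** chain[1]                # a -> b  =  a^b
    if chain[-1] == 1:                             # X -> 1      =  X
        return conway(chain[:-1])
    if chain[-2] == 1:                             # X -> 1 -> p =  X
        return conway(chain[:-2])
    # X -> (p+1) -> (q+1)  =  X -> (X -> p -> (q+1)) -> q
    X = chain[:-2]
    return conway(X + [conway(X + [chain[-2] - 1, chain[-1]]), chain[-1] - 1])
```

For instance `conway([2, 3, 2])` is $2\uparrow\uparrow 3 = 16$, while `conway([2, 2, 2]) = 4` (any chain starting $2\to 2$ evaluates to $4$).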
Having said that, your question is still perfectly sensible: how does the back and forth I described above produce things that are very fast growing? Suppose the first proof, of $(\forall x\exists y\phi)\vee\psi$, has a sensible rate of growth, say, exponential. The second proof iterates the rate of growth of the first proof, and can iterate it an ordinal number of times based on the height of the second proof. The second proof could easily (given perhaps some additional cuts over smaller formulas) have, say, height $\omega^2$. That is, if we take $f_0(n)=n^n$, $f_{\alpha+1}(n)=f_\alpha(f_\alpha(n))$, and $f_\lambda(n)=f_{\lambda[n]}(n)$, we should hit around level $f_{\omega^2}(n)$ in this hierarchy.
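As a sketch, this hierarchy can be written out directly for ordinals below $\omega^2$, representing $\omega\cdot a+b$ as the pair $(a,b)$ (an ad hoc encoding chosen here for illustration) and using the fundamental sequence $(\omega\cdot a)[n]=\omega\cdot(a-1)+n$:

```python
def f(alpha, n):
    """Fast-growing hierarchy from the text, for ordinals below w^2.
    alpha = (a, b) encodes the ordinal w*a + b."""
    a, b = alpha
    if a == 0 and b == 0:
        return n ** n                      # base case: f_0(n) = n^n
    if b > 0:                              # successor: f_{alpha+1}(n) = f_alpha(f_alpha(n))
        return f((a, b - 1), f((a, b - 1), n))
    return f((a - 1, n), n)                # limit: f_{w*a}(n) = f_{w*(a-1)+n}(n)
```

Already $f_2(2) = f_1(256) = f_0(256^{256})$ is far too large to write down, which is the point.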
Best Answer
This is really an expansion of Aaron Bergman's and others' comments, as well as information that exists in the Wikipedia articles on Goodstein's theorem and epsilon numbers. I will sketch a proof of Goodstein's theorem and you'll have to tell me whether Kronecker would have accepted it as a legitimate argument. (I'm not sure if this is the best 'elementary-looking' proof, and I don't know if this would be considered a long proof. If somebody knows a simpler proof that looks elementary, please tell me. I'm also not sure how constructive this argument can be. As written, it certainly isn't.)
Let $T$ be the collection of finite rooted trees. For each natural number $n$, let $T_n \subset T$ be the collection of rooted finite trees of height at most $n$ (we'll say that the rooted tree with a single node has height $0$).
We will define a linear order $\sqsubset$ on $T$ inductively by defining it on each $T_n$. $T_0$ has a single element, so the order on $T_0$ is the trivial order. Now for any $n$, given the order on $T_n$, to any $a \in T_{n+1}$ we can associate a finite sequence of elements of $T_n$, which we'll write $s(a)$, by listing the children of the root node of $a$ in descending order according to $\sqsubset$ (which we have already defined). Given $a,b \in T_{n+1}$, we say that $a\sqsubset b$ if $a\neq b$ and at the first index $i$ at which $s(a)$ and $s(b)$ differ, either $s(a)(i)$ doesn't exist or it does and $s(a)(i) \sqsubset s(b)(i)$.
It is not hard to show that $\sqsubset$ is a linear order on $T_{n+1}$, that $\sqsubset$ as defined in $T_{n+1}$ agrees with $\sqsubset$ as defined on $T_n$, and that $T_n$ is an initial segment of $T_{n+1}$ under $\sqsubset$. Therefore we can treat $\sqsubset$ as a linear order on all of $T$. Note that while $T$ is an infinite set, the elements of $T$ are finite objects (and can be coded by natural numbers) and the relation $a\sqsubset b$ is computable.
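Since trees are finite objects, this is easy to try out. Here is a sketch in Python (the function names are mine), encoding a tree as the tuple of its children, kept in descending order; with this encoding, $\sqsubset$ in fact agrees with Python's built-in lexicographic comparison of nested tuples:

```python
from functools import cmp_to_key

def compare(a, b):
    """The order from the text: compare the child sequences s(a), s(b)
    position by position; at the first disagreement, a missing entry or
    a smaller subtree makes the whole tree smaller."""
    for x, y in zip(a, b):
        c = compare(x, y)
        if c != 0:
            return c
    return (len(a) > len(b)) - (len(a) < len(b))

def tree(*children):
    """Build a tree from its children, sorted into descending order."""
    return tuple(sorted(children, key=cmp_to_key(compare), reverse=True))

leaf = tree()   # the unique tree of height 0
```

For instance `compare(tree(leaf, leaf), tree(tree(leaf)))` is negative: the tree coding "$2$" sits below the tree coding "$\omega$".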
Lemma. For every $n$, any function $f : \mathbb{N} \rightarrow T_n$ which is non-increasing under $\sqsubset$ is eventually constant.

Proof. We will show this by induction on $n$. The proposition is obvious for $n=0$. Assume that we've shown it for $T_{n}$ and let $f : \mathbb{N} \rightarrow T_{n+1}$ be a non-increasing function. Clearly if $f(k)$ is ever the unique element of $T_0$, then we're done, so assume that $f(k)$ is never in $T_0$.
Claim: For every $i \in \mathbb{N}$, the partial function $k \mapsto s(f(k))(i)$ is eventually constant (i.e. either eventually always defined and a constant value or eventually always undefined).
Proof of claim. We will prove this by induction on $i$. For $i = 0$, we have that $k \mapsto s(f(k))(0)$ is a non-increasing function whose codomain is $T_n$. By the overall induction hypothesis (on $n$), this implies that $k \mapsto s(f(k))(0)$ is eventually constant.
Assume that we've shown the claim for some $i$ and consider the partial function $k \mapsto s(f(k))(i+1)$. If $k \mapsto s(f(k))(i)$ is eventually always undefined, then $k \mapsto s(f(k))(i+1)$ is also eventually always undefined and we're done, so assume that $k \mapsto s(f(k))(i)$ is eventually always defined and let it be constant on inputs larger than $m$. Consider $k \mapsto s(f(m + k))(i+1)$. If this is ever undefined, then it is undefined for any larger inputs as well, so assume that it is always defined. In this case, we have that this is a non-increasing function from $\mathbb{N}$ to $T_n$, so by the overall induction hypothesis we have that it must eventually be constant as well. So in both cases it is eventually constant.
Therefore by induction, $k \mapsto s(f(k))(i)$ is eventually constant for every $i \in \mathbb{N}$, as required. $\square_{\text{claim}}$
For each $i$, let $g(i)$ be the eventual value of the map $k \mapsto s(f(k))(i)$ (where we take $g(i)$ to be undefined if $s(f(k))(i)$ is undefined for sufficiently large $k$). We have that $g(i)$ is defined on an initial segment of the natural numbers, possibly all of them. Furthermore, since $s(a)$ is always in decreasing order, we have that $g(i)$ is a non-increasing partial function from $\mathbb{N}$ to $T_n$. This means that if $g(i)$ is defined for all $i$, by the induction hypothesis we have that it is eventually constant.
Assume for the sake of contradiction that $g(i)$ is defined for all $i$. Let $a$ be its eventual value and let $j$ be some index such that $g(j) = a$. Choose $m$ such that for all $k \in \mathbb{N}$, $s(f(m+k))(j) = a$. Since $k \mapsto s(f(m+k))(j)$ is constant, we have that $k\mapsto s(f(m+k))(j+1)$ is non-increasing, and in particular must be defined for all $k$, since otherwise $g(j+1)$ would be undefined. By construction we have that $s(f(m+k))(j+1) \sqsubseteq s(f(m+k))(j)$, so since the eventual value $g(j+1)$ of $k \mapsto s(f(m+k))(j+1)$ is also $a$ (as $g$ is non-increasing with eventual value $a$ and $g(j)=a$), we must actually have $s(f(m+k))(j+1) = a$ for all $k \in \mathbb{N}$. Iterating this argument shows by induction that for any $\ell \geq j$, $s(f(m+k))(\ell) = a$ for all $k \in \mathbb{N}$, so in particular $s(f(m))(\ell) = a$ for all $\ell \geq j$, but this contradicts the fact that $s(f(m))$ is a finite sequence.
Therefore $g(i)$ is undefined for all but finitely many $i$. Let $j$ be the largest index such that for all $i \leq j$, $g(i)$ is defined. Find $m$ such that $s(f(m+k))(i)$ is constant in $k$ for all $i \leq j$ and always undefined for $i = j+1$. Since $s(f(m+k))(j+1)$ is always undefined, we have that $s(f(m+k))(\ell)$ is always undefined for any $\ell > j$, which implies that $k \mapsto f(m+k)$ is in fact a constant map (a tree is determined by the sequence of children of its root), so we have that $f$ is eventually constant, as required. $\square$
Now from this lemma we immediately get that if $f : \mathbb{N} \rightarrow T$ is non-increasing under $\sqsubset$, then it is eventually constant, because each $T_n$ is an initial segment of $T$ and $f(0) \in T_n$ for some $n$. (Of course this is just the fact that $\varepsilon_0$ is well-ordered expressed with a particular notation for elements of $\varepsilon_0$, but I won't tell Kronecker if you don't.)
At this point the rest of the proof should be fairly familiar. Given any Goodstein sequence, $g_n$, we associate a sequence of elements of $T$ by converting the hereditary base-$(n+2)$ notation representation of $g_n$ to a finite rooted tree by the following: $0$ corresponds to the tree with a single node and the number $a_k (n+2)^k + a_{k-1} (n+2)^{k-1} + \dots + a_0$ corresponds to a tree where for each $\ell \leq k$, we attach $a_\ell$ copies of the tree associated to $\ell$ to the root node. It is clear that replacing $n+2$ by $n+3$ does not change the associated tree, but subtracting $1$ does produce a strictly smaller tree under $\sqsubset$. If we had a Goodstein sequence which did not terminate it would give us a function $f: \mathbb{N} \rightarrow T$ which was non-increasing but also not eventually constant, which contradicts our lemma. Therefore every Goodstein sequence must terminate.
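This correspondence is easy to experiment with. A sketch in Python (the helper names are mine): `to_tree(m, b)` builds the tree of the hereditary base-$b$ representation of $m$, `bump(m, b)` replaces the base $b$ by $b+1$, and `goodstein(m, steps)` runs the sequence:

```python
def to_tree(m, b):
    """Tree of the hereditary base-b representation of m: for each
    term d * b^k, attach d copies of the tree for k to the root."""
    children, k = [], 0
    while m > 0:
        m, d = divmod(m, b)
        children.extend([to_tree(k, b)] * d)
        k += 1
    # Children in descending order; on nested tuples built this way,
    # Python's lexicographic order agrees with the order from the lemma.
    return tuple(sorted(children, reverse=True))

def bump(m, b):
    """Rewrite m in hereditary base b, then replace every b by b+1."""
    result, k = 0, 0
    while m > 0:
        m, d = divmod(m, b)
        result += d * (b + 1) ** bump(k, b)
        k += 1
    return result

def goodstein(m, steps):
    """At most `steps` further entries of the Goodstein sequence from m."""
    seq, b = [m], 2
    while len(seq) <= steps and m > 0:
        m = bump(m, b) - 1
        b += 1
        seq.append(m)
    return seq
```

For instance `goodstein(3, 10)` returns `[3, 3, 3, 2, 1, 0]`; one can also check that `to_tree(bump(3, 2), 3) == to_tree(3, 2)` (bumping the base preserves the tree) while the subtraction strictly decreases it.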
Regarding the metamathematical properties of this proof, the place where we are using second order arithmetic is in the main induction in the lemma, because the statement is for all functions $f: \mathbb{N} \rightarrow T_n$, which is using a second-order quantifier. The thing is that we don't really need this full strength for Goodstein's theorem. It is good enough to know it for computable $f$ (since the function associated to a Goodstein sequence is computable), but note that in the inductive proof we use the induction hypothesis on $g(i)$, which is not a priori a computable function, even when $f$ itself is. Furthermore, even though $g(i)$ is in general only a partial function with finite domain (and hence a computable object), this is non-uniform. If we have a family of functions $f_p: \mathbb{N} \rightarrow T_n$ and we want to know the associated family $g_p(i)$, this will typically require the Turing jump of the family $f_p$ to compute. This is a sign that these proofs won't work very uniformly in Peano arithmetic, especially since unpacking a proof of the lemma for a specific $n$ will end up needing something like $n$ Turing jumps.
We can formalize an encoding of finite rooted trees as natural numbers in Peano arithmetic and prove many basic properties (e.g. everything before the lemma). For any finite $n$ and any family $f_p(k)$ of uniformly definable functions $f_p : \mathbb{N} \rightarrow T_n$, Peano arithmetic proves the following: if each $f_p$ is non-increasing under $\sqsubset$, then each $f_p$ is eventually constant.
The whole issue is that $\mathsf{PA}$ cannot prove this uniformly in $n$ for sufficiently complicated families $f_p$, such as, for instance, the family of computable functions.