In 1988 I published a paper (Relativistic Quantum Field Theory of the $\sigma$ + $\omega$ Model, Proceedings of the Internatiional Workshop on Relativistic Nuclear Many-Body Physics, 6-9 June 1988, World Scientific Press) that utilized the scalar Yukawa theory (ie your Lagrangian) for a calculation of infinite nuclear matter. I had begun this calculation in 1975 but did not complete it until 1988. While not quite as general as your question, my work did address the renormalization issues of this Lagrangian and my findings shed some light on the answer to your question. Briefly, there was no need for counter terms for the $\phi$ and $\phi^3$ fluctuations in the specific problem that I addressed. For the $\phi$ term, the fluctuation vanished by definition (since I was expanding about the finite expectation value $<\phi>$).
The $\phi^3$ fluctuation term also vanished, but the explanation was somewhat more involved. I treated the $\phi^4$ fluctuation via a harmonic oscillator motivated approximation ($<\phi^4>=3<\phi^2>^2$) and in that approximation the $\phi^3$ fluctuation was also easily shown to vanish. I used this approach to perform a nonperturbative calculation that relied upon a variational technique to sum some of the boson loop terms to infinite order. The approach that I employed was called quantum hadrodynamics (QHD) but it had already fallen out of favor by the time I completed the work due to the more fundamental QCD approach.
The two theories, namely the ``gradient model'' $\partial_\mu\pi \bar{\Psi}\gamma^5\gamma^\mu\Psi$ and the Yukawa model $g\pi\bar{\Psi}\gamma^5\Psi$ (both with a massive $\Psi$), are definitely not equivalent. They have different symmetries, spectrum and scattering amplitudes, hence are physically distinct theories.
The main mistake that you (the OP) are doing is using the free equations of motion for the fermions, but that's ok only for the external legs and not for the virtual ones that enter e.g. in the one-loop calculation of the $\pi$ mass, or as I'll show below in a scattering amplitude with an intermediate virtual $\psi$ exchanged. (The mistake that Cosmas Zachos was doing in his earlier answer and that in part is still doing in the marginally improved answer is explained in my comments to his answer, I will not repeat it here).
The gradient model is indeed invariant under $\pi\rightarrow \pi+const$ which clearly forbids a mass term for $\pi$. This isn't the case for the Yukawa model where a bare mass is needed to remove the quadratic divergent mass generated by the fermion loops. A physical pole mass is therefore generically non-zero, barring fine-tuning.
More importantly, Goldstone bosons (GB's) aren't just massless particles, they have various special features. For example, soft GBs (that is the limit of vanishing $\pi$-momentum) give vanishing scattering amplitudes (the so-called Adler zero condition). This is realized for the gradient theory but not for the Yukawa theory.
Let's see this in more detail looking at a physical scattering amplitude $\pi\Psi\rightarrow \pi\Psi$.
For the Yukawa theory one has
$$
M^{Yukawa}_{\pi\Psi\rightarrow\pi\Psi}=-g^2 \left[\bar{u}(p_2^\prime)\frac{i(\gamma_\alpha p_1^\alpha+\gamma_\alpha p_2^\alpha-m)}{s-m^2+i\epsilon}u(p_2)+\mbox{crossed diag.}\right]
$$
for $\pi(p_1)\Psi(p_2)\rightarrow\pi(p_1^\prime)\Psi(p_2^\prime)$. The $\gamma^5$ have been moved around and simplified with the numerator of the fermion propagator, i.e. $\gamma^5i(\gamma_\alpha p_1^\alpha+\gamma_\alpha p_2^\alpha+m)\gamma^5=-i(\gamma_\alpha p_1^\alpha+\gamma_\alpha p_2^\alpha-m)$. We could have simplified the numerator using $\gamma_\alpha p_2^\alpha u(p_2)=m u(p_2)$, where $m$ is the fermion mass, but it's more convenient this form in the following. There is an s-channel contribution, explicitly displayed, along with a crossed diagram that we do not display explicitly.
(disclaimer: I am doing this calculation by hand on an Ipad, I hope it is not grossly incorrect :-), although factors of 2 and minus signs are most likely off)
This $M^{Yukawa}_{\pi\Psi\rightarrow\pi\Psi}$ doesn't vanish for $p_1\rightarrow 0$ because, even though the numerator goes to zero (namely $\gamma_\alpha p_1^\alpha+\gamma_\alpha p_2^\alpha-m)u(p_2)=\gamma_\alpha p_1^\alpha u(p_2)\rightarrow 0$), so does the denominator at the same rate ($s-m^2=2p_{1\alpha} p_2^\alpha\rightarrow0$; here I am assuming that we have tuned the spectrum to be the same, that is the $\pi$ mass in the Yukawa model has been tuned to zero by hand, otherwise the numerator wouldn't even vanish and the comparison between the two models would make no sense).
On the other hand, for the gradient theory we get
$$
M^{gradient}_{\pi\Psi\rightarrow\pi\Psi}=\frac{1}{f^2}\left[\bar{u}(p_2^\prime)\frac{i(\gamma_\alpha p_1^\alpha+\gamma_\alpha p_2^\alpha-m)^3}{s-m^2+i\epsilon}u(p_2)+\mbox{crossed diag.}\right]\rightarrow 0
$$
which is not only different (hence the two theories are physically distinct, period) but it gives a vanishing amplitude in the GB soft limit $p_1\rightarrow 0$ since the numerator can be written as $i\bar{u}(p_2^\prime)(\gamma_\alpha p_1^\alpha+\gamma_\alpha p_2^\alpha-m)^3u(p_2)=i\bar{u}(p_2^\prime)\gamma_\alpha p_1^{\prime\alpha}(\gamma_\beta p_1^\beta+\gamma_\beta p_2^\beta-m)(\gamma_\gamma p_1^\gamma)u(p_2)$, and for momentum conservation $p_1^\prime\rightarrow 0$ too.
The takeaway message is: the two models are distinct physically and mathematically. The gradient theory describes a GB whereas the Yukawa theory describe a scalar with a mass tuned to be zero.
Extra edits
I have finally found some times to add a last remark that I mentioned in the comments but it is actually worth to report in the full answer. It is also related to the answer by @Cosmas Zachos.
Having established that the two theories are different, one may wonder how much different and what is the relation between the two, given that the simple use of the equations of motion by the OP was flawed.
The answer is very simple: the two theories differ at the non-linear $\pi$-level, starting from the quadratic order. In particular, the claim is that the theory given by
$$
\mathcal{L}_{new}=\bar{\Psi}(i\gamma^\alpha \partial_\alpha-m e^{-2i\gamma^5\pi/f})\Psi+\frac{1}{2}(\partial_\mu\pi)^2\qquad f\equiv 2m/g\,,
$$
which differs from $\mathcal{L}_{Yukawa}=\bar{\Psi}(i\gamma^\alpha \partial_\alpha-m)\Psi+ig\pi \bar{\Psi}\gamma^5\Psi+\frac{1}{2}(\partial_\mu\pi)^2$ starting from $o(\pi^2)$, is in fact equivalent to the gradient theory
$$
\mathcal{L}_{gradient}=\frac{1}{2}(\partial_\mu\pi)^2+\bar{\Psi}(i\gamma^\alpha \partial_\alpha-m)\Psi+\frac{1}{f}\partial_\mu\pi \bar{\Psi}\gamma^5\gamma^\mu\Psi\,.
$$
Indeed, it's enough to perform the field redefinition $\Psi\rightarrow e^{i\gamma^5 \pi/f}\Psi$ to move the $\pi$ from the non-derivative term to the gradient coming from the $\Psi$-kinetic term.
As ultimate check, let's see the behavior under the soft limit $p_1\rightarrow 0$. The contributions from two linear-$\pi$ vertex insertions is like in the Yukawa theory, but now there is also a contact term coming from expanding the exponential, $\frac{2m}{f^2}\pi^2\bar{\Psi}\Psi$, i.e.
$$
M^{new}_{\pi\Psi\rightarrow\pi\Psi}=M^{Yukawa}_{\pi\Psi\rightarrow\pi\Psi}+i\frac{4m}{f^2}\bar{u}(p_2^\prime)u(p_2)\,.
$$
Now, in the soft limit $p_1\rightarrow 0$, we have $p_2^\prime \rightarrow p_2$ hence the Yukawa terms give
$$
M^{Yukawa}_{p_1\rightarrow 0}=-g^2 \left[\bar{u}(p_2)\frac{i\gamma_\alpha p_1^\alpha}{2 p_1 p_2}u(p_2)+\mbox{crossed diag.}\right]=-2ig^2
$$
where I have used that $\bar{u}(p)\gamma^\alpha u(p)=2p^\alpha$. On the other hand, the new contact term in from $\mathcal{L}^{new}$ gives
$$
i\frac{4m}{f^2}\bar{u}(p_2)u(p_2)=i\frac{8m^2}{f^2}=2i g^2\,,
$$
from $\bar{u}(p)u(p)=2m$ (no sum over the two spin orientations, we are considering definite polarizations). Summing the two contributions we see that they vanish each other out, in agreement with the Adler-zero condition.
But again, the equivalence with the gradient theory is achieved only after modifying the theory at the $o(\pi^2)$ level in the way shown above, which corresponds to render the $\pi$ a GB. The Yukawa coupling alone instead is for non-GB particles.
Best Answer
The renormalization condition is exactly that the propagator has a pole and residue of 1 at \begin{equation} \gamma^\mu p_\mu=M \end{equation} which leads to \begin{equation} p^2=(\gamma p)^2=M^2 \end{equation} I think your problem here is that you mermorized the renormalization condition as $p^2=-M^2$ which is not true. To understand the condition properly, just bring up the free-field propagator \begin{equation} \frac{i}{p^2-M^2} \end{equation} or \begin{equation} \frac{i}{\gamma^\mu p_\mu-M} \end{equation} notice the real-corrected propagator behaves the same as the free-field near the pole, in this way you may not have a memoral mistake.