In $\overline{MS}$ the $\beta$ function does not depend on the gauge parameter. This means that the dependence on $\mu$ in $Z$ only comes from the coupling constant $a \propto \mu^{-2 \epsilon}$.
For general $Z(a_s,\xi)$, the relation is as follows (eq. 21 in the Chetyrkin paper):
$$-\gamma =\left(-\epsilon + \beta(a_s) \right) a_s \frac{\partial \log Z}{\partial a_s}+\gamma_3(a_s,\xi)\xi \frac{\partial \log Z}{\partial \xi}$$
At least in a model with only one species, the mass is (inversely) related to the correlation length, so one way to build intuition about mass renormalization is to think about it in terms of how the interaction term affects the correlation length. This is meant to address the question "why do you need to renormalize the mass" in a relatively straightforward, mathematically clear way.
To be specific, consider the $\phi^4$ model. After replacing continuous $D$-dimensional space with a finite lattice to make everything mathematically unambiguous, the Hamiltonian may be written
$$
H = \frac{b^D}{2}\sum_\mathbf{x}\left(\dot\phi^2(\mathbf{x})+\sum_\mathbf{b}\left(\frac{\phi(\mathbf{x}+\mathbf{b})-\phi(\mathbf{x})}{b}\right)^2+\mu\phi^2(\mathbf{x})+\frac{\lambda}{12}\phi^4(\mathbf{x})\right)
$$
where $\mathbf{x}$ is a lattice site, $\mathbf{b}$ is a lattice basis vector, and $b$ is the lattice spacing. The sum over $\mathbf{x}$ is a lattice version of an integral over all space, and the term with the sum over $\textbf{b}$ is a lattice version of the gradient term $(\nabla\phi(\mathbf{x}))^2$. The overhead dot on $\dot\phi$ denotes the time-derivative of $\phi$ (in the Heisenberg picture), as usual.
Now, here's the intuition. First suppose that $\lambda=0$. In this case, we know that the physical mass $m$ is related to the coefficient $\mu$ by $\mu=m^2$. The kinetic term, the only term that connects different lattice sites, is responsible for the fact that correlation function
$$ \langle 0|\phi(\mathbf{x})\phi(\mathbf{y})|0\rangle - \langle 0|\phi(\mathbf{x})|0\rangle\,\langle 0|\phi(\mathbf{y})|0\rangle
$$
is non-zero for $\mathbf{x}\neq\mathbf{y}$. After re-scaling the field and the time-parameter to put the $\lambda=0$ Hamiltonian in the form
$$
H = \frac{b^D}{2}\sum_\mathbf{x}\left(\dot\phi^2(\mathbf{x})+\sum_\mathbf{b}\left(\frac{\phi(\mathbf{x}+\mathbf{b})-\phi(\mathbf{x})}{mb}\right)^2+\phi^2(\mathbf{x})\right),
$$
we see that the correlation length is necessarily determined by the combination $mb$, so the correlation length must be $1/m$ in units of the lattice spacing $b$.
Now suppose that $\lambda>0$ and $\mu=0$. Although the 2-point correlation function cannot be computed in closed from any more, the same scaling argument indicates that the correlation length is determined by the combination $b\sqrt{\lambda}$. The relationship between correlation and physical mass (which is related to identifying the physical mass with the pole in the propagator) then tells us that the physical mass must be non-zero in this case, even though $\mu=0$. In other words, when $\mu=0$, the physical mass is induced entirely by the coupling term.
To see what happens when $\mu$ and $\lambda$ are both non-zero, choose any value $\lambda\geq 0$ and consider how $\mu$ must be tuned to make the correlation length infinite (which corresponds to zero physical mass). If $\lambda=0$, then we know that the choice $\mu=0$ makes the correlation length infinite. We just saw that if $\lambda>0$, then the correlation length is finite if $\mu=0$, so to make the correlation length infinite again we must choose $\mu<0$ to offset the effect of the interaction term. Letting $\mu_c(\lambda)$ denote this special (negative) value of $\mu$ that makes the correlation length infinite, this says that if $\lambda>0$, then choosing $\mu>\mu_c(\lambda)$ will give a non-zero physical mass (that is, a finite correlation length), even if $\mu$ is still negative!
By the way, choosing $\mu<\mu_c(\lambda)$ gives spontaneous symmetry breaking (of the discrete $\phi\rightarrow -\phi$ symmetry).
This whole picture is confirmed by numerical calculations, some of which can be found in Luscher and Weisz (1987), "Scaling laws and triviality bounds in the lattice $\phi^4$ theory (I). One-component model in the symmetric phase", Nuclear Physics B 290: 25-60, and some in Hasenbusch (1999), "A Monte Carlo study of leading order scaling corrections of $\phi^4$ theory on a three dimensional lattice", https://arxiv.org/abs/hep-lat/9902026.
Best Answer
I believe that this may be what you’re looking for. I don’t claim to be an expert in renormalization, so all members of the Physics Stack Exchange Community are welcome to correct my answer.
To make the relationship between the two more precise, let’s call the on-shell renormalization scheme the ‘on-shell subtraction scheme’ and the BPHZ renormalization scheme the ‘BPHZ algorithm’.
The on-shell subtraction scheme tells us how to subtract the divergent parts of a Feynman diagram, while the BPHZ algorithm tells us how to apply the subtractions to a multi-loop Feynman diagram in a sensible manner so that the problem of overlapping divergences is overcome.
Hence, the relationship between the two concepts is precisely this:
Without a subtraction scheme that yields a subtraction operator, the BPHZ algorithm is useless, because we have to tell the algorithm how to subtract!
Without the BPHZ algorithm, the on-shell subtraction scheme is very limited in scope, because it doesn’t tell us how to apply subtractions properly to address the technical difficulties associated with overlapping divergences.
The BPHZ algorithm also works with other subtraction schemes, like the minimal subtraction scheme. It can also be used with the momentum subtraction scheme, of which the on-shell subtraction scheme is just a single instance.
Now, Anthony Duncan explains what I’ve just mentioned in awesome detail in his book The Conceptual Framework of Quantum Field Theory. You may also consult John Collins’ book Renormalization.