Question 1
Nontrivial RG flow is the result of explicit breaking of classical theory scale invariance in corresponding quantum field theory. If there is no dimensionful parameters in classical lagrangian of corresponding theory (the generalization on the presence of masses is straightforward, but is not relevant here), we naively expect that after scaling transformation,
$$
\Phi (x) \to e^{\sigma \epsilon}\Phi (e^{\epsilon}x), \quad x \to e^{\epsilon}x,
$$
with $\sigma$ being the canonical dimension of $\Phi$ field and $\epsilon$ being the continuous parameter of transformation, the action will be unchanged. Corresponding symmetry defines dilatation current conservation law:
$$
\partial_{\mu}D^{\mu} =0
$$
This naive law is completely broken by infinities of QFT (such breaking is called the trace anomaly), because of which regularization enters the game. I.e, we introduce the dimensionful parameter by hand, and initial free from dimensionful parameters lagrangian begins to contain the one, called $\mu$. Since it is dimensionful, it is called the scale of theory. However, it is unphysical, and we can't say that it is the scale on which theory is defined: for example, we can choose it so that it coincides with square of transfer momentum, but it is only formal correspondence which depends on our wish.
In general, because of the presence of scale the dilatation current conservation law is modified by quantum corrections. For example, for massless QED
$$
\partial_{\mu}D^{\mu} \sim \beta (\alpha)F_{\mu \nu}F^{\mu \nu}, \quad \alpha \equiv \frac{e^{2}}{4 \pi}
$$
This leads to nontrivial behaviour of lagrangian parameters (like couplings) with changing of $\mu$.
What's about your question, $\mu$ as the scale on which theory is defined? The answer is dimensional transmutation phenomena, which occurs because of described above breaking of scaling invarince. Precisely, by solving RG equation (here $\alpha$ is the running coupling)
$$
\mu\frac{d\alpha}{d\mu} = \beta (\alpha (\mu))
$$
we obtain that
$$
\alpha (\mu) = f(\mu , \mu_{0}, \alpha (\mu_{0}))
$$
We can invert this relation and use dimensionful parameter $\mu_{0}$ instead of $\alpha $ in perturbation theory (and this is what people call the dimensional transmutation). Such parameter is really physical scale: it defines set of theory parameters. For example, for QCD it defines the strong coupling scale, which is closely related to the confinement and chiral symmetry breaking scale $\Lambda_{QCD}$. The latter determines the scale at which effective theory which describes hadrons interactions works.
Question 2
1. General remarks
The scheme of renormalization precisely defines renormalization constants, including their finite part. In general, the renormalization constants are given as (for example, for dimensional regularization)
$$
\tag 0 Z_{i} = a_{i} + \sum_{j = 1}\frac{c^{(i)}_{j}}{\epsilon^{j}},
$$
We have the freedom for choosing $a_{i}$, while $c^{(i)}_{j}$ are completely fixed by the structure of infinities in theory. The renormalization group states that the scheme dependence of the physical observables is absent.
Your question is following: suppose that we have specifit renormalization scheme for which the scale parameter $\mu$ doesn't affect parameters of theory - particularly, the mass parameter, which is fixed by the pole of propagator - why do we introduce the other scheme, for which the mass becomes to run and the RG equations enter the game?
The specific renormalization scheme is called on-shell scheme, while the convenient scheme with the precence of the scale in the expression for the mass is called minimal subtraction. So what's the point?
2. On-shell renormalization scheme: limitations
Let's assume that you use on-shell renormalization scheme. For this scheme $a_{i}$ is not zero, and it is uniquely fixed by specific conditions.
Lets assume the simplest case - scalar theory with self-interaction, and lets concentrate on the mass renormalization. After computing the self-energy by keeping this scheme you have that the propagator is
$$
D^{-1}(p^{2}) = p^{2} - m_{\text{pole}}^{2} - \Sigma (p^{2}),
$$
where $m_{\text{pole}}$ is the physical mass, for which
$$
\tag 1 D^{-1}(m_{\text{pole}}^{2}) = 0
$$
(since it is observable than it doesn't depend on the $\mu$ scale), and $\Sigma (p^{2})$ is self-energy. Eq. $(1)$ expicitly reads
$$
\tag 2 \Sigma (p^{2} = m_{\text{pole}}^{2}) = 0
$$
Also, the requirement that the propagator has the unity residue leads to the statement that
$$
\tag 3 \left(\frac{d\Sigma (p^{2})}{dp^{2}}\right)_{p^{2} = m_{\text{pole}}^{2}} = 0
$$
This condition, let me remind, in fact is nothing but the requirement that the propagator corresponds to the one-particle state.
Note two things about $\Sigma (p^{2})$ in on-shell scheme. The first one is that it doesn't depend on the scale $\mu$ since the mass $m_{\text{pole}}$ is indeed scale independent, and this result, of course, is regularization scheme independent. The second is that the condition $(3)$ can't be satisfied in the limit of $m_{\text{pole}}^{2} = 0$, since in massless limit Callen-Lemman representation of the propagator (which just binds the pole of the Green function with one-particle state) doesn't contain isolated pole: the one-particle state with zero energy isn't different from many-particle states.
We can't deal with this problem without introducing regularizing scale $m_{\text{reg}}^{2}$. It is indeed unphysical, and in general this is the price for obtaining $\mu$-scale independent quantities in theories with massless states. Note that the most of realistic theories are ones with massless states. For example, QED in on-shell prescription suffers from IR divergences in self-energy because of exactly zero photon mass.
To avoid such singularities, we need to change the renormalization scheme.
3. Extra: the minimal subtraction scheme
For this scheme, all $a_{i}$s in Eq. $(0)$ are zero. So that, Eq. $(1)$ now is
$$
D^{-1}(p^{2}) = p^{2} - m^{2} - \Sigma (p^{2}, \alpha , \mu),
$$
where $\alpha$ is the set of other couplings which are present in theory (here they are couplings for cubic, quartic terms).
For $p^{2} = m_{\text{pole}}^{2}$
$$
D^{-1}(m_{\text{pole}}^{2}) = m_{\text{pole}}^{2} - m^{2} - \Sigma (m_{\text{pole}}^{2}(m^{2}, \alpha , \mu), \alpha, \mu) = 0,
$$
or at the lowest order of perturbation theory
$$
m_{\text{pole}}^{2} = m^{2} + \Sigma (m^{2}, \alpha , \mu)
$$
In this scheme $\Sigma$ depends on $\mu$ explicitly. But the $m_{\text{pole}}$ doesn't depend on it, so that we come to the statement that $m^{2}$ and $\alpha$ depend on $\mu$.
Best Answer
At least in a model with only one species, the mass is (inversely) related to the correlation length, so one way to build intuition about mass renormalization is to think about it in terms of how the interaction term affects the correlation length. This is meant to address the question "why do you need to renormalize the mass" in a relatively straightforward, mathematically clear way.
To be specific, consider the $\phi^4$ model. After replacing continuous $D$-dimensional space with a finite lattice to make everything mathematically unambiguous, the Hamiltonian may be written $$ H = \frac{b^D}{2}\sum_\mathbf{x}\left(\dot\phi^2(\mathbf{x})+\sum_\mathbf{b}\left(\frac{\phi(\mathbf{x}+\mathbf{b})-\phi(\mathbf{x})}{b}\right)^2+\mu\phi^2(\mathbf{x})+\frac{\lambda}{12}\phi^4(\mathbf{x})\right) $$ where $\mathbf{x}$ is a lattice site, $\mathbf{b}$ is a lattice basis vector, and $b$ is the lattice spacing. The sum over $\mathbf{x}$ is a lattice version of an integral over all space, and the term with the sum over $\textbf{b}$ is a lattice version of the gradient term $(\nabla\phi(\mathbf{x}))^2$. The overhead dot on $\dot\phi$ denotes the time-derivative of $\phi$ (in the Heisenberg picture), as usual.
Now, here's the intuition. First suppose that $\lambda=0$. In this case, we know that the physical mass $m$ is related to the coefficient $\mu$ by $\mu=m^2$. The kinetic term, the only term that connects different lattice sites, is responsible for the fact that correlation function $$ \langle 0|\phi(\mathbf{x})\phi(\mathbf{y})|0\rangle - \langle 0|\phi(\mathbf{x})|0\rangle\,\langle 0|\phi(\mathbf{y})|0\rangle $$ is non-zero for $\mathbf{x}\neq\mathbf{y}$. After re-scaling the field and the time-parameter to put the $\lambda=0$ Hamiltonian in the form $$ H = \frac{b^D}{2}\sum_\mathbf{x}\left(\dot\phi^2(\mathbf{x})+\sum_\mathbf{b}\left(\frac{\phi(\mathbf{x}+\mathbf{b})-\phi(\mathbf{x})}{mb}\right)^2+\phi^2(\mathbf{x})\right), $$ we see that the correlation length is necessarily determined by the combination $mb$, so the correlation length must be $1/m$ in units of the lattice spacing $b$.
Now suppose that $\lambda>0$ and $\mu=0$. Although the 2-point correlation function cannot be computed in closed from any more, the same scaling argument indicates that the correlation length is determined by the combination $b\sqrt{\lambda}$. The relationship between correlation and physical mass (which is related to identifying the physical mass with the pole in the propagator) then tells us that the physical mass must be non-zero in this case, even though $\mu=0$. In other words, when $\mu=0$, the physical mass is induced entirely by the coupling term.
To see what happens when $\mu$ and $\lambda$ are both non-zero, choose any value $\lambda\geq 0$ and consider how $\mu$ must be tuned to make the correlation length infinite (which corresponds to zero physical mass). If $\lambda=0$, then we know that the choice $\mu=0$ makes the correlation length infinite. We just saw that if $\lambda>0$, then the correlation length is finite if $\mu=0$, so to make the correlation length infinite again we must choose $\mu<0$ to offset the effect of the interaction term. Letting $\mu_c(\lambda)$ denote this special (negative) value of $\mu$ that makes the correlation length infinite, this says that if $\lambda>0$, then choosing $\mu>\mu_c(\lambda)$ will give a non-zero physical mass (that is, a finite correlation length), even if $\mu$ is still negative!
By the way, choosing $\mu<\mu_c(\lambda)$ gives spontaneous symmetry breaking (of the discrete $\phi\rightarrow -\phi$ symmetry).
This whole picture is confirmed by numerical calculations, some of which can be found in Luscher and Weisz (1987), "Scaling laws and triviality bounds in the lattice $\phi^4$ theory (I). One-component model in the symmetric phase", Nuclear Physics B 290: 25-60, and some in Hasenbusch (1999), "A Monte Carlo study of leading order scaling corrections of $\phi^4$ theory on a three dimensional lattice", https://arxiv.org/abs/hep-lat/9902026.