Yes, the infrared divergences are guaranteed to be present in any meaningful computational scheme for a loop diagram, including dimensional regularization, because their existence is an objective question.
In dim reg, UV divergent integrals have to be calculated in $4-\epsilon$ dimensions, so we imagine that the dimension is lower than the dimension of interest (or lower than the dimension where the integral first becomes divergent).
Similarly, IR divergences become worse in lower dimensions, so dim reg means that we imagine that we work in $4+\epsilon$ dimensions, or above the dimension where they become divergent. If both divergences are present, we must separate the integration region into IR and UV portions and treat them independently, each in a dimension moved away from four in the opposite direction.
Dim reg should generally lead to simple poles only, $1/\epsilon$. The exponent of $\epsilon$, which is generally $-1$ for all divergent integrals in dim reg, isn't the same exponent that determines the "degree" of the divergence as measured in momentum space.
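As an illustration of the simple pole, take the standard one-loop scalar integral (a textbook result) $\int \frac{d^Dk}{(2\pi)^D}\frac{1}{(k^2+m^2)^2} = \frac{\Gamma(2-D/2)}{(4\pi)^{D/2}}(m^2)^{D/2-2}$; with $D = 4-\epsilon$ the whole divergence sits in $\Gamma(\epsilon/2)$. A quick sympy sketch of its expansion:

```python
# Sketch: the 1/epsilon pole of a dim-reg integral. With D = 4 - eps, the
# logarithmically UV-divergent integral above is proportional to
# Gamma(2 - D/2) = Gamma(eps/2), whose Laurent expansion around eps = 0
# exhibits a *simple* pole, 2/eps - EulerGamma + O(eps).
from sympy import symbols, gamma, EulerGamma

eps = symbols('epsilon', positive=True)

expansion = gamma(eps / 2).series(eps, 0, 1)
print(expansion)
```

The coefficient of the $1/\epsilon$ pole is $2$, and no higher poles appear, matching the claim that a single dim-reg loop produces at most a simple pole.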
The UV and IR divergences are treated analogously in dim reg except that we imagine the dimension to be lowered or raised, respectively. However, their physical interpretations are completely different. UV divergences imply the need to complete the definition of the theory at short distances. IR divergences exist even if the theory is completely well-defined and they encode genuine physics. They seem pathological, but the blame lies with the question that was asked: if one asks a more physical question, about something that can actually be measured, the IR divergences cancel. So the presence of IR divergences doesn't mean that we should add counterterms with parameters whose values will matter.
The right treatment of the theory – renormalization – requires the cancellation of the $1/\epsilon$ terms in dim reg: they're analogous to $\log \Lambda$ terms in some cutoff treatments and have the same interpretation. So we're adding counterterms. You say that "all evidence of counterterms vanishes in the final answer" but this is a misleading interpretation. The more correct comment is that "all evidence of counterterms and divergent loop contributions constructed from the original interaction vertices vanishes in the final answer". They cancel. They're equally important. It's a mistake to forget one of them because it would lead you to a completely incorrect belief that the counterterms are unnecessary. They're necessary to cancel the divergences that were there to start with, before counterterms were added. The coefficients of the counterterms have an infinite part that cancels but the finite remainder survives and produces a genuine finite parameter.
Dim reg is a regularization that doesn't introduce any cutoff scale $\Lambda$ – indeed, divergent terms in the limit $\Lambda\to\infty$ are exactly those that we wanted to avoid by using a different scheme – but its results include $1/\epsilon$ terms that are analogous to $\log\Lambda$ in a cutoff approach. The power law divergent diagrams $\Lambda^n$ from a cutoff approach typically lead to convergent integrals in dim reg but their presence in the cutoff approach still implies a sensitivity on the detailed physics at a high-energy scale.
While dim reg has no $\Lambda$, it needs the scale $\mu$ to be added to the integrals over the momentum space for simple dimensional reasons (right units). We don't need to imagine that $\mu$ is as high as $\Lambda$ or that it is a cutoff scale; it may be a scale close to the mass scales where interesting phenomena occur in the theory.
Regardless of the choice of the regularization technique, the right way to think about fine-tuning etc. is to think about the sensitivity of the low-energy physics on the detailed values chosen for the parameters at the high-energy scale. This dependence on dynamical parameters in the high-energy theory is also the correct invariant explanation of "what is actually wrong about the UV divergences". It's not the infinities themselves; what's bad is the infinite number of unknown parameters one may choose for the high-scale theory in a non-renormalizable theory.
I believe that the paragraphs above answer (almost?) all the questions by the OP but I chose not to reproduce the precise wording because it sometimes seemed confusing to me, encouraging the answer to be phrased in a framework that isn't quite right.
The key thing is that you need to be working with canonically normalized fields in order to use the power counting arguments.
Let's expand GR around flat space
\begin{equation}
g_{\mu\nu} = \eta_{\mu\nu} + \tilde{h}_{\mu\nu}
\end{equation}
The reason for the tilde will become clear in a second. So long as $\tilde{h}$ is "small" (or more precisely so long as the curvature $R\sim (\partial^2 \tilde{h})$ is "small"), we can view GR as an effective field theory of a massless spin two particle living on flat Minkowski space.
Then the Einstein Hilbert action takes the schematic form
\begin{equation}
S_{EH}=\frac{M_{pl}^2}{2}\int d^4x \sqrt{-g} R = \frac{M_{pl}^2}{2} \int d^4x \ (\partial \tilde{h})^2 + (\partial \tilde{h})^2\tilde{h}+\cdots
\end{equation}
where $M_{pl}\sim 1/\sqrt{G}$ in units with $\hbar=c=1$. $M_{pl}$ has units of mass. In this form you might think that the interaction $(\partial \tilde{h})^2 \tilde{h}$ comes with a scale $M^2_{pl}$ with a positive power. However, this is too fast: all the QFT arguments you have seen have assumed that the kinetic term had a coefficient of $-1/2$, not $M_{pl}^2$. Relatedly, given that $M_{pl}$ has units of mass and the action has units of $(\text{mass})^4$, the field $\tilde{h}$ is dimensionless, so it is clearly not normalized the same way as the standard field used in QFT textbooks.
Now classically, the action is only defined up to an overall constant, so we are free to think of $M_{pl}^2$ as being an arbitrary constant. However, in QFT, the action appears in the path integral $Z=\int D\tilde{h}e^{iS[\tilde{h}]/\hbar}$ (note the notational distinction between $\tilde{h}$ and $\hbar$). Thus the overall constant of the action is not a free parameter in QFT, it is fixed and has physical meaning. Alternatively, you have to remember that the Einstein Hilbert action will ultimately be coupled to matter; when we do that, the scale $M_{pl}$ sitting in front of $S_{EH}$ will not multiply the matter action, and so $M_{pl}$ sets the relative scale between the gravitational action and the matter action.
The punchline is that we can't simply ignore the overall scale $M_{pl}^2$, it has physical meaning (ie, we can't absorb $M_{pl}$ into an overall coefficient multiplying the action). On the other hand, we want to put the action into a "standard" form where the overall scale isn't there, so we can apply the normal intuition about power counting. The solution is to work with a "canonically normalized field" $h$, related to $\tilde{h}$ by
\begin{equation}
\tilde{h}_{\mu\nu} = \frac{h_{\mu\nu}}{M_{pl}}
\end{equation}
Then the Einstein Hilbert action takes the form
\begin{equation}
S_{EH} = \int d^4 x \ (\partial h)^2 + \frac{1}{M_{pl}} (\partial h)^2 h + \cdots
\end{equation}
In this form it is clear that the interactions of the form $(\partial h)^2 h$ have a "coupling constant" $1/M_{pl}$ with dimensions 1/mass, which is non-renormalizable by power counting in the usual way.
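The rescaling can be checked symbolically. A minimal sympy sketch, treating $(\partial h)$ as a single symbol and keeping the $1/2$ from the first, non-canonical form (purely schematic; indices and tensor structure suppressed):

```python
# Sketch (schematic, not a real GR computation): substituting
# h_tilde = h / M_pl into the Einstein-Hilbert terms
#   (M_pl^2/2) [ (d h_tilde)^2 + (d h_tilde)^2 h_tilde + ... ]
# should yield a canonical kinetic term and a 1/M_pl cubic coupling.
from sympy import symbols, Rational, simplify

M, dh, h = symbols('M_pl dh h', positive=True)  # dh stands in for (partial h)

dh_tilde = dh / M   # derivative of h_tilde = h / M_pl
h_tilde = h / M

kinetic = Rational(1, 2) * M**2 * dh_tilde**2
cubic = Rational(1, 2) * M**2 * dh_tilde**2 * h_tilde

# Canonical kinetic term: all powers of M_pl cancel.
assert simplify(kinetic - Rational(1, 2) * dh**2) == 0
# Cubic interaction: a single surviving factor of 1/M_pl.
assert simplify(cubic - Rational(1, 2) * dh**2 * h / M) == 0
```

The surviving $1/M_{pl}$ in the cubic term is exactly the negative-mass-dimension coupling that makes the theory non-renormalizable by power counting.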
Superficial degree of divergence
The first step is to get a sense of what the divergent graphs of our theory are. "Graph" in the following means a 1-particle-irreducible Feynman graph for our theory, the amplitude of which we want to compute in momentum space.
Given such a graph, each internal link of type $i$ contributes to the amplitude a propagator that, in the UV limit, behaves like $$\frac{1}{k^{D - 2 \sigma _i}}$$ where $D$ is the spacetime dimension and $\sigma _i$ is the mass dimension$^1$ of the corresponding field ($\hbar$ and $c$ being taken to be dimensionless). For example, for a scalar field, the action (mass dimension $0$) goes like $\int \!d^{(D)} x\, (\partial \phi )^2$, from which we can deduce that the mass dimension of the field is $\sigma = \frac{D}{2} - 1$ (length/time has mass dimension $-1$), while the propagator behaves as $\frac{1}{k^2}$.
Each internal link also contributes an $\int d^{(D)}k$. But each internal vertex contributes a Dirac $\delta$ imposing momentum conservation, which kills some of these integrals, leaving us to integrate over only one free momentum per loop in the graph.
In the simplest case (ie the interaction terms don't involve any field derivative or anything unusual), each vertex of type $a$ just contributes the corresponding coupling constant $c_a$ (and the already mentioned Dirac $\delta$).
So if we put a cutoff $\Lambda$ on all integrals, the UV behavior of our amplitude will be something like: $$\int ^{\Lambda } d^{(LD)}k\, k^{\sum _i I_i (2\sigma _i - D)} \sim \Lambda ^{LD + \sum _i I_i (2\sigma _i - D)}$$ where $L$ denotes the number of loops and $I_i$ the number of internal links of type $i$.
Doing a bit of graph combinatorics, we have the following relation: $$\forall i,\, E_i + 2 I_i = \sum _a V_a n_{a,i}$$ where $E_i$ is the number of external legs of type $i$ and $V_a$ is the number of vertices of type $a$, the latter having $n_{a,i}$ legs of type $i$ (indeed, each internal link must be attached at both its ends to some vertex, while the external legs have one end attached). And we also have: $$I = V + L -1$$ where $V = \sum _a V_a$ is the total number of vertices and $I = \sum _i I_i$ is the total number of internal links. To prove this, we start from a trivial graph, say two external legs connected to a 2-valent vertex (which obviously satisfies the relation), and we either:
add a new vertex, by dividing an existing internal link or external leg in two, which adds $1$ vertex and $1$ internal link;
add a new external leg, attaching it to an existing vertex, which doesn't change the number of loops, vertices or internal links;
add a new internal link by connecting 2 existing vertices, which adds $1$ loop and $1$ internal link.
Putting everything together, we get a UV behavior for our graph that goes like $\Lambda ^S$ where: $$\boxed{S = D - \sum _i E_i \sigma _i - \sum _a V_a \tau _a}$$ with $\tau _a \mathrel{\mathop:}= D - \sum _i n_{a,i} \sigma _i$. Note that the contribution to the action of the interaction of type $a$ is $\int \!d^{(D)} x\, c_a \prod _i \phi _i^{n_{a,i}}$, so $\tau _a$ is nothing but the mass dimension of $c_a$. In particular, if $S > 0$, the amplitude is going to diverge when we try to send the cutoff $\Lambda$ to $+\infty$.
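The boxed formula is easy to evaluate mechanically. A minimal sketch in Python (the function names are illustrative, not from any library):

```python
# Sketch of the boxed formula S = D - sum_i E_i sigma_i - sum_a V_a tau_a,
# with tau_a = D - sum_i n_{a,i} sigma_i (the mass dimension of c_a).

def coupling_dimension(D, ns, sigmas):
    """Mass dimension tau_a of a coupling whose vertex has n_{a,i} legs of each type."""
    return D - sum(n * s for n, s in zip(ns, sigmas))

def superficial_degree(D, E, sigmas, vertices):
    """S for a graph with E_i external legs of each field type.

    vertices: list of (V_a, [n_{a,i}, ...]) pairs, one per vertex type.
    """
    S = D - sum(e * s for e, s in zip(E, sigmas))
    for V_a, ns in vertices:
        S -= V_a * coupling_dimension(D, ns, sigmas)
    return S

# phi^4 in D = 4: sigma = D/2 - 1 = 1 and tau = 4 - 4*1 = 0, so S = 4 - E
# independently of the number of vertices.
sigma = [1]
print(superficial_degree(4, [2], sigma, [(3, [4])]))  # → 2 (2-point function)
print(superficial_degree(4, [4], sigma, [(5, [4])]))  # → 0 (4-point function)
```

The $\phi^4$ values $S = 2$ and $S = 0$ are exactly the quadratic and logarithmic divergences discussed in the counter-terms section below.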
We can check that the expression for $S$ makes sense with some quick and dirty dimensional analysis: to get an n-point function $\left\langle \Pi _i \phi _i^{E_i} \right\rangle$ (mass dimension $\sum _i E_i \sigma _i$), we need to multiply the graph amplitude by the on-shell propagator on each external leg (dimension $\sum _i E_i (2\sigma _i - D)$), include a Dirac $\delta$ for the overall momentum conservation ($\delta (\sum k)$, dimension $-D$) and perform an inverse Fourier transform for each external leg ($\int \!d^{(D)}k \dots$, dimension $\left(\sum _i E_i\right) D$), so the amplitude itself must have dimension $D - \sum _i E_i \sigma _i$. As it contains one factor of $c_a$ per vertex of type $a$, it must behave in the UV limit as $$\Pi _a \left(c_a\right)^{V_a} \, \Lambda ^{D - \sum _i E_i \sigma _i - \sum _a V_a \tau _a}$$ (at least for divergent graphs, the dependence on the incoming momenta $k$ should be washed out in the limit $k \ll \Lambda$).
If $S = 0$, it is safe to assume some kind of logarithmic divergence (as in $\int \!dk\, \frac{1}{k}$), although we cannot a priori tell whether it's $\ln \Lambda$, $\ln^2 \Lambda$,... (the same is true for graphs with $S > 0$ of course: they may well be in fact $\Lambda ^S \, \ln \Lambda$ divergent for example).
Finally, it is important to stress that even a graph with $S < 0$ could be divergent, eg because it contains a divergent subgraph. That's why $S$ is called the superficial degree of divergence. While divergent subgraphs would contribute positively to the overall $\Lambda$ power of the graph, they might be hidden by a sufficiently negative power of $\Lambda$ in the rest of the graph.
Counter-terms
If $S<0$ for all graphs, great! Our theory is probably fine$^2$ as it is, and we can just start using it straight away. Otherwise, we need to do something about these divergent Feynman graphs. Let's see how it works for $\phi ^4$ in $D = 3+1 = 4$ dimensions. The dimension of the field is $\sigma = 1$ and the coupling constant has dimension $\tau = 0$, so $S = 4 - E$. Since $E = 4 V - 2 I$ (see above), only graphs with an even number of external legs can be constructed in this theory.
Ignoring divergent subgraphs for a moment, graphs with $2$ external legs will be divergent as $\Lambda ^2$, so if we resum all such graphs we get an effective 2-valent vertex
$$\text{(sum of 2-leg graphs)} = \text{finite part} + \alpha \Lambda ^2$$
We can compensate the divergence by adding to the Lagrangian a term $-\alpha \Lambda ^2 \, \phi ^2$, ie mass renormalization, which defines a corresponding new counter-term vertex. Resumming all graphs again, but now including the extra ones that can be constructed using this new vertex, we get a renormalized 2-valent vertex
$$\text{(sum of 2-leg graphs)} + \text{(counter-term graphs)} = \text{finite part}$$
where the counter-term vertex has also been inserted in all possible ways inside the graphs that contribute to the sum: this takes care of their divergent 2-valent subgraphs.
Note that there is some ambiguity in the splitting "finite part" versus "divergent part" (given a diverging function $f(\Lambda )$ there is no canonical way of defining its "convergent part", especially since we do not want the result to be sensitive to the details of how we impose the cutoff), so at this stage we only know the renormalized 2-valent vertex up to some constant. To fix this constant, we need to measure the 2-point function for some incoming momenta: this determines the physical, renormalized mass of our field, and knowing it we can then compute the 2-point function for all possible incoming momenta.
Next we do the same for the effective 4-valent vertex. This time, the superficial divergence is logarithmic, let's say $\ln \Lambda$, so we need to add to the Lagrangian a term $-\beta \ln \Lambda \, \phi ^4$, ie coupling constant renormalization. Again, the ambiguity in the divergence removal will have to be fixed by measuring the physical value of the coupling constant (at some energy).
And that's it! All graphs with a higher number of external legs are convergent, save for divergent subgraphs which will all be cured by the just-added counter-terms. Indeed, for any graph that includes, say, a given divergent 2-valent subgraph, there will be an otherwise identical graph with the counter-term vertex in place of that subgraph, which will exactly cancel out the divergence when resumming all graphs.
Power-counting renormalizability
So what goes wrong in a theory that has one or more coupling constants with negative mass dimension (eg $\phi ^6$ in $D = 4$, having $\tau = -2$)? Then, even if the mass dimensions $\sigma _i$ of the fields are all non-negative (which is normally the case), graphs with an arbitrarily high number of external legs can be made superficially divergent just by stashing into them enough internal vertices with negative $\tau$.
In addition, for a given number of external legs, the degree of superficial divergence can be made arbitrarily high (again by having enough internal vertices), but this would not necessarily be worrying on its own: it would just mean that we would need to add a counter-term with that particular leg structure and a $f(\Lambda )$ divergent prefactor which would be the sum of all possible $\Lambda ^S$-divergences. Still we could a priori fix the ambiguities associated with this particular counter-term with just 1 measurement of the corresponding $n$-point function.
The real trouble is that we need infinitely-many such counter-terms, and infinitely many measurements to fix the corresponding true, physical coupling constants. So the criterion for power-counting renormalizability can be stated as: all coupling constants should have non-negative mass dimension, $\tau _a \geqslant 0$, so that (with positive field dimensions $\sigma _i$) only finitely many leg structures admit superficially divergent graphs.
In a first approximation, the counter-terms that will have to be added to the Lagrangian of a renormalizable theory correspond to those leg structures which admit superficially divergent graphs.
To be honest, I have never seen a fully convincing argument as to why it could not happen that a counter-term with a certain number of legs would, at the same time, magically cure the superficial divergences of graphs with more legs. Especially in a theory where the degree of divergence at $n$ legs can be made arbitrarily high, highly-divergent embedded $n$-valent subgraphs could, on their own, account for the superficial divergence of graphs with arbitrary leg structures. Nevertheless, in Zinn-Justin's "Quantum field theory and critical phenomena", those theories where the degree of divergence can be made arbitrarily high are deemed to be hopelessly divergent while theories that have infinitely many superficially divergent leg structures albeit with a bounded degree of divergence (which can happen if there is both a field and a coupling constant with mass dimension $0$) are said to be "generally non-renormalizable". I guess the rationale here is something along the following lines:
in practice, a counter-term with a certain leg structure is unlikely to cure all divergences of superficially divergent graphs with a different leg structure;
and power-counting is anyway not meant as a definitive, mathematically rigorous proof of (non-)renormalizability, but rather as a quick diagnostic tool, to estimate how seriously ill the theory is.
$^1$ There are exceptions to this rule, but it is good enough for us here, as it holds for scalar and spin $\frac{1}{2}$ fields. See the Zinn-Justin book referred to above for more details.
$^2$ This should not be confused with so-called "super-renormalizability", a terminology reserved for theories that, while having some UV divergences, have only finitely many superficially divergent graphs. This occurs for example if all coupling constants have strictly positive mass dimensions (and all fields have non-negative mass dimension). Then, all superficially divergent graphs must have less than a certain maximal number of vertices (which also puts a bound on the number of external legs, since $E \leqslant \sum _a V_a n_{a}$), and as the number of graphs with a given number of vertices is finite, the number of superficially divergent graphs can then be shown to be finite. On the contrary, $\phi ^4$ in $4$ dimensions is renormalizable but not super-renormalizable because, while all its superficially divergent graphs have either 2 or 4 external legs, there are infinitely many of them (for example, arbitrarily long chains of one-loop bubbles with $2$ or $4$ external legs are all superficially divergent).
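For contrast, a quick sketch of the vertex bound for $\phi^3$ in $D = 4$, a standard super-renormalizable example with $\tau = 4 - 3 = +1$ (illustrative; the listed $(E, V)$ pairs only bound which graphs *can* be superficially divergent, since not every pair is realized by an actual graph):

```python
# Sketch: phi^3 in D = 4 is super-renormalizable. With sigma = 1 and
# tau = 4 - 3 = +1, the superficial degree is S = 4 - E - V, so a graph
# can only be superficially divergent if V <= 4 - E: the number of
# candidate (E, V) pairs, hence of divergent graphs, is finite.
D, sigma = 4, 1
tau = D - 3 * sigma                 # +1: strictly positive mass dimension
divergent = [(E, V) for E in range(1, 5) for V in range(1, 10)
             if D - E * sigma - V * tau >= 0]
print(divergent)  # → [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 1)]
```

Raising the scan range for $V$ changes nothing: every vertex added now *lowers* $S$, the opposite of the $\phi^6$ situation discussed above.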