I think the situation is similar to that in algebra. In elementary school, you learned that $1+1=2$. It was kinda obvious, right? In rigorous advanced algebra, however, you first have to define “$1$”, “$2$”, “$+$” and then you must prove that $1+1=2$.
Similarly, probability theory at an undergraduate level uses informal but intuitively sound notions when introducing the basics, and how those foundations are built is largely left unsaid, presumably because the focus at this level ought to be on more interesting topics that rely on these basics, such as combinatorics, distribution theory, statistics, practical applications, and so forth.
Only at a more advanced level do you realize that the foundations of probability theory are essentially those of measure theory under the special assumption that the measure of the whole space is normalized to one. The constructions and results from measure theory help you build a rigorous and consistent theory of what events and probabilities really are. The point is that at this higher level there are no loose ends left, and the informal concepts you were accustomed to during your undergraduate training (and accepted without many reservations, since they felt intuitively right) are placed on rock-solid theoretical ground.
I learnt about the Radon–Nikodym (RN) derivative from "Real Analysis" by Folland, and would advise you to check it out there (Chapter 3), as it may answer your upcoming questions. In particular, Theorem 3.5 answers your Q1. It states that
If $\nu$ is a finite signed measure and $\mu$ is a positive measure, then $\nu\ll \mu$ iff for any $\varepsilon > 0$ there exists $\delta > 0$ such that $\mu(E)<\delta$ implies $|\nu(E)|<\varepsilon$ for any measurable $E$.
Now, if $\mu$ is our probability measure and $F$ is the corresponding CDF, then choosing $E = \bigcup_{k=1}^n(t_k,t_{k+1}]$ shows that $\mu\ll \lambda$ implies that $F$ is absolutely continuous (as a function). Here $\lambda$ denotes the Lebesgue measure.
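To spell out why this choice of $E$ works: for disjoint intervals $(t_k,t_{k+1}]$ we have
$$
\mu(E) = \sum_{k=1}^n \bigl(F(t_{k+1}) - F(t_k)\bigr), \qquad \lambda(E) = \sum_{k=1}^n (t_{k+1}-t_k),
$$
so the $\varepsilon$-$\delta$ condition of Theorem 3.5 becomes: $\sum_{k=1}^n (t_{k+1}-t_k) < \delta$ implies $\sum_{k=1}^n \bigl(F(t_{k+1}) - F(t_k)\bigr) < \varepsilon$, which is precisely the definition of absolute continuity of the function $F$.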
Regarding Q2: the density is defined relative to another measure. Whatever measure $Q$ you take, it always has a density w.r.t. itself - please tell me if this fact is not clear to you. Furthermore, indeed if $P = \lambda$ and $H = \delta_0$, then $Q$ does not admit a density w.r.t. $P$; however, it clearly admits a density w.r.t. $Q$ itself.
In probability theory it may be confusing that most of the time we are talking about densities w.r.t. $\lambda$, so that we do not even mention $\lambda$ and just say "density". For that reason you may forget that we are talking about a relative density, as there is no "absolute" density, at least in measure theory. There, the density is exactly the RN derivative, hence it requires specifying the "denominator" measure.
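As a toy numerical illustration of this relativity (the specific masses below are my own hypothetical example, not from the question): a Bernoulli measure on $\{0,1\}$ has different densities depending on the reference measure, and density identically $1$ w.r.t. itself.

```python
# A Bernoulli(0.3) measure Q on {0, 1}, given by its point masses.
q = {0: 0.7, 1: 0.3}

# Density w.r.t. the counting measure: just the point masses themselves.
dens_counting = {x: q[x] for x in q}

# Density w.r.t. Bernoulli(0.5): the RN derivative is the ratio of masses.
p = {0: 0.5, 1: 0.5}
dens_p = {x: q[x] / p[x] for x in q}

# Density of Q w.r.t. Q itself is identically 1.
dens_self = {x: q[x] / q[x] for x in q}

print(dens_counting)
print(dens_p)
print(dens_self)  # {0: 1.0, 1: 1.0}
```

The same measure, three different "densities" - which is exactly why the reference measure must be specified.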
Q3: I am not sure what exactly you mean here. If $\nu\ll\mu$, we can define the KL divergence by
$$
D(\nu,\mu) := \int \log\left(\frac{\mathrm d\nu}{\mathrm d\mu}\right)\mathrm d\nu = \int \frac{\mathrm d\nu}{\mathrm d\mu}\log\left(\frac{\mathrm d\nu}{\mathrm d\mu}\right)\mathrm d\mu \tag{1}
$$
and this is defined purely in terms of measures, so it does not depend on their representation through densities.
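For a concrete check of the two forms in $(1)$, here is a small sketch with two hypothetical discrete measures on three points, taken w.r.t. the counting measure, where the RN derivative is just the ratio of point masses:

```python
import math

# Two probability measures on {0, 1, 2}, given by their masses
# w.r.t. the counting measure (hypothetical example values).
nu = [0.2, 0.5, 0.3]
mu = [0.4, 0.4, 0.2]   # nu << mu, since mu puts positive mass everywhere

# First form of (1): integrate log(dnu/dmu) against nu.
kl_form1 = sum(n * math.log(n / m) for n, m in zip(nu, mu))

# Second form of (1): integrate (dnu/dmu) * log(dnu/dmu) against mu.
kl_form2 = sum(m * (n / m) * math.log(n / m) for n, m in zip(nu, mu))

print(kl_form1, kl_form2)  # the two integrals agree
```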
Regarding your title question, please check out this and that. I expect you'll reconsider and/or reformulate your question after reading this answer, unless everything has already become clear to you. Just come back and we can proceed. And I encourage you to check out Folland's book in general.
Added: let's agree on the following - since there is some confusion regarding the notion of density, we will only use the terms "function" and "RN derivative". We can define the KL divergence $D(\nu,\mu)$ for measures $\nu\ll\mu$ as in $(1)$. We can also fix some reference measure $\psi$ and define a similar map for functional arguments, that is, let
$$
\bar D_\psi(g,f):= \int g \log\left(\frac gf\right)\mathrm d\psi \tag{1'}
$$
For this to be well defined, we assume that
$$
\{f = 0\} \subseteq \{g = 0\} \tag{2}.
$$
Now, these two notions are related as follows: $\bar D_\psi(g,f) = D(\bar\nu,\bar\mu)$ where
$$
\bar\nu(\cdot) := \int_{(\cdot)}g\,\mathrm d\psi\qquad \bar\mu(\cdot) := \int_{(\cdot)}f \,\mathrm d\psi
$$
and of course $(2)$ implies that $\bar\nu\ll\bar\mu$. So indeed, to talk about the set $\mathcal G$ of all functions $g$, you need to assume that every function in this set satisfies $(2)$. If you don't assume that, the KL divergence is infinite for those $g$ that violate $(2)$ (you integrate the $\log$ of infinity), so it is certainly greater than $\epsilon$.
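A quick sketch of why $(2)$ matters, with $\psi$ the counting measure on four points (the specific masses are my own hypothetical choices):

```python
import math

# Reference measure psi = counting measure on {0, 1, 2, 3}.
# f vanishes at the last point; condition (2) requires g to vanish there too.
f = [0.5, 0.3, 0.2, 0.0]
g_ok  = [0.1, 0.6, 0.3, 0.0]   # satisfies {f = 0} subset of {g = 0}
g_bad = [0.1, 0.5, 0.3, 0.1]   # puts mass where f = 0: violates (2)

def kl_bar(g, f):
    # D_bar_psi(g, f) = sum of g * log(g/f), with the convention
    # 0 * log(0/.) = 0; any g > 0 on {f = 0} makes the divergence infinite.
    total = 0.0
    for gi, fi in zip(g, f):
        if gi == 0:
            continue
        if fi == 0:
            return math.inf
        total += gi * math.log(gi / fi)
    return total

print(kl_bar(g_ok, f))   # finite
print(kl_bar(g_bad, f))  # inf
```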
Let me also summarize some relations in the one-dimensional case. The basic object is the probability measure $\mu:\mathscr B(\Bbb R) \to [0,1]$. Its CDF is a function on the real numbers $F_\mu:\Bbb R\to [0,1]$ given by $F_\mu(x):=\mu((-\infty,x])$; hence, to each probability measure there corresponds a unique CDF. Conversely, from any function satisfying a couple of properties we can construct a probability measure whose CDF is that function, see e.g. here. Thus, probability measures on the real line and CDFs are in one-to-one correspondence; only the former is a function of sets, whereas the latter is a function of real numbers. If $\mu \ll \lambda$, then its RN derivative $f_\mu := \frac{\mathrm d\mu}{\mathrm d\lambda}:\Bbb R \to \Bbb R_+$ is commonly referred to as the density function of $\mu$; however, it would be more formal to say that $f_\mu$ is the density of $\mu$ w.r.t. $\lambda$. Notice that
$$
F_\mu(x) = \int_{-\infty}^x \mu(\mathrm dt) = \int_{-\infty}^x f_\mu(t)\, \lambda(\mathrm dt),
$$
hence if $\mu\ll\lambda$, then by the Lebesgue differentiation theorem, $F'_\mu(x)$ exists $\lambda$-a.e. and $F'_\mu(x) = f_\mu(x)$ ($\lambda$-a.e.). For example, if $F_\mu\in C^1(\Bbb R)$, then $F'_\mu$ is a version of the RN derivative $\frac{\mathrm d\mu}{\mathrm d\lambda}$, and by changing $F'_\mu$ on $\lambda$-null sets in any way we can obtain other versions of that RN derivative (since the RN derivative is only defined uniquely $\lambda$-a.e.). In fact, in most practical cases we compute RN derivatives using ordinary derivatives; there are not many other methods to compute them.
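For instance, numerically differentiating the standard normal CDF recovers its density, illustrating $F'_\mu = f_\mu$ in this smooth case; a minimal sketch using only the standard library:

```python
import math

# Standard normal: CDF via the error function, density in closed form.
def F(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def f(x):
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

# A central-difference derivative of the CDF approximates the density,
# i.e. a version of the RN derivative dmu/dlambda.
h = 1e-6
for x in (-1.5, 0.0, 0.7):
    numeric = (F(x + h) - F(x - h)) / (2.0 * h)
    print(x, numeric, f(x))  # numeric derivative matches the density
```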
I would say that conditioning and independence are distinctly probabilistic notions; expectation, on the other hand, is used a lot in measure theory as well, under the name of the Lebesgue integral.
The point is that probability as a science used to be perhaps even closer to physics than to math, being based on experiments. It became the classical Probability Theory (PT) when it was axiomatized in the first half of the 20th century by means of Measure Theory (MT). So MT is clearly the mathematical basis for classical PT, and in that sense you can consider PT to be a subdiscipline of MT.
There are two points to mention, though.
There is an algebraic approach to probability which starts with algebras of random variables and defines a linear functional on such algebras - the expectation. Shall we say, then, that Probability Theory is a subdiscipline of Abstract Algebra?
In both cases you start with something empirical: probability, random variables, etc. You wish them to satisfy certain properties, and by this you impose a particular structure: either measure-theoretical or algebraic. However, there is additional meaning in the results you obtain. For example, the Law of Large Numbers and the Central Limit Theorem are obtained using purely measure-theoretical methods, but these results are important precisely for Probability Theory. The interpretation of MT via probabilistic ideas provides additional intuition about "how it should be" and helps you understand "what it means".
That is entirely an opinion, one which I've chosen for myself. Hope that it helps.