You might look at Chapter 3 of Devroye, Gyorfi, and Lugosi, A Probabilistic Theory of Pattern Recognition, Springer, 1996. See, in particular, the section on $f$-divergences.
$f$-Divergences can be viewed as a generalization of Kullback--Leibler (or, alternatively, KL can be viewed as a special case of an $f$-Divergence).
The general form is
$$
D_f(p, q) = \int q(x) f\left(\frac{p(x)}{q(x)}\right) \, \lambda(dx) ,
$$
where $\lambda$ is a measure that dominates the measures associated with $p$ and $q$ and $f(\cdot)$ is a convex function satisfying $f(1) = 0$. (If $p(x)$ and $q(x)$ are densities with respect to Lebesgue measure, just substitute the notation $dx$ for $\lambda(dx)$ and you're good to go.)
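For discrete distributions the integral above is just a sum, and the definition translates almost directly into code. Here is a minimal sketch in Python (the helper name `f_divergence` and the handling of zero-probability points are my own choices, not something from the references above):

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(p, q) = sum_x q(x) f(p(x) / q(x)) for discrete distributions
    given as probability vectors over a common support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    # Restrict to the support of q; points with q(x) = 0 would need the
    # usual conventions (0 * f(0/0) = 0, and the limit of f at infinity when p > 0 = q).
    mask = q > 0
    return np.sum(q[mask] * f(p[mask] / q[mask]))
```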
We recover KL by taking $f(x) = x \log x$. We get the (squared) Hellinger distance via $f(x) = (1 - \sqrt{x})^2$, and we get the total-variation or $L_1$ distance by taking $f(x) = \frac{1}{2} |x - 1|$. The latter gives
$$
D_{\mathrm{TV}}(p, q) = \frac{1}{2} \int |p(x) - q(x)| \, dx
$$
Note that this last one always gives you a finite answer (it is bounded by $1$), unlike KL, which can be infinite.
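As a quick illustration, the three choices of $f$ above look like this in code (a sketch that reuses the `f_divergence` helper from the earlier snippet; note that the KL case needs strictly positive $p$, or else the $0 \log 0 = 0$ convention):

```python
import numpy as np

kl        = lambda p, q: f_divergence(p, q, lambda t: t * np.log(t))
hellinger = lambda p, q: f_divergence(p, q, lambda t: (1 - np.sqrt(t))**2)
tv        = lambda p, q: f_divergence(p, q, lambda t: 0.5 * np.abs(t - 1))

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])

print(kl(p, q), hellinger(p, q), tv(p, q))
# tv(p, q) agrees with 0.5 * np.abs(p - q).sum() and never exceeds 1.
```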
In another little book entitled Density Estimation: The $L_1$ View, Devroye argues strongly for the use of this latter distance due to its many nice invariance properties (among others). This latter book is probably a little harder to get a hold of than the former and, as the title suggests, a bit more specialized.
Addendum: Via this question, I became aware that the measure @Didier proposes appears to be known (up to a constant) as the Jensen-Shannon divergence. If you follow the link to the answer provided in that question, you'll see that the square root of this quantity is actually a metric and was previously recognized in the literature to be a special case of an $f$-divergence. I found it interesting that we seem to have collectively "reinvented" the wheel (rather quickly) via the discussion of this question. The interpretation I gave to it in the comment below @Didier's response was also previously recognized. All around, kind of neat, actually.
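For what it's worth, this quantity is readily available in SciPy: `scipy.spatial.distance.jensenshannon` returns the square root of the Jensen-Shannon divergence, i.e. the metric version mentioned above. A small check (the example vectors are arbitrary):

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

p = np.array([0.5, 0.4, 0.1])
q = np.array([0.3, 0.3, 0.4])

# jensenshannon returns the *square root* of the JS divergence,
# which is the quantity that is actually a metric.
print(jensenshannon(p, q, base=2))  # with base-2 logs the value lies in [0, 1]
```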
The KL-divergence is typically used in information-theoretic or Bayesian settings, to measure the information change between distributions before and after applying some inference, for example. It's not a distance in the typical (metric) sense, because it lacks symmetry and fails the triangle inequality, and so it's used in places where the directionality is meaningful.
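The asymmetry is easy to check numerically; for instance, `scipy.stats.entropy(p, q)` computes $KL(p \| q)$ for discrete distributions (the vectors below are arbitrary):

```python
from scipy.stats import entropy  # entropy(p, q) gives KL(p || q)

p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]

print(entropy(p, q), entropy(q, p))  # the two directions generally differ
```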
The KS-distance is typically used in the context of a non-parametric test. In fact, I've rarely seen it used as a generic "distance between distributions"; for that purpose, the $\ell_1$ distance, the Jensen-Shannon distance, and other distances are more common.
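To illustrate the typical (testing) use, here is a minimal sketch of the two-sample Kolmogorov-Smirnov test in SciPy; the statistic is exactly the KS distance between the two empirical CDFs (the sample sizes and distributions below are arbitrary):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=500)
y = rng.normal(0.2, 1.0, size=500)

res = ks_2samp(x, y)
print(res.statistic, res.pvalue)  # KS distance between empirical CDFs, and the test p-value
```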
Best Answer
The Kullback-Leibler divergence is not a distance: it is not even symmetric, and you could (and most likely will) get results that differ by orders of magnitude depending on what your reference measure is.
The proper way of answering your question is to use the Wasserstein distance, in particular Wasserstein-2. This is a genuine distance, defined in the setting of optimal transport theory. The Kantorovich formulation of optimal transport is shown below.
I will first detail the theoretical idea, then present one practical solution (not the only one) that can be easily implemented.
The basic idea is the following: given two measures $\mu$ and $\nu$, you can quantify how 'close' they are by measuring how much kinetic energy it would take to deform one into the other.
In other words, if you had to move the mass from $\mu$ to $\nu$ by hand, and the cost of transport of one unit of mass is proportional to the distance squared, you are trying to minimize the total cost of transport.
Call $\pi(x,y)$ the amount of mass moved from $x$ to $y$. Then your objective function is
$$ \min_{\pi \geq 0} \iint |x-y|^2 \pi(x,y) \, dx \, dy $$ subject to the constraints $$ \int \pi(x,y) \, dx = \nu(y) \quad \text{(all the mass at $y$ comes from somewhere)} $$ $$ \int \pi(x,y) \, dy = \mu(x) \quad \text{(all the mass at $x$ goes somewhere)} $$ How do we solve this in practice? Well, this is merely a linear programming problem in infinite dimensions.
For instance, if you have iid data samples $(x_i)_{i=1,\ldots,n}$ and $(y_j)_{j=1,\ldots,m}$, you are seeking an assignment $\pi_{ij}$ that minimizes the transport cost. One way is to solve the finite-dimensional linear program $$ \min_{\pi_{ij} \geq 0} \sum_{i,j} |x_i-y_j|^2 \pi_{ij} $$ subject to $$ \sum_i \pi_{ij} = \frac{1}{m} $$ $$ \sum_j \pi_{ij} = \frac{1}{n} $$ For more references on computational optimal transport, see https://arxiv.org/pdf/1803.00567.pdf
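Here is a minimal sketch of that finite-dimensional linear program using `scipy.optimize.linprog` (the function name and the toy data are my own choices; for anything beyond small sample sizes you would want a dedicated optimal-transport solver, see the reference above):

```python
import numpy as np
from scipy.optimize import linprog

def wasserstein2_empirical(x, y):
    """Squared Wasserstein-2 distance between two empirical measures,
    obtained by solving the finite-dimensional LP above."""
    x = np.atleast_2d(x).reshape(len(x), -1)
    y = np.atleast_2d(y).reshape(len(y), -1)
    n, m = len(x), len(y)

    # Cost matrix C[i, j] = |x_i - y_j|^2
    C = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)

    # Marginal constraints on the flattened plan pi_ij (C-order, index i*m + j):
    # rows sum to 1/n (mass leaving x_i), columns sum to 1/m (mass arriving at y_j).
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0
    for j in range(m):
        A_eq[n + j, j::m] = 1.0
    b_eq = np.concatenate([np.full(n, 1.0 / n), np.full(m, 1.0 / m)])

    res = linprog(C.ravel(), A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    return res.fun  # squared W2; take a square root for the distance itself

rng = np.random.default_rng(0)
w2 = np.sqrt(wasserstein2_empirical(rng.normal(0, 1, 30), rng.normal(1, 1, 40)))
print(w2)
```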