In addition to the algebraic reason that Robert Israel gave, there's a very nice "moral reason" that the Kullback-Leibler divergence is not symmetric. Roughly speaking, it's because you should think of the two arguments of the KL divergence as different kinds of things: the first argument is empirical data, and the second argument is a model you're comparing the data to. Here's how it works.
Take a bunch of independent random variables $X_1, \ldots, X_n$ whose possible values lie in a finite set.* Say these variables are identically distributed, with $\operatorname{Pr}(X_i = x) = p_x$. Let $F_{n,x}$ be the fraction of the variables whose values are equal to $x$. The list $F_n = (F_{n,x})_x$ is a random variable, often called the "empirical frequency distribution" of the $X_i$. What does $F_n$ look like when $n$ is very large?
More specifically, let's try to estimate the probabilities of the possible values of $F_n$. Since the set of possible values is different for different $n$, take a sequence of frequency distributions $f_1, f_2, f_3, \ldots$ approaching a fixed frequency distribution $f$, where each $f_n$ is a possible value of $F_n$. It turns out** that
$$\lim_{n \to \infty} \tfrac{1}{n} \ln \operatorname{Pr}(F_n = f_n) = -\operatorname{KL}(f, p).$$
In other words, the Kullback-Leibler divergence of $f$ from $p$ lets you estimate the probability of getting an empirical frequency distribution close to $f$ from a large number of independent random variables with distribution $p$.
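If you'd like to see this numerically, here's a minimal Python sketch (the distributions `p` and `f` below are made-up examples): it computes the exact multinomial log-probability of the achievable counts closest to $f$ and compares $\tfrac{1}{n} \ln \operatorname{Pr}(F_n = f_n)$ with $-\operatorname{KL}(f, p)$.

```python
import math

def log_multinomial_prob(counts, p):
    """Exact log Pr(F_n = counts/n): log[ n!/prod(c_x!) * prod(p_x^c_x) ]."""
    n = sum(counts)
    log_prob = math.lgamma(n + 1)  # ln n!
    for c, px in zip(counts, p):
        log_prob += c * math.log(px) - math.lgamma(c + 1)
    return log_prob

def kl(f, p):
    """Kullback-Leibler divergence KL(f, p), with the convention 0 ln 0 = 0."""
    return sum(fx * math.log(fx / px) for fx, px in zip(f, p) if fx > 0)

p = [0.5, 0.3, 0.2]  # distribution of each X_i
f = [0.2, 0.3, 0.5]  # target empirical frequency distribution

for n in [10, 100, 1000, 10000]:
    counts = [round(fx * n) for fx in f]  # f_n: closest achievable counts
    counts[0] += n - sum(counts)          # force the counts to sum to n
    print(n, log_multinomial_prob(counts, p) / n, -kl(f, p))
```

The second column creeps toward the third as $n$ grows; the gap is the $O(\ln n / n)$ correction coming from Stirling's approximation.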
You can find everything I just said, and more, in the excellent article "Information Theory, Relative Entropy and Statistics," by François Bavaud.
* You can also do this more generally, but I don't know anything about that.
** Using Stirling's approximation, $\ln k! = k\ln k - k + O(\ln k)$.
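For what it's worth, a quick Python check of that footnote (`math.lgamma(k + 1)` computes $\ln k!$ exactly up to floating point):

```python
import math

# The error in Stirling's approximation ln k! ~ k ln k - k is O(ln k):
# it is actually ~ (1/2) ln(2*pi*k), which grows like (1/2) ln k.
for k in [10, 100, 1000, 10000]:
    exact = math.lgamma(k + 1)     # ln k!
    crude = k * math.log(k) - k    # k ln k - k
    print(k, exact - crude, math.log(k))
```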
It is known that min-KL and ML do coincide in the full exponential family. See, for example, here.
In these cases, all of ML theory applies.
In other cases, min-KL can still be seen as an M-estimator (a.k.a. empirical risk minimizer), in which case M-estimation theory applies.
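To make the coincidence concrete, here is a minimal Python sketch for the Bernoulli family (a full exponential family); the data, seed, and grid search are purely illustrative. Minimizing $\operatorname{KL}(\hat f \,\|\, \mathrm{Bernoulli}(\theta))$, where $\hat f$ is the empirical distribution, picks out the same $\theta$ as maximizing the likelihood.

```python
import math
import random

random.seed(0)
data = [1 if random.random() < 0.3 else 0 for _ in range(1000)]
f1 = sum(data) / len(data)   # empirical frequency of ones
emp = [1 - f1, f1]           # empirical distribution over {0, 1}

def kl_to_model(theta):
    """KL divergence of the empirical distribution from Bernoulli(theta)."""
    model = [1 - theta, theta]
    return sum(e * math.log(e / m) for e, m in zip(emp, model) if e > 0)

def avg_neg_log_lik(theta):
    """Average negative log-likelihood of the data under Bernoulli(theta)."""
    return -sum(math.log(theta if x else 1 - theta) for x in data) / len(data)

grid = [i / 1000 for i in range(1, 1000)]
print(min(grid, key=kl_to_model))      # min-KL estimate
print(min(grid, key=avg_neg_log_lik))  # ML estimate; both equal f1 (rounded)
```

The two objectives differ only by the entropy of the empirical distribution, which does not depend on $\theta$, so their minimizers agree.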
Your second question can be answered precisely: $$\mathbf{D}(\mathrm{Bin}(n,p)\,\|\,\mathrm{Bin}(n-1,q)) = \infty$$ (assuming $p > 0$), since the second (reference) distribution places zero mass on the point $n$, whereas the first distribution places nonzero mass there.
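A small plain-Python illustration of this (the parameters are arbitrary): the KL sum hits the term $p_n \log(p_n / 0)$ at $k = n$ and diverges.

```python
from math import comb, log, inf

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

def kl(ps, qs):
    """KL divergence between pmfs on a common support, with 0 log 0 = 0."""
    total = 0.0
    for pk, qk in zip(ps, qs):
        if pk > 0:
            if qk == 0:
                return inf  # reference distribution misses part of the support
            total += pk * log(pk / qk)
    return total

n, p, q = 5, 0.4, 0.4
ps = [binom_pmf(k, n, p) for k in range(n + 1)]
qs = [binom_pmf(k, n - 1, q) if k < n else 0.0 for k in range(n + 1)]
print(kl(ps, qs))  # inf: Bin(n-1, q) puts zero mass at k = n
```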
As for your first question, even though the exact expression is too complicated, it can be approximated extremely well with the following expression $$\mathbf{D}(\mathrm{Bin}(n-1,p)\,\|\,\mathrm{Bin}(n,q))\approx n\mathbf{D}\left(\left(1-\frac1n\right)p\,\middle\|\,q\right).$$
To see this, consider the following probabilistic experiment. The random bits $X_1, X_2, \ldots, X_n$ are chosen as follows. First pick a uniform random coordinate $J \in [n]$ and set $X_J = 0$. Pick the rest of the coordinates independently, each equal to 1 with probability $p$ and 0 with probability $1-p$.
The random bits $Y_1, Y_2, \ldots, Y_n$ are chosen as follows: pick each to be 1 with probability $q$ and 0 with probability $1-q$, independently. Both $X$ and $Y$ are exchangeable, and the number of ones in $X$ is distributed as $\mathrm{Bin}(n-1,p)$ while the number of ones in $Y$ is distributed as $\mathrm{Bin}(n,q)$, so $$\mathbf{D}(\mathrm{Bin}(n-1,p)\,\|\,\mathrm{Bin}(n,q)) = \mathbf{D}(X\,\|\,Y).$$
The right-hand side is approximated, up to an additive $O(\log n)$ term, by $n\,\mathbf{D}((1-1/n)p\,\|\,q)$: each coordinate of $X$ is marginally 1 with probability $(1-\frac1n)p$, and the dependence among the coordinates of $X$ (through the shared index $J$) changes the entropy of $X$ by at most $\log n$.
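Here's a numerical look at how tight the approximation is (a Python sketch with arbitrary $p$ and $q$; `bernoulli_kl` is the binary divergence $\mathbf{D}(a\,\|\,b)$, and the log-pmf is computed via `lgamma` to avoid underflow):

```python
from math import lgamma, log, exp

def log_binom_pmf(k, n, p):
    """log of C(n, k) p^k (1-p)^(n-k), computed stably via lgamma."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

def bernoulli_kl(a, b):
    """Binary divergence D(a || b) between Bernoulli(a) and Bernoulli(b)."""
    return a * log(a / b) + (1 - a) * log((1 - a) / (1 - b))

p, q = 0.3, 0.5
for n in [10, 100, 1000]:
    # Exact D(Bin(n-1, p) || Bin(n, q)), summed over the support {0, ..., n-1}.
    exact = sum(exp(log_binom_pmf(k, n - 1, p))
                * (log_binom_pmf(k, n - 1, p) - log_binom_pmf(k, n, q))
                for k in range(n))
    approx = n * bernoulli_kl((1 - 1 / n) * p, q)
    print(n, exact, approx, log(n))  # the gap stays well within log n
```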
By the way, the equality written in the question can also be obtained this way with no calculation: $$\mathbf{D}(\mathrm{Bin}(n,p)\,\|\,\mathrm{Bin}(n,q)) = n\,\mathbf{D}(p\,\|\,q).$$ Sample the bit string $X$ by setting each of the $n$ bits to 1 independently with probability $p$, and sample the bit string $Y$ the same way, except with probability $q$. Then $\mathbf{D}(\mathrm{Bin}(n,p)\,\|\,\mathrm{Bin}(n,q)) = \mathbf{D}(X\,\|\,Y)$, and $\mathbf{D}(X\,\|\,Y) = n\,\mathbf{D}(p\,\|\,q)$ since the divergence is additive over independent coordinates.
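And a quick check of this identity in the same plain-Python style (it holds exactly, up to floating-point rounding):

```python
from math import lgamma, log, exp

def log_binom_pmf(k, n, p):
    """log of C(n, k) p^k (1-p)^(n-k)."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

def bernoulli_kl(a, b):
    """Binary divergence D(a || b)."""
    return a * log(a / b) + (1 - a) * log((1 - a) / (1 - b))

p, q = 0.3, 0.5
for n in [1, 5, 50]:
    lhs = sum(exp(log_binom_pmf(k, n, p))
              * (log_binom_pmf(k, n, p) - log_binom_pmf(k, n, q))
              for k in range(n + 1))
    print(n, lhs, n * bernoulli_kl(p, q))  # equal up to rounding
```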