I'm not sure if this is still of interest to you, but I think it is possible to get some reasonable bounds if you are okay with dropping the factor of $\frac{1}{2}$. Here's my work, which can be strengthened and refined.
We start by taking two probability mass functions $p$ and $q$, with components $p_i$ and $q_i$. We define the function $f$ by $f_i = q_i - p_i$. Instead of doing anything fancy, we consider the line segment $p_i(t) = p_i + t f_i$. Since $f$ has total mass zero and $p_i(t) = (1-t)p_i + t q_i$ is a convex combination for $t \in [0,1]$, the $p_i(t)$ are well-defined probability distributions that form a straight line in the probability simplex, with $p_i(0) = p_i$ and $p_i(1) = q_i$.
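As a quick sanity check, here is a minimal numerical sketch in Python (the example distributions are arbitrary choices) confirming that the segment stays inside the simplex:

```python
import numpy as np

# Arbitrary example distributions p and q on a common 4-point support.
p = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([0.1, 0.2, 0.3, 0.4])
f = q - p                          # tangent vector; its entries sum to zero
assert np.isclose(f.sum(), 0.0)

# The segment p(t) = p + t*f stays inside the probability simplex for t in [0, 1].
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    pt = p + t * f
    assert np.all(pt >= 0) and np.isclose(pt.sum(), 1.0)
```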
Now we take the Taylor series for the Kullback-Leibler divergence, expanded at $t=0$. This will involve the Fisher metric, but we should expand further to get better results.
When we expand out $(p_i + t f_i)\log\left( \frac{p_i + t f_i}{p_i} \right)$, we get the following:
$$f_i t+\frac{f_i^2 t^2}{2 p_i}-\frac{f_i^3 t^3}{6 p_i^2}+\frac{f_i^4 t^4}{12 p_i^3}-\frac{f_i^5 t^5}{20 p_i^4}+\frac{f_i^6 t^6}{30 p_i^5}+O\left(t^7\right)$$
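This expansion is easy to verify with a computer algebra system; here is a minimal sketch using sympy (the symbol names are arbitrary, and the printed term ordering may differ):

```python
import sympy as sp

p, f, t = sp.symbols('p f t', positive=True)
expr = (p + t * f) * sp.log((p + t * f) / p)

# Expand around t = 0 up to order t**7; the coefficients should match
# 1, 1/2, -1/6, 1/12, -1/20, 1/30 as in the displayed series above.
print(sp.series(expr, t, 0, 7))
```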
When we sum over $i$, the first term vanishes because $\sum_i f_i = 0$, and we can factor a Fisher metric term out of all the others. I will use an integral sign to sum over $i$, as it is suggestive of what should happen in the continuous case.
$$\int f_i t+\frac{f_i^2 t^2}{2 p_i}-\frac{f_i^3 t^3}{6 p_i^2}+\frac{f_i^4 t^4}{12 p_i^3} - \cdots \,di = \int \frac{f_i^2 t^2}{ p_i} \left( \frac{1}{2} - \frac{f_i t}{6 p_i} + \frac{f_i^2 t^2}{12 p_i^2} - \cdots \right) di $$
We find that the series in parentheses on the right-hand side can be summed in closed form. We set $x_i = \frac{f_i t}{p_i}$ and can derive the following:
$$\frac{1}{2} - \frac{x_i}{6} + \frac{x_i^2}{12} - \cdots = \sum_{k=0}^\infty \frac{(-1)^k x_i^k}{(k+1)(k+2)} = \frac{(x_i +1)\log(x_i+1)-x_i}{x_i^2}$$
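One can check this closed form against partial sums of the series numerically; a small sketch, using test points inside the disk of convergence $|x| < 1$:

```python
import numpy as np

def g(x):
    # Closed form ((x+1)log(x+1) - x) / x^2 (equal to 1/2 in the limit x -> 0).
    return ((x + 1) * np.log1p(x) - x) / x**2

def partial_sum(x, terms=200):
    # Partial sum of sum_k (-1)^k x^k / ((k+1)(k+2)).
    k = np.arange(terms)
    return np.sum((-1.0) ** k * x[:, None] ** k / ((k + 1) * (k + 2)), axis=1)

x = np.array([-0.9, -0.5, -0.1, 0.1, 0.5, 0.9])  # series converges for |x| < 1
assert np.allclose(g(x), partial_sum(x))
```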
The closed form should not be surprising; it is very closely related to the original formula for the Kullback-Leibler divergence. In fact, we didn't need the Taylor series except to know to subtract off the pesky $t f_i$ term, so we don't need to worry about convergence: the manipulation is valid without the series. Therefore,
$$KL(p(t), p) = \int \frac{f_i^2 t^2}{ p_i} \left( \frac{( x_i +1)\log(x_i+1)-x_i}{x_i^2} \right) di $$
In order for this to make sense, we need $x_i = \frac{f_i t}{p_i} \geq -1$. This holds because $\frac{f_i}{p_i} = \frac{q_i}{p_i} - 1 \geq -1$ and $t \in [0,1]$. Even better, it turns out that $ \frac{( x_i +1)\log(x_i+1)-x_i}{x_i^2} \leq 1$ on its domain, with the value $1$ attained only in the limit $x_i \to -1$. With this, we are done, because setting $t = 1$ gives
$$KL(q,p) \leq I_p(f,f),$$
where $I_p(f,f) = \sum_i \frac{f_i^2}{p_i}$ is the Fisher metric at $p$ applied to the tangent vector $f = q - p$; the inequality is strict whenever $q \neq p$.
This shows that we can bound the Kullback-Leibler divergence by the Fisher information metric evaluated on a particular tangent vector $f$. Since the KL divergence can blow up, it is worth seeing what happens in that case: whenever it does, the tangent vector $f$ at $p$ must be large in the Fisher metric.
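The final bound is also easy to test empirically. A minimal sketch, using random Dirichlet draws as test distributions:

```python
import numpy as np

rng = np.random.default_rng(0)

def kl(q, p):
    # KL(q || p) with the convention 0 * log(0 / p) = 0.
    mask = q > 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

for _ in range(1000):
    p = rng.dirichlet(np.ones(5))
    q = rng.dirichlet(np.ones(5))
    f = q - p
    fisher = np.sum(f**2 / p)  # I_p(f, f): the Fisher metric applied to f
    assert kl(q, p) <= fisher + 1e-12
```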
The Kullback-Leibler divergence $D_{\rm KL}(Q||P)$ of two distributions $Q,P$ has been generalized to multiple distributions in various ways:
[1] information radius: $R(P_1,\ldots,P_k)=\frac{1}{k}\sum_{i=1}^k D_{\rm KL}\!\left(P_i \,\middle\|\, \frac{1}{k}\sum_{j=1}^k P_j\right)$
[2] average divergence: $K(P_1,\ldots,P_k)=\frac{1}{k(k-1)}\sum_{i \neq j} D_{\rm KL}(P_i\|P_j)$ (both [1] and [2] are illustrated in the sketch after this list)
[3,4] dissimilarity: the weighted arithmetic mean of the KL distances between each of the $P_i$’s and the barycenter of all the $P_i$’s
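For concreteness, here is a small sketch computing the first two quantities with equal weights (the matrix `P` of distributions is an arbitrary choice):

```python
import numpy as np

def kl(a, b):
    mask = a > 0
    return np.sum(a[mask] * np.log(a[mask] / b[mask]))

# Arbitrary example: k = 3 distributions on a 4-point support.
P = np.array([[0.70, 0.10, 0.10, 0.10],
              [0.10, 0.70, 0.10, 0.10],
              [0.25, 0.25, 0.25, 0.25]])
k = len(P)
centroid = P.mean(axis=0)  # the mixture (1/k) * sum_j P_j

info_radius = np.mean([kl(Pi, centroid) for Pi in P])              # [1]
avg_div = sum(kl(Pi, Pj) for Pi in P for Pj in P) / (k * (k - 1))  # [2]
print(info_radius, avg_div)
```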
References
[1] Robin Sibson, Information radius. Probability Theory and Related Fields, 14, 149–160 (1969).
[2] Andrea Sgarro, Informational divergence and the dissimilarity of probability distributions. Calcolo, 18, 293–302 (1981).
[3] Michèle Basseville, Divergence measures for statistical data processing (2010).
[4] Darío García-García and Robert C. Williamson, Divergences and Risks for Multiclass Experiments (2012).
Maximum likelihood estimation:
Let $X_1,\dots,X_n$ be independently and identically distributed observations from a distribution modeled by the parametric family $\mathcal{F} = \{P_{\theta}:\theta\in\Theta\}$. Let us suppose that all the distributions in $\mathcal{F}$ have a common finite support set $\mathcal{X}$. The maximum likelihood estimation (MLE) corresponds to the probability distribution $P_{\theta}$ which maximizes $\small\prod_{i=1}^n P_{\theta}(X_i)$. Let $\hat{P}$ be the empirical distribution of the observations. Then \begin{eqnarray} \small\frac{\prod_{i=1}^n P_{\theta}(X_i)}{\prod_{i=1}^n \hat P(X_i)} & = & \small\prod_{x\in\mathcal{X}} \Big(\frac{P_{\theta}(x)}{\hat P(x)}\Big)^{n\hat P(x)}\\ & = & \small \exp\Big\{n\sum\limits_{x\in\mathcal{X}}\hat P(x)\log\Big(\frac{P_{\theta}(x)}{\hat P(x)}\Big)\Big\}\\ & = & \small\exp\{-nD(\hat P\|P_{\theta})\}. \end{eqnarray}
Thus maximizing $\small\prod_{i=1}^n P_{\theta}(X_i)$ is the same as minimizing $\small D(\hat P\|P_{\theta})$.
Source: I. Csiszár and P. C. Shields, “Information Theory and Statistics: A Tutorial,” Foundations and Trends in Communications and Information Theory, vol. 1, no. 4 (2004).
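To see the identity in action, here is a small numerical sketch (the family, sample size, and parameter values are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(a, b):
    mask = a > 0
    return np.sum(a[mask] * np.log(a[mask] / b[mask]))

# n i.i.d. draws from a 3-point distribution; P_theta is one candidate model.
n = 200
X = rng.choice(3, size=n, p=[0.5, 0.3, 0.2])
p_hat = np.bincount(X, minlength=3) / n   # empirical distribution of the sample
p_theta = np.array([0.4, 0.4, 0.2])       # an arbitrary model distribution

# log of the likelihood ratio prod P_theta(X_i) / prod p_hat(X_i) ...
log_ratio = np.sum(np.log(p_theta[X])) - np.sum(np.log(p_hat[X]))
# ... equals -n * D(p_hat || P_theta), as derived above.
assert np.isclose(log_ratio, -n * kl(p_hat, p_theta))
```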