Let $|j|$ denote the number of times event $j$ occurs in the $n$ trials.
In the proof below, I essentially use the inclusion-exclusion principle.
Suppose event $i$ occurs at least twice. The probability of this is at most
$$P_n\{|i|\ge 2\} \le \binom{n}{2} p_i^2. $$
This can be thought of as follows: pick $2$ locations out of $n$ where event $i$ occurs; the rest are arbitrary, and summing over the $\binom{n}{2}$ choices of locations gives the bound. Note this is only an upper bound (a union bound over pairs of locations), since sequences in which $i$ occurs three or more times are counted more than once.
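As a quick numerical sanity check (not part of the original argument; the function name is mine), one can compare the exact binomial tail with this bound:

```python
# Compare the exact binomial tail P{|i| >= 2} with the bound C(n,2) * p_i^2.
from math import comb

def p_at_least_two(n, p):
    """Exact P{Binomial(n, p) >= 2}."""
    return 1.0 - (1 - p) ** n - n * p * (1 - p) ** (n - 1)

n, p_i = 20, 0.05
exact = p_at_least_two(n, p_i)
bound = comb(n, 2) * p_i ** 2
print(f"exact = {exact:.6f} <= bound = {bound:.6f}")
```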
The probability that at least one event occurs at least twice is then upper bounded by
$$P_n\{\exists i : |i| \ge 2\}\le\binom{n}{2} \sum_i p_i^2.$$
Now, for a fixed pair of distinct events $i$ and $j$, the same argument gives
$$P_n\{|i|\ge 2, |j| \ge 2\} \le \binom{n}{2}\binom{n-2}{2} p_i^2 p_j^2.$$
Pick $2$ locations out of $n$ where event $i$ occurs, then $2$ of the remaining $n-2$ locations where event $j$ occurs, and let the rest be arbitrary. (There is no factor of $\frac{1}{2}$ at this stage: since $i \neq j$, choosing locations for $i$ and then for $j$ produces each configuration exactly once.)
Thus, by a union bound over unordered pairs $\{i,j\}$, the probability that at least two distinct events each occur at least twice is upper bounded by:
$$P_n\{\exists i \neq j : |i|, |j| \ge 2\}\le \frac{1}{2} \binom{n}{2}\binom{n-2}{2} \sum_i\sum_{j\neq i} p_i^2 p_j^2 \le \frac{1}{2}\left(\binom{n}{2}\sum_i p_i^2 \right)^2$$
The factor of $\frac{1}{2}$ appears because the double sum runs over ordered pairs and therefore counts each unordered pair $\{i,j\}$ twice.
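As a sanity check on this bound, here is a small Monte Carlo sketch (the parameters and names are mine, chosen so that $x = \binom{n}{2}\sum_i p_i^2$ is of order one):

```python
# Monte Carlo estimate of P{at least two distinct events each occur >= 2 times},
# compared against the bound (1/2) * (C(n,2) * sum_i p_i^2)^2.
import random
from math import comb

def prob_two_repeated(p, n, trials=100_000):
    k = len(p)
    hits = 0
    for _ in range(trials):
        counts = [0] * k
        for idx in random.choices(range(k), weights=p, k=n):
            counts[idx] += 1
        hits += sum(c >= 2 for c in counts) >= 2
    return hits / trials

p = [0.1] * 10  # uniform over 10 events
n = 5
x = comb(n, 2) * sum(pi ** 2 for pi in p)
print(prob_two_repeated(p, n), "<=", 0.5 * x ** 2)
```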
Thus, the probability that at least one event occurs at least twice is lower bounded by
\begin{align}
P_n\{ \exists i : |i| \ge 2\} &\ge \binom{n}{2} \sum_i p_i^2 - \frac{1}{2}\left(\binom{n}{2}\sum_i p_i^2 \right)^2
\end{align}
This is the inclusion-exclusion principle truncated at the second term: by the Bonferroni inequalities, $P\left(\bigcup_i A_i\right) \ge \sum_i P(A_i) - \sum_{i<j} P(A_i \cap A_j)$ for any events $A_i$, applied here with $A_i = \{|i| \ge 2\}$. (Strictly speaking, the first term has been replaced by its upper bound from above, so this step is heuristic and may need more sophisticated justification.)
Define $x:= \binom{n}{2} \sum_i p_i^2$. Merging the results above,
$$ 1 - x + \frac{x^2}{2} \ge P_n\{ |i|\le 1~\forall i\} \ge 1 - x, $$
which uses the fact that $P_n\{ |i| \le 1~\forall i\} = 1 - P_n\{ \exists i : |i| \ge 2\}$.
As $x \to 0$, i.e., in the regime where $\binom{n}{2}\sum_i p_i^2 \to 0$, both bounds agree with $e^{-x}$ to within $O(x^2)$; indeed $1 - x \le e^{-x} \le 1 - x + \frac{x^2}{2}$ for $x \ge 0$. Consequently, we establish the very crude approximation:
$$P_n\{ |i| \le 1~\forall i\} \approx \exp \left( -\binom{n}{2} \left(\sum_{i \in [1:k]} p_i^2\right)\right)$$
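To see how sharp this approximation is, here is a small numerical illustration (mine, not from the original argument) in the classical birthday setting, where the sandwich $1-x \le e^{-x} \le 1-x+\frac{x^2}{2}$ can be observed directly:

```python
# Compare the Monte Carlo probability that all n outcomes are distinct with
# the bracketing bounds 1 - x and 1 - x + x^2/2 and the approximation e^{-x}.
import random
from math import comb, exp

def prob_all_distinct(p, n, trials=100_000):
    k = len(p)
    hits = 0
    for _ in range(trials):
        draws = random.choices(range(k), weights=p, k=n)
        hits += len(set(draws)) == n
    return hits / trials

p = [1 / 365] * 365  # classical birthday problem
n = 10
x = comb(n, 2) * sum(pi ** 2 for pi in p)
print("lower bound 1 - x        :", 1 - x)
print("Monte Carlo estimate     :", prob_all_distinct(p, n))
print("approximation exp(-x)    :", exp(-x))
print("upper bound 1 - x + x^2/2:", 1 - x + x ** 2 / 2)
```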
From my understanding, the Hellinger distance can be rewritten as (Cha, 2007)
$$
2\sqrt{1 - \sum_{i=1}^n\sqrt{p_iq_i}}
$$
Here we can see that each term $\sqrt{p_iq_i}$ in the sum below is the geometric mean of the two probabilities (the sum itself is known as the Bhattacharyya coefficient). The geometric mean is useful when comparing values with different ranges: it gives a central value for the product of the two probabilities, rather than a middle value in the arithmetic sense.
$$
\sum_{i=1}^n\sqrt{p_iq_i}
$$
In comparing two probability distributions, the probabilities of each event or outcome under both distributions are plugged into the formula and compared. When there is no overlap (for every $i$, one or both of the values is $0$), the maximum distance is assigned; when the supports overlap, some terms are positive and a smaller distance results.
From this point I do not understand why the sum of geometric means is subtracted from $1$ and a second square root is taken. I do know that the formula is very useful for high-dimensional data and for distributions skewed by class imbalance, as the distance metric should be insensitive to skewed data. For example, in computer science one application of the Hellinger distance is anomaly detection.
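One observation that may be a partial answer (a sketch, not a definitive explanation): algebraically, $\sum_{i}(\sqrt{p_i}-\sqrt{q_i})^2 = \sum_i p_i + \sum_i q_i - 2\sum_i \sqrt{p_iq_i} = 2 - 2\sum_i \sqrt{p_iq_i}$, so subtracting the overlap term from $1$ and taking the outer square root turns the formula into a multiple of the Euclidean distance between the vectors $(\sqrt{p_i})_i$ and $(\sqrt{q_i})_i$. The snippet below (function names mine) checks numerically that the two forms agree:

```python
# Check that Cha's form 2*sqrt(1 - sum_i sqrt(p_i*q_i)) equals the
# square-root-Euclidean form sqrt(2 * sum_i (sqrt(p_i) - sqrt(q_i))^2).
from math import sqrt

def hellinger_cha(p, q):
    bc = sum(sqrt(pi * qi) for pi, qi in zip(p, q))  # Bhattacharyya coefficient
    return 2 * sqrt(1 - bc)

def hellinger_sqdiff(p, q):
    return sqrt(2 * sum((sqrt(pi) - sqrt(qi)) ** 2 for pi, qi in zip(p, q)))

p = [0.5, 0.3, 0.2]
q = [0.1, 0.4, 0.5]
print(hellinger_cha(p, q), hellinger_sqdiff(p, q))  # equal up to rounding
```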
Hopefully, someone else can contribute to answer these open questions.
Cha, Sung-Hyuk (2007). Comprehensive Survey on Distance/Similarity Measures between Probability Density Functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(4), 300–307.
Best Answer
Assume that the random variable $X$ is such that $X=\dfrac{p_i}{q_i}$ with probability $q_i$, for every $i$. Then, $$ h(p\mid\mid q)=\sum\limits_iq_i\frac{p_i}{q_i}\ln\left(\frac{p_i}{q_i}\right)=\mathrm E(X\ln X). $$ Since the function $x\mapsto x\ln x$ is convex, $\mathrm E(X\ln X)\geqslant \mathrm E(X)\ln\mathrm E(X)$ by Jensen's inequality.
To complete the proof one must simply compute $\mathrm E(X)$, and I will let you do that.
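For readers who want to verify the argument numerically, here is a minimal sketch (names mine) that evaluates $\mathrm E(X)$ and $h(p\mid\mid q)$ on random distributions:

```python
# Evaluate E[X] and h(p || q) = E[X ln X] for random distributions p, q.
from math import log
import random

def h(p, q):
    """h(p || q) = sum_i q_i * (p_i/q_i) * ln(p_i/q_i)."""
    return sum(qi * (pi / qi) * log(pi / qi) for pi, qi in zip(p, q))

random.seed(0)
w = [random.random() + 1e-9 for _ in range(5)]
v = [random.random() + 1e-9 for _ in range(5)]
p = [wi / sum(w) for wi in w]
q = [vi / sum(v) for vi in v]

print("E[X]    =", sum(qi * (pi / qi) for pi, qi in zip(p, q)))
print("h(p||q) =", h(p, q))  # non-negative, consistent with the Jensen argument
```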