No unbiased estimator either of $\mathfrak{H}$ or of $\mathfrak{H}^2$ exists for $f$ from any reasonably broad nonparametric class of distributions.
We can show this with the beautifully simple argument of
Bickel and Lehmann (1969), "Unbiased estimation in convex families", The Annals of Mathematical Statistics, 40(5): 1523–1535 (Project Euclid).
Fix some distributions $F_0$, $F$, and $G$, with corresponding densities $f_0$, $f$, and $g$.
Let $H(F)$ denote $\mathfrak{H}(f, f_0)$,
and let $\hat H(\mathbf X)$ be some estimator of $H(F)$ based on $n$ iid samples $X_i \sim F$.
Suppose that $\hat H$ is unbiased for samples from any distribution of the form
$$M_\alpha := \alpha F + (1 - \alpha) G, \qquad \alpha \in [0, 1]
.$$
But then
\begin{align}
Q(\alpha)
&= H(M_\alpha)
\\&= \int_{x_1} \cdots \int_{x_n} \hat H(\mathbf X) \,\mathrm{d}M_\alpha(x_1) \cdots\mathrm{d}M_\alpha(x_n)
\\&= \int_{x_1} \cdots \int_{x_n} \hat H(\mathbf X) \left[ \alpha \mathrm{d}F(x_1) + (1-\alpha) \mathrm{d}G(x_1) \right] \cdots \left[ \alpha \mathrm{d}F(x_n) + (1-\alpha) \mathrm{d}G(x_n) \right]
\\&= \alpha^n \operatorname{\mathbb{E}}_{\mathbf X \sim F^n}[ \hat H(\mathbf X)] + \dots + (1 - \alpha)^n \operatorname{\mathbb{E}}_{\mathbf X \sim G^n}[ \hat H(\mathbf X)]
,\end{align}
so that $Q(\alpha)$ must be a polynomial in $\alpha$ of degree at most $n$.
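To see this expansion in action, here is a small Python sketch (entirely my own toy construction; the discrete distributions and the statistic standing in for $\hat H$ are arbitrary choices): the exact expectation of any fixed statistic of $n$ draws from $M_\alpha$ is reproduced to machine precision by a degree-$n$ polynomial in $\alpha$.

```python
# Toy numerical illustration (not part of the original argument): for ANY
# statistic of n samples, its expectation under M_alpha = alpha*F + (1-alpha)*G
# is exactly a polynomial in alpha of degree <= n. F and G are small discrete
# distributions so the expectation can be enumerated exactly.
import itertools
import numpy as np

support = np.array([0.0, 1.0, 2.0])
pF = np.array([0.7, 0.2, 0.1])   # pmf of F (arbitrary choice)
pG = np.array([0.1, 0.3, 0.6])   # pmf of G (arbitrary choice)
n = 3                            # sample size

def H_hat(xs):
    """An arbitrary nonlinear statistic standing in for an estimator."""
    return np.sqrt(1.0 + np.mean(xs)) * np.std(xs)

def expected_H(alpha):
    """Exact E[H_hat(X_1..X_n)] for X_i iid from M_alpha, by enumeration."""
    pM = alpha * pF + (1 - alpha) * pG
    total = 0.0
    for idx in itertools.product(range(len(support)), repeat=n):
        prob = np.prod(pM[list(idx)])
        total += prob * H_hat(support[list(idx)])
    return total

alphas = np.linspace(0, 1, 50)
values = np.array([expected_H(a) for a in alphas])

# A degree-n polynomial reproduces the curve to machine precision,
# confirming that Q(alpha) has degree at most n.
coeffs = np.polyfit(alphas, values, deg=n)
print("max residual:", np.max(np.abs(np.polyval(coeffs, alphas) - values)))
```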
Now, let's specialize to a reasonable case and show that the corresponding $Q$ is not a polynomial.
Let $F_0$ be some distribution which has constant density on $[-1, 1]$: $f_0(x) = c$ for all $\lvert x \rvert \le 1$. (Its behavior outside that range doesn't matter.)
Let $F$ be some distribution supported only on $[-1, 0]$,
and $G$ some distribution supported only on $[0, 1]$.
Now, writing $m_\alpha := \alpha f + (1 - \alpha) g$ for the density of $M_\alpha$,
\begin{align}
Q(\alpha)
&= \mathfrak{H}(m_\alpha, f_0)
\\&= \sqrt{1 - \int_{\mathbb R} \sqrt{m_\alpha(x) f_0(x)} \mathrm{d}x}
\\&= \sqrt{1 - \int_{-1}^0 \sqrt{c \, \alpha f(x)} \mathrm{d}x -
\int_{0}^1 \sqrt{c \, (1 - \alpha) g(x)} \mathrm{d}x}
\\&= \sqrt{1 - \sqrt{\alpha} B_F - \sqrt{1 - \alpha} B_G}
,\end{align}
where $B_F := \int_{\mathbb R} \sqrt{f(x) f_0(x)} \mathrm{d}x$ and likewise for $B_G$.
Note that $B_F > 0$, $B_G > 0$ for any distributions $F$, $G$ which have a density.
$\sqrt{1 - \sqrt{\alpha} B_F - \sqrt{1 - \alpha} B_G}$ is not a polynomial in $\alpha$ of any finite degree: if it were, its square $1 - \sqrt{\alpha} B_F - \sqrt{1 - \alpha} B_G$ would also be a polynomial, but since $B_F > 0$ that expression's derivative blows up like $\alpha^{-1/2}$ as $\alpha \to 0^+$, which no polynomial's derivative does.
Thus, no estimator $\hat H$ can be unbiased for $\mathfrak{H}$ on all of the distributions $M_\alpha$ with finitely many samples.
Likewise, because $1 - \sqrt{\alpha} B_F - \sqrt{1 - \alpha} B_G$ is itself not a polynomial,
there is no estimator for $\mathfrak{H}^2$ that is unbiased on all of the distributions $M_\alpha$ with finitely many samples.
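For a concrete instance (my own choice of distributions, consistent with the setup above): take $F = \mathrm{Unif}[-1, 0]$, $G = \mathrm{Unif}[0, 1]$, and $F_0 = \mathrm{Unif}[-1, 1]$, so that $c = \tfrac12$ and
$$
B_F = \int_{-1}^{0} \sqrt{1 \cdot \tfrac12}\,\mathrm{d}x = \tfrac{1}{\sqrt 2} = B_G,
\qquad
Q(\alpha) = \sqrt{1 - \frac{\sqrt{\alpha} + \sqrt{1 - \alpha}}{\sqrt 2}}
,$$
which correctly vanishes at $\alpha = \tfrac12$ (there $m_{1/2} = f_0$) and visibly involves $\sqrt{\alpha}$, so it cannot be a polynomial.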
This rules out pretty much all reasonable nonparametric classes of distributions, except those whose densities are bounded away from zero (an assumption nonparametric analyses sometimes make). You could probably rule out those classes too with a similar argument, e.g. by mixing piecewise-constant densities that stay bounded below.
The Bhattacharyya coefficient is
$$
BC(h,g)= \int \sqrt{h(x) g(x)}\; dx
$$
in the continuous case, where $h$ and $g$ are densities. There is a good Wikipedia article: https://en.wikipedia.org/wiki/Bhattacharyya_distance. How should one understand this (and the related distance)? Start with the multivariate normal case, which is instructive and can be found at the link above. When the two multivariate normal distributions have the same covariance matrix, the Bhattacharyya distance is proportional to the squared Mahalanobis distance (it equals one eighth of it); with two different covariance matrices it acquires a second, log-determinant term, and so generalizes the Mahalanobis distance. This may underlie claims that in some cases the Bhattacharyya distance works better than the Mahalanobis distance. The Bhattacharyya distance is also closely related to the Hellinger distance: https://en.wikipedia.org/wiki/Hellinger_distance.
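To make the multivariate normal case concrete, here is a short Python sketch (the function and the example numbers are mine; the closed-form expression is the one given on the Wikipedia page linked above):

```python
# Closed-form Bhattacharyya distance between two multivariate normals.
# When the covariances are equal, only the first term survives and D_B is
# one eighth of the squared Mahalanobis distance.
import numpy as np

def bhattacharyya_gaussian(mu1, Sigma1, mu2, Sigma2):
    Sigma = (Sigma1 + Sigma2) / 2.0
    diff = mu1 - mu2
    maha_term = diff @ np.linalg.solve(Sigma, diff) / 8.0
    log_term = 0.5 * np.log(
        np.linalg.det(Sigma)
        / np.sqrt(np.linalg.det(Sigma1) * np.linalg.det(Sigma2))
    )
    return maha_term + log_term

mu1, mu2 = np.array([0.0, 0.0]), np.array([1.0, 2.0])
S = np.array([[2.0, 0.3], [0.3, 1.0]])

# Equal covariances: D_B equals the squared Mahalanobis distance / 8.
d_equal = bhattacharyya_gaussian(mu1, S, mu2, S)
maha_sq = (mu1 - mu2) @ np.linalg.solve(S, mu1 - mu2)
print(d_equal, maha_sq / 8.0)              # identical

# Different covariances: the extra log-determinant term kicks in.
print(bhattacharyya_gaussian(mu1, S, mu2, 3.0 * S))
```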
Working with the formula above, we can find some stochastic interpretation. Write
$$ \DeclareMathOperator{\E}{\mathbb{E}}
BC(h,g) = \int \sqrt{h(x) g(x)}\; dx
= \int h(x) \, \sqrt{\frac{g(x)}{h(x)}}\; dx
= \E_h \sqrt{\frac{g(X)}{h(X)}}
$$
so it is the expected value of the square root of the likelihood ratio statistic, calculated under the distribution $h$ (the null distribution of $X$). That invites comparison with the question "Intuition on the Kullback-Leibler (KL) Divergence", which interprets the Kullback–Leibler divergence as the expectation of the log-likelihood ratio statistic (but calculated under the alternative $g$). Such a viewpoint might be interesting in some applications.
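As a quick illustration of this interpretation, here is a minimal Monte Carlo sketch (with toy normal densities of my own choosing) that estimates $BC(h,g)$ as an expectation under $h$ and compares it with the closed form for two univariate normals:

```python
# Monte Carlo estimate of BC(h, g) = E_h[sqrt(g(X)/h(X))] by sampling from h.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
h = stats.norm(loc=0.0, scale=1.0)   # "null" distribution h
g = stats.norm(loc=1.0, scale=2.0)   # alternative g

x = rng.normal(loc=0.0, scale=1.0, size=200_000)   # samples from h
bc_mc = np.mean(np.sqrt(g.pdf(x) / h.pdf(x)))

# Closed form for two univariate normals (see the Wikipedia page above):
# BC = exp(-D_B), D_B = (mu1-mu2)^2/(4(s1^2+s2^2)) + 0.5*log((s1^2+s2^2)/(2*s1*s2)).
mu1, s1, mu2, s2 = 0.0, 1.0, 1.0, 2.0
D_B = (mu1 - mu2) ** 2 / (4 * (s1 ** 2 + s2 ** 2)) + 0.5 * np.log(
    (s1 ** 2 + s2 ** 2) / (2 * s1 * s2)
)
print(bc_mc, np.exp(-D_B))   # agree to Monte Carlo accuracy
```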
Still another viewpoint: compare with the general family of $f$-divergences (see also Rényi entropy), defined as
$$
D_f(h,g) = \int h(x) f\left( \frac{g(x)}{h(x)}\right)\; dx
$$
If we choose $f(t)= 4\left( \frac{1+t}{2}-\sqrt{t} \right)$, the resulting $f$-divergence is (a constant multiple of) the squared Hellinger distance, from which we can calculate the Bhattacharyya coefficient. This can also be seen as an example of a Rényi divergence, obtained from a Rényi entropy; see the link above.
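As a numerical sanity check of that choice (my own toy example, with normal densities): for $f(t)= 4\left( \frac{1+t}{2}-\sqrt{t} \right)$ the integral reduces to $D_f(h,g) = 4\,(1 - BC(h,g))$, so the Bhattacharyya coefficient can be read back off as $1 - D_f/4$.

```python
# Numerical check that D_f(h, g) = 4*(1 - BC(h, g)) for f(t) = 4*((1+t)/2 - sqrt(t)).
import numpy as np
from scipy import stats
from scipy.integrate import quad

h = stats.norm(0.0, 1.0).pdf
g = stats.norm(1.0, 2.0).pdf

def f(t):
    # The choice of f discussed above.
    return 4.0 * ((1.0 + t) / 2.0 - np.sqrt(t))

# Finite limits keep g(x)/h(x) from overflowing in the far tails.
D_f, _ = quad(lambda x: h(x) * f(g(x) / h(x)), -20, 20)
BC, _ = quad(lambda x: np.sqrt(h(x) * g(x)), -20, 20)

print(D_f, 4.0 * (1.0 - BC))     # agree up to quadrature error
print(1.0 - D_f / 4.0, BC)       # BC recovered from D_f
```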
Best Answer
This doesn't really answer the question, but it may be helpful anyway.
All applications of the Hellinger distance I can think of are invariant to whether or not there is a factor of 2 in the definition, provided that, e.g., threshold values are adjusted by the same factor. Obviously, whichever version is used needs to be used consistently, so it is advisable to give the formula explicitly whenever the Hellinger distance is used.
For this reason, most mathematicians would consider the two versions equivalent. There is no consensus about which one is right, and there is no authority that would enforce such a consensus. Most mathematicians would think that no such consensus is needed, as the two are "the same" in all relevant aspects anyway.
Historically, one way such a situation can emerge (I don't know about the Hellinger distance in particular) is that somebody defines a concept originally, and somebody else later discovers that the same concept, multiplied by a constant factor not present in the original definition, arises naturally out of theoretical considerations that help motivate the concept; after that, there is a reason to regard both versions as legitimate.
Generally, the mathematical way of looking at such things is that names and notation should not be taken as having a universally agreed meaning; rather, they should be explicitly defined when used, and then they are what they are defined to be in that specific place.
It has to be admitted, though, that there are limitations to this attitude. For pragmatic reasons, work of a certain complexity cannot define everything from scratch, and non-mathematicians are understandably often baffled when the same name apparently does not refer to (exactly) the same thing. So certain conventions are required, and some exist (too many from the point of view of some pure mathematicians; not enough from the point of view of many other people).
As another example, I am personally annoyed that the BIC and AIC, as used for model selection, appear in some literature in the positive form and in other literature in the negative form, so that in one case "larger is better" and in the other "smaller is better". Authors certainly need to tell readers explicitly which version is used, but in many places this is not done, and the reader has to guess which one it is by looking at the reported results.