The basic results of chi-square goodness-of-fit testing can be understood hierarchically.
Level 0. The classical Pearson's chi-square test statistic for testing a multinomial sample against a fixed probability vector $p$ is
$$
X^2(p) = \sum_{i=1}^k \frac{(X^{(n)}_i - n p_i)^2}{n p_i} \stackrel{d}{\to} \chi_{k-1}^2 \>,
$$
where $X_i^{(n)}$ denotes the number of outcomes in the $i$th cell out of a sample of size $n$. This can be fruitfully viewed as the squared norm of the vector $\mathbf Y_n = (Y_1^{(n)},\ldots,Y_k^{(n)})$, where $Y_i^{(n)} = (X_i^{(n)} - n p_i)/\sqrt{n p_i}$, which, by the multivariate central limit theorem, converges in distribution as
$$
\mathbf Y_n \stackrel{d}{\to} \mathcal N(0, \mathbf I - \sqrt{p}\sqrt{p}^T) \>.
$$
From this and the continuous mapping theorem we see that $X^2 = \|\mathbf Y_n\|^2 \stackrel{d}{\to} \chi^2_{k-1}$, since $\mathbf I - \sqrt{p}\sqrt{p}^T$ is idempotent of rank $k-1$.
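To make Level 0 concrete, here is a minimal simulation sketch in Python (the probability vector $p$, sample size, and seed are arbitrary choices of mine): it computes Pearson's statistic for one sample and then checks the $\chi^2_{k-1}$ approximation by Monte Carlo.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Fixed null probability vector (an arbitrary choice) and sample size
p = np.array([0.2, 0.3, 0.4, 0.1])
k, n = len(p), 1000

# One multinomial sample and its Pearson statistic
x = rng.multinomial(n, p)
X2 = np.sum((x - n * p) ** 2 / (n * p))
print(f"X^2 = {X2:.3f}, p-value = {stats.chi2.sf(X2, df=k - 1):.3f}")

# Monte Carlo check of the chi^2_{k-1} limit
reps = rng.multinomial(n, p, size=20000)
X2s = np.sum((reps - n * p) ** 2 / (n * p), axis=1)
print("simulated 95th percentile: ", np.quantile(X2s, 0.95))
print("chi^2_{k-1} 95th percentile:", stats.chi2.ppf(0.95, df=k - 1))
```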
Level 1. At the next level of the hierarchy, we consider composite hypotheses for multinomial samples. Since the exact $p$ of interest is unknown under the null hypothesis, it must be estimated. If the composite null hypothesis specifies a smooth $m$-dimensional subset of the probability simplex (e.g., a linear subspace of dimension $m$), then maximum likelihood estimates (or other efficient estimators) of the $p_i$, computed from the cell frequencies, can be used as "plug-in" estimators. Then, the statistic
$$
X^2_1 = \sum_{i=1}^k \frac{(X^{(n)}_i - n \hat{p}_i)^2}{n \hat{p}_i} \stackrel{d}{\to} \chi_{k-m - 1}^2 \>,
$$
under the null hypothesis.
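For a concrete instance of Level 1 (a hedged sketch; the Hardy–Weinberg model is my choice of illustration, not something from the text above): the null that three genotype counts have cell probabilities $(\theta^2,\, 2\theta(1-\theta),\, (1-\theta)^2)$ is a smooth one-parameter family ($m = 1$), the grouped-data MLE has the closed form $\hat\theta = (2X_1 + X_2)/(2n)$, and the plug-in statistic is referred to $\chi^2_{k-m-1} = \chi^2_1$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hardy-Weinberg null: cell probabilities (t^2, 2t(1-t), (1-t)^2); m = 1
def hw_probs(t):
    return np.array([t**2, 2 * t * (1 - t), (1 - t) ** 2])

n, theta_true = 500, 0.3          # illustrative values
x = rng.multinomial(n, hw_probs(theta_true))

# Grouped-data MLE of theta (closed form for this model)
theta_hat = (2 * x[0] + x[1]) / (2 * n)

p_hat = hw_probs(theta_hat)
X2 = np.sum((x - n * p_hat) ** 2 / (n * p_hat))
# k = 3 cells, m = 1 estimated parameter => k - m - 1 = 1 degree of freedom
print(f"X^2 = {X2:.3f}, p-value = {stats.chi2.sf(X2, df=1):.3f}")
```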
Level 2. Now consider goodness-of-fit testing of a parametric model where the cells are fixed and known in advance. For example, suppose we have a sample from an exponential distribution with rate $\lambda$ and we produce a multinomial sample from it by binning into $k$ fixed cells. Then the above result still holds, provided that we use efficient estimates (e.g., MLEs) of the bin probabilities themselves, computed using only the observed cell frequencies.
If the number of parameters for the distribution is $m$ (e.g., $m = 1$ in the exponential case), then
$$
X^2_2 = \sum_{i=1}^k \frac{(X^{(n)}_i - n \hat{p}_i)^2}{n \hat{p}_i} \stackrel{d}{\to} \chi_{k-m - 1}^2 \>,
$$
where the $\hat{p}_i$ can be taken to be the MLEs of the cell probabilities of the fixed, known cells under the distribution of interest, computed from the cell frequencies alone.
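A minimal sketch of Level 2 in Python (the rate, the cell boundaries, and the seed are my arbitrary choices): we bin an exponential sample into fixed cells and maximize the multinomial (grouped-data) log-likelihood over $\lambda$ numerically, then refer the statistic to $\chi^2_{k-2}$.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(2)

n, lam_true = 400, 1.5
z = rng.exponential(scale=1 / lam_true, size=n)

# Fixed, pre-specified cell boundaries (an arbitrary choice)
edges = np.array([0.0, 0.25, 0.5, 1.0, 2.0, np.inf])
counts, _ = np.histogram(z, bins=edges)
k = len(counts)

def cell_probs(lam):
    cdf = 1 - np.exp(-lam * edges)   # exponential CDF at the cell edges
    cdf[-1] = 1.0                    # F(inf) = 1
    return np.diff(cdf)

# Grouped-data MLE: maximize the multinomial log-likelihood in lambda
neg_loglik = lambda lam: -np.sum(counts * np.log(cell_probs(lam)))
res = optimize.minimize_scalar(neg_loglik, bounds=(1e-3, 10.0), method="bounded")
lam_hat = res.x

p_hat = cell_probs(lam_hat)
X2 = np.sum((counts - n * p_hat) ** 2 / (n * p_hat))
print(f"lambda_hat = {lam_hat:.3f}, X^2 = {X2:.3f}, "
      f"p-value = {stats.chi2.sf(X2, df=k - 2):.3f}")   # df = k - m - 1 = k - 2
```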
Level 3. But, wait! If we have a sample $Z_1,\ldots,Z_n \sim F_\lambda$, why shouldn't we estimate $\lambda$ efficiently first, and then plug the result into a chi-square statistic with our fixed, known cells? Well, we can, but in general we no longer get a chi-square limit for the corresponding statistic. In fact, Chernoff and Lehmann (1954) showed that using MLEs based on the raw data to estimate the parameters, and then plugging them in to get estimates of the cell probabilities, results in a non-chi-square limiting distribution in general. Under suitable regularity conditions, the limit lies (stochastically) between a $\chi_{k-m-1}^2$ and a $\chi_{k-1}^2$ random variable, with the exact distribution depending on the true parameter value.
Intuitively, this means that the limiting distribution of $\mathbf Y_n$ is $\mathcal N(0, \mathbf I - \sqrt{p_\lambda}\sqrt{p_\lambda}^T - \mathbf A(\lambda))$, where $\mathbf A(\lambda)$ is a positive semidefinite matrix reflecting the extra information the raw data carry beyond the cell counts.
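Here is a hedged Monte Carlo sketch of the Chernoff–Lehmann effect, continuing the exponential example (cells, rate, and seed are again my arbitrary choices): with the raw-data MLE $\hat\lambda = 1/\bar Z$ plugged in, the statistic rejects too often against $\chi^2_{k-m-1}$ and too rarely against $\chi^2_{k-1}$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

n, lam, B = 400, 1.5, 5000
edges = np.array([0.0, 0.25, 0.5, 1.0, 2.0, np.inf])
k = len(edges) - 1

def cell_probs(l):
    cdf = 1 - np.exp(-l * edges)
    cdf[-1] = 1.0
    return np.diff(cdf)

X2s = np.empty(B)
for b in range(B):
    z = rng.exponential(scale=1 / lam, size=n)
    lam_hat = 1 / z.mean()                 # raw-data MLE, not grouped-data MLE
    counts, _ = np.histogram(z, bins=edges)
    p_hat = cell_probs(lam_hat)
    X2s[b] = np.sum((counts - n * p_hat) ** 2 / (n * p_hat))

# Rejection rates at nominal level 0.05: anticonservative against chi^2_{k-2},
# conservative against chi^2_{k-1}, reflecting the "in between" limit
for df in (k - 2, k - 1):
    crit = stats.chi2.ppf(0.95, df)
    print(f"df = {df}: rejection rate = {np.mean(X2s > crit):.3f}")
```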
We haven't even talked about random cell boundaries yet, and we're already in a bit of a tight spot! There are two ways out: One is to retreat to Level 2, or at the very least not to use efficient raw-data estimators (like MLEs) of the underlying parameter $\lambda$. The second approach is to try to undo the effects of $\mathbf A(\lambda)$ in such a way as to recover a chi-square distribution.
There are several ways of taking the latter route. They basically amount to premultiplying $\mathbf Y_n$ by the "right" matrix $\mathbf B(\hat{\lambda})$, chosen so that the quadratic form
$$
\mathbf Y_n^T \mathbf B^T \mathbf B \mathbf Y_n
$$
again has a limiting chi-square distribution, with degrees of freedom depending on the particular construction. Examples are the Rao–Robson–Nikulin statistic, which is asymptotically $\chi_{k-1}^2$ (where $k$ is the number of cells), and the Dzhaparidze–Nikulin statistic, which is asymptotically $\chi_{k-m-1}^2$.
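As a sketch of one such construction (this is my reading of the Dzhaparidze–Nikulin projection idea; see the references for precise statements): form the $k \times m$ matrix $\mathbf V$ with entries $V_{ij} = \hat p_i^{-1/2}\, \partial p_i(\hat\lambda)/\partial \lambda_j$ and project $\mathbf Y_n$ onto the orthogonal complement of its columns, giving a statistic that is asymptotically $\chi^2_{k-m-1}$ even with the raw-data MLE.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

n, lam = 400, 1.5
edges = np.array([0.0, 0.25, 0.5, 1.0, 2.0, np.inf])
k = len(edges) - 1

def cell_probs(l):
    cdf = 1 - np.exp(-l * edges)
    cdf[-1] = 1.0
    return np.diff(cdf)

z = rng.exponential(scale=1 / lam, size=n)
lam_hat = 1 / z.mean()                         # raw-data MLE
counts, _ = np.histogram(z, bins=edges)
p_hat = cell_probs(lam_hat)
Y = (counts - n * p_hat) / np.sqrt(n * p_hat)  # the vector Y_n

# Column(s) of V: scaled derivative of the cell probabilities at lam_hat,
# obtained here by a central difference (m = 1, so V is k x 1)
eps = 1e-6
dp = (cell_probs(lam_hat + eps) - cell_probs(lam_hat - eps)) / (2 * eps)
V = (dp / np.sqrt(p_hat)).reshape(-1, 1)

# Project Y onto the orthogonal complement of the columns of V
P = np.eye(k) - V @ np.linalg.solve(V.T @ V, V.T)
U = Y @ P @ Y
print(f"U = {U:.3f}, p-value = {stats.chi2.sf(U, df=k - 2):.3f}")
```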
Level 4. Random cells. In the case of random cells, under certain regularity conditions, we end up in the same situation as in Level 3 if we take the route of modifying the Pearson chi-square statistic. Location-scale families, in particular, behave very nicely. One common approach is to take each of our $k$ cells to have probability $1/k$, nominally. So, our random cells are intervals of the form $\hat{I}_j = \hat \mu + \hat\sigma I_{0,j}$, where $I_{0,j} = [F^{-1}((j-1)/k), F^{-1}(j/k))$ and $F$ is the standard member of the location-scale family. This result has been further extended to the case where the number of random cells grows with the sample size (Drost, 1989).
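A small sketch of the random-cell construction for a normal (location-scale) sample, with $k$, the parameters, and the seed my own choices; the counts produced here would then feed a modified statistic such as the Rao–Robson–Nikulin statistic rather than the plain Pearson statistic.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

n, k = 300, 8
z = rng.normal(loc=2.0, scale=3.0, size=n)     # illustrative parameters

# Location-scale estimates (MLEs for the normal model)
mu_hat, sigma_hat = z.mean(), z.std(ddof=0)

# Base cells I_{0,j} with probability 1/k each under the standard normal F,
# then shifted and scaled: random edges mu_hat + sigma_hat * F^{-1}(j/k)
base_edges = stats.norm.ppf(np.linspace(0, 1, k + 1))   # -inf, ..., +inf
edges = mu_hat + sigma_hat * base_edges

counts, _ = np.histogram(z, bins=edges)
print("cell counts:", counts)                  # each cell nominally holds n/k
print("nominal expected count per cell:", n / k)
```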
References
A. W. van der Vaart (1998), Asymptotic Statistics, Cambridge University Press. Chapter 17: Chi-Square Tests.
H. Chernoff and E. L. Lehmann (1954), The use of maximum likelihood estimates in $\chi^2$ tests for goodness of fit, Ann. Math. Statist., vol. 25, no. 3, 579–586.
F. C. Drost (1989), Generalized chi-square goodness-of-fit tests for location-scale models when the number of classes tends to infinity, Ann. Statist., vol. 17, no. 3, 1285–1300.
M. S. Nikulin (1973), Chi-square test for continuous distributions with shift and scale parameters, Theory of Probability and Its Applications, vol. 19, no. 3, 559–568.
K. O. Dzhaparidze and M. S. Nikulin (1973), On a modification of the standard statistics of Pearson, Theory of Probability and Its Applications, vol. 19, no. 4, 851–853.
K. C. Rao and D. S. Robson (1974), A chi-square statistic for goodness of fit tests within exponential family, Comm. Statist., vol. 3, no. 12, 1139–1153.
N. Balakrishnan, V. Voinov and M. S. Nikulin (2013), Chi-Squared Goodness of Fit Tests With Applications, Academic Press.
Best Answer
@Silverfish asked for an expansion of the answer by PolatAlemdar, which was not given, so I will try to expand on it here.
Why the name chi-square distance? The chi-square test for contingency tables is based on $$ \chi^2 = \sum_{\text{cells}} \frac{(O_i-E_i)^2}{E_i} $$ so the idea is to keep this form and use it as a distance measure. This gives the third formula of the OP, with $x_i$ interpreted as the observation and $y_i$ as the expectation, which explains PolatAlemdar's comment "It is used in discrete probability distributions", as for instance in goodness-of-fit testing. This third form is not a distance function, since it is asymmetric in the variables $x$ and $y$. For histogram comparison, we want a distance function that is symmetric in $x$ and $y$, and the first two forms give this. They differ only by a constant factor $\frac12$, which is unimportant as long as you choose one form consistently (though the version with the extra factor $\frac12$ is better if you want to compare with the asymmetric form). Note the similarity of these formulas to squared Euclidean distance; that is no coincidence: the chi-square distance is a kind of weighted Euclidean distance. For that reason, the formulas in the OP are usually put under a root sign to get distances, and we follow that convention below.
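A small sketch of both forms in Python (assuming, since the OP's formulas are not reproduced here, the usual histogram versions: the asymmetric $\sum_i (x_i-y_i)^2/y_i$ and the symmetric $\tfrac12\sum_i (x_i-y_i)^2/(x_i+y_i)$, each placed under a root sign):

```python
import numpy as np

def chi2_asymmetric(x, y):
    # Pearson-style form: x as "observed", y as "expected"; not symmetric
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.sqrt(np.sum((x - y) ** 2 / y))

def chi2_symmetric(x, y):
    # Symmetric form with the factor 1/2; cells with x + y = 0 are skipped
    x, y = np.asarray(x, float), np.asarray(y, float)
    s = x + y
    m = s > 0
    return np.sqrt(0.5 * np.sum((x[m] - y[m]) ** 2 / s[m]))

h1 = [10, 20, 30, 40]
h2 = [12, 18, 33, 37]
print(chi2_symmetric(h1, h2), chi2_symmetric(h2, h1))    # equal: a distance
print(chi2_asymmetric(h1, h2), chi2_asymmetric(h2, h1))  # differ: asymmetric
```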
The chi-square distance is also used in correspondence analysis. To see the relationship to the form used there, let $x_{ij}$ be the cells of a contingency table with $R$ rows and $C$ columns. Denote the column totals by $x_{+j}=\sum_i x_{ij}$ and the row totals by $x_{i+}=\sum_j x_{ij}$. Then the chi-square distance between rows $l$ and $k$ is given by $$ \chi^2(l,k) = \sqrt{\sum_j \frac1{x_{+j}}\left(\frac{x_{lj}}{x_{l+}}-\frac{x_{kj}}{x_{k+}} \right)^2 } \>. $$ For the case with only two rows (the two histograms) this recovers the OP's first formula (modulo the root sign).
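Here is a direct sketch of that row-distance formula (the example table is made up):

```python
import numpy as np

def chi2_row_distance(table, l, k):
    # Chi-square distance between rows l and k, per the formula above
    X = np.asarray(table, float)
    col_tot = X.sum(axis=0)          # x_{+j}
    row_tot = X.sum(axis=1)          # x_{i+}
    diff = X[l] / row_tot[l] - X[k] / row_tot[k]
    return np.sqrt(np.sum(diff ** 2 / col_tot))

table = [[20, 30, 50],               # a made-up contingency table
         [25, 25, 50],
         [40, 20, 40]]
print(chi2_row_distance(table, 0, 1))
```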
Answering a question from the comments: a book with long discussions of the chi-square distance is Correspondence Analysis in Practice (Second Edition) by Michael Greenacre (Chapman & Hall). The name is well established, coming from its similarity to the chi-square statistic as used with contingency tables. What distribution does it have? I have never studied that, but probably (under some conditions ...) it would have some chi-square distribution, approximately. Proofs should be similar to what is done with contingency tables; most literature about correspondence analysis does not go into distribution theory. A paper having some, maybe relevant, such theory is "Alternative Methods to Multiple Correspondence Analysis in Reconstructing the Relevant Information in a Burt's Table". See also other relevant posts on this site.