The basic results of chi-square goodness-of-fit testing can be understood hierarchically.
Level 0. The classical Pearson's chi-square test statistic for testing a multinomial sample against a fixed probability vector $p$ is
$$
X^2(p) = \sum_{i=1}^k \frac{(X^{(n)}_i - n p_i)^2}{n p_i} \stackrel{d}{\to} \chi_{k-1}^2 \>,
$$
where $X_i^{(n)}$ denotes the number of outcomes in the $i$th cell out of a sample of size $n$. This can be fruitfully viewed as the squared norm of the vector $\mathbf Y_n = (Y_1^{(n)},\ldots,Y_k^{(n)})$, where $Y_i^{(n)} = (X_i^{(n)} - n p_i)/\sqrt{n p_i}$, which, by the multivariate central limit theorem, converges in distribution as
$$
\mathbf Y_n \stackrel{d}{\to} \mathcal N(0, \mathbf I - \sqrt{p}\sqrt{p}^T) \>.
$$
From this and the continuous mapping theorem we see that $X^2 = \|\mathbf Y_n\|^2 \stackrel{d}{\to} \chi^2_{k-1}$, since $\mathbf I - \sqrt{p}\sqrt{p}^T$ is idempotent of rank $k-1$.
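For concreteness, here is a minimal simulation sketch of the Level 0 result (the uniform null, sample sizes, and seed are my own hypothetical choices, not from any source): the empirical rejection rate of the Pearson statistic should match the nominal $\chi^2_{k-1}$ level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
k, n, n_sim = 5, 500, 2000
p = np.full(k, 1 / k)  # fixed null probability vector

# Simulate the Pearson statistic under the null and compare to chi^2_{k-1}.
x2 = np.empty(n_sim)
for s in range(n_sim):
    counts = rng.multinomial(n, p)
    x2[s] = np.sum((counts - n * p) ** 2 / (n * p))

# Empirical upper-tail probability should be close to the nominal 0.05.
print(np.mean(x2 > stats.chi2.ppf(0.95, df=k - 1)))
```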
Level 1. At the next level of the hierarchy, we consider composite hypotheses with multinomial samples. Since the exact $p$ of interest is unknown under the null hypothesis, we have to estimate it. If the composite null hypothesis consists of an $m$-dimensional smooth subfamily of the probability simplex (e.g., a linear subspace of dimension $m$), then maximum likelihood estimates (or other asymptotically efficient estimators) of the $p_i$ can be used as "plug-in" estimators. Then, the statistic
$$
X^2_1 = \sum_{i=1}^k \frac{(X^{(n)}_i - n \hat{p}_i)^2}{n \hat{p}_i} \stackrel{d}{\to} \chi_{k-m - 1}^2 \>,
$$
under the null hypothesis.
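As a concrete illustration (my own choice of example and counts, not from any particular source), here is the classic Hardy–Weinberg setup: a one-parameter ($m = 1$) subfamily of the $k = 3$ simplex, $p = (\theta^2, 2\theta(1-\theta), (1-\theta)^2)$, where the multinomial MLE of $\theta$ has a closed form.

```python
import numpy as np
from scipy import stats

counts = np.array([298, 489, 213])  # hypothetical genotype counts: AA, Aa, aa
n = counts.sum()

# MLE of the allele frequency theta under the multinomial likelihood
theta = (2 * counts[0] + counts[1]) / (2 * n)
p_hat = np.array([theta**2, 2 * theta * (1 - theta), (1 - theta) ** 2])

# Plug-in Pearson statistic, compared to chi^2 with k - m - 1 = 1 df
x2 = np.sum((counts - n * p_hat) ** 2 / (n * p_hat))
print(x2, stats.chi2.sf(x2, df=len(counts) - 1 - 1))
```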
Level 2. Now consider goodness-of-fit testing of a parametric model where the cells are fixed and known in advance. For example, suppose we have a sample from an exponential distribution with rate $\lambda$ and we produce a multinomial sample by binning the observations into $k$ cells. The above result still holds, provided that we use efficient estimates (e.g., MLEs) of the bin probabilities themselves, computed using only the observed frequencies.
If the number of parameters for the distribution is $m$ (e.g., $m = 1$ in the exponential case), then
$$
X^2_2 = \sum_{i=1}^k \frac{(X^{(n)}_i - n \hat{p}_i)^2}{n \hat{p}_i} \stackrel{d}{\to} \chi_{k-m - 1}^2 \>,
$$
where here $\hat{p}_i$ can be taken to be the MLEs of the cell probabilities of the fixed, known cells corresponding to the given distribution of interest.
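Here is a sketch of Level 2 for the exponential example (the cell boundaries are a hypothetical choice). The key point is that $\lambda$ is estimated by maximizing the multinomial (grouped-data) likelihood of the observed frequencies, not the raw-data likelihood.

```python
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(1)
n, lam_true = 2000, 1.3
z = rng.exponential(1 / lam_true, size=n)

# Fixed, known cell boundaries (hypothetical choice); last cell is open-ended.
edges = np.array([0.0, 0.5, 1.0, 1.5, 2.5, np.inf])
counts = np.histogram(z, bins=edges)[0]

def cell_probs(lam):
    cdf = 1 - np.exp(-lam * edges)   # Exp(lam) CDF at the edges; CDF(inf) = 1
    return np.diff(cdf)

# Grouped-data MLE: maximize the multinomial likelihood of the observed
# frequencies (not the raw-data likelihood), as Level 2 requires.
def neg_loglik(lam):
    return -np.sum(counts * np.log(cell_probs(lam)))

lam_hat = optimize.minimize_scalar(neg_loglik, bounds=(1e-3, 10), method="bounded").x

p_hat = cell_probs(lam_hat)
x2 = np.sum((counts - n * p_hat) ** 2 / (n * p_hat))
# k = 5 cells, m = 1 parameter => compare to chi^2 with k - m - 1 = 3 df.
print(x2, stats.chi2.sf(x2, df=len(counts) - 1 - 1))
```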
Level 3. But, wait! If we have a sample $Z_1,\ldots,Z_n \sim F_\lambda$, why shouldn't we estimate $\lambda$ efficiently first, and then use a chi-square statistic with our fixed, known cells? Well, we can, but in general we no longer get a chi-square distribution for the corresponding chi-square statistic. In fact, Chernoff and Lehmann (1954) showed that using MLEs to estimate the parameters and then plugging them back in to get estimates of the cell probabilities results in a non-chi-square distribution, in general. Under suitable regularity conditions, the distribution is (stochastically) between a $\chi_{k-m-1}^2$ and a $\chi_{k-1}^2$ random variable, with the distribution depending on the parameters.
Intuitively, this happens because the limiting distribution of $\mathbf Y_n$ becomes $\mathcal N(0, \mathbf I - \sqrt{p_\lambda}\sqrt{p_\lambda}^T - \mathbf A(\lambda))$ for some positive semidefinite matrix $\mathbf A(\lambda)$ depending on $\lambda$, and this covariance is no longer a projection matrix.
We haven't even talked about random cell boundaries yet, and we're already in a bit of a tight spot! There are two ways out: one is to retreat to Level 2, or at the very least to avoid efficient estimators (like MLEs) of the underlying parameter $\lambda$. The other is to try to undo the effects of $\mathbf A(\lambda)$ in such a way as to recover a chi-square distribution.
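A quick simulation (same hypothetical exponential setup as before) makes the Chernoff–Lehmann effect visible: plugging in the raw-data MLE $\hat\lambda = 1/\bar Z$ yields a statistic whose distribution sits between $\chi^2_{k-2}$ and $\chi^2_{k-1}$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, lam, n_sim = 1000, 1.0, 3000
edges = np.array([0.0, 0.5, 1.0, 1.5, 2.5, np.inf])  # k = 5 fixed cells
k = len(edges) - 1

x2 = np.empty(n_sim)
for s in range(n_sim):
    z = rng.exponential(1 / lam, size=n)
    lam_hat = 1 / z.mean()                       # raw-data MLE, *not* grouped MLE
    p_hat = np.diff(1 - np.exp(-lam_hat * edges))
    counts = np.histogram(z, bins=edges)[0]
    x2[s] = np.sum((counts - n * p_hat) ** 2 / (n * p_hat))

# Chernoff-Lehmann: the simulated mean should land between the chi^2_{k-2}
# mean (3) and the chi^2_{k-1} mean (4), and neither reference law fits.
print(x2.mean())
print(np.mean(x2 > stats.chi2.ppf(0.95, k - 2)),
      np.mean(x2 > stats.chi2.ppf(0.95, k - 1)))
```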
There are several ways of going the latter route. They basically amount to premultiplying $\mathbf Y_n$ by the "right" matrix $\mathbf B(\hat{\lambda})$ so that the quadratic form
$$
\mathbf Y_n^T \mathbf B^T \mathbf B \mathbf Y_n
$$
again converges in distribution to a chi-square law, where $k$ is the number of cells. Examples are the Rao–Robson–Nikulin statistic, which recovers the full $\chi^2_{k-1}$ limit, and the Dzhaparidze–Nikulin statistic, which has a $\chi^2_{k-m-1}$ limit but works with any $\sqrt{n}$-consistent estimator of $\lambda$.
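As a sketch, here is the Dzhaparidze–Nikulin statistic for the exponential setup above, assuming its usual form: project $\mathbf Y_n$ onto the orthogonal complement of the columns of $\mathbf W$, where $W_{ij} = (\partial p_i/\partial\theta_j)/\sqrt{p_i}$ evaluated at the estimate.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, lam = 1000, 1.0
edges = np.array([0.0, 0.5, 1.0, 1.5, 2.5, np.inf])
k = len(edges) - 1

z = rng.exponential(1 / lam, size=n)
lam_hat = 1 / z.mean()               # any sqrt(n)-consistent estimator will do

p = np.diff(1 - np.exp(-lam_hat * edges))
# dp_i/dlam = c_i e^{-lam c_i} - c_{i-1} e^{-lam c_{i-1}}, with c e^{-lam c} -> 0
# as c -> inf (handled by appending 0 for the infinite edge).
g = np.append(edges[:-1] * np.exp(-lam_hat * edges[:-1]), 0.0)
dp = g[1:] - g[:-1]

counts = np.histogram(z, bins=edges)[0]
y = (counts - n * p) / np.sqrt(n * p)
w = dp / np.sqrt(p)                  # here m = 1, so W is a single column

dn = y @ y - (y @ w) ** 2 / (w @ w)  # Y'(I - W(W'W)^{-1}W')Y
print(dn, stats.chi2.sf(dn, df=k - 1 - 1))  # compare to chi^2_{k-m-1}
```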
Level 4. Random cells. In the case of random cells, under certain regularity conditions, we end up in the same situation as in Level 3 if we take the route of modifying the Pearson chi-square statistic. Location-scale families, in particular, behave very nicely. One common approach is to take our $k$ cells each to have nominal probability $1/k$, so that the random cells are intervals of the form $\hat{I}_j = \hat \mu + \hat\sigma I_{0,j}$, where $I_{0,j} = [F^{-1}((j-1)/k), F^{-1}(j/k))$ and $F$ is the standard member of the family. This result has been further extended to the case where the number of random cells grows with the sample size (Drost, 1989).
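Here is a sketch of the equiprobable random-cell construction for a normal (location-scale) model; all numbers are hypothetical. Note that the raw statistic printed at the end still has the Level 3 problem, so in practice one would apply a Rao–Robson–Nikulin-type correction to $\mathbf Y_n$ at that point.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, k = 500, 8
z = rng.normal(loc=2.0, scale=3.0, size=n)

# Random cell boundaries: mu_hat + sigma_hat * F^{-1}(j/k) for the standard
# normal quantile function F^{-1}.
mu_hat, sigma_hat = z.mean(), z.std(ddof=1)
inner = stats.norm.ppf(np.arange(1, k) / k)          # F^{-1}(j/k), j = 1..k-1
edges = np.concatenate(([-np.inf], mu_hat + sigma_hat * inner, [np.inf]))

counts = np.histogram(z, bins=edges)[0]              # nominal probability 1/k per cell
y = (counts - n / k) / np.sqrt(n / k)
# ||y||^2 is the uncorrected Pearson statistic with random cells.
print(np.sum(y**2))
```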
References
A. W. van der Vaart (1998), Asymptotic Statistics, Cambridge University Press. Chapter 17: Chi-Square Tests.
H. Chernoff and E. L. Lehmann (1954), The use of maximum likelihood estimates in $\chi^2$ tests for goodness of fit, Ann. Math. Statist., vol. 25, no. 3, 579–586.
F. C. Drost (1989), Generalized chi-square goodness-of-fit tests for location-scale models when the number of classes tends to infinity, Ann. Statist., vol. 17, no. 3, 1285–1300.
M. S. Nikulin (1973), Chi-square test for continuous distributions with shift and scale parameters, Theory of Probability and Its Applications, vol. 19, no. 3, 559–568.
K. O. Dzhaparidze and M. S. Nikulin (1973), On a modification of the standard statistics of Pearson, Theory of Probability and Its Applications, vol. 19, no. 4, 851–853.
K. C. Rao and D. S. Robson (1974), A chi-square statistic for goodness-of-fit tests within the exponential family, Comm. Statist., vol. 3, no. 12, 1139–1153.
N. Balakrishnan, V. Voinov and M. S. Nikulin (2013), Chi-Squared Goodness of Fit Tests With Applications, Academic Press.
I'm going to venture an answer to my own question after some googling. One simple approach is to use binned Poisson maximum likelihood ratios. See pp. 94–96 of these notes:
http://www.hep.phy.cam.ac.uk/~thomson/lectures/statistics/FittingHandout.pdf
The likelihood ratio converges to a $\chi^2$ distribution in the large-count limit. If you're dealing with very few counts, you should run Monte Carlo simulations to determine the empirical distribution of the likelihood ratio under the hypothesis that your histogram really does represent a collection of samples from the model distribution. You can then read a $p$-value off this simulated likelihood-ratio distribution; the $\chi^2$ test is just a fast analytical approximation to it, applicable in the large-count limit.
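Here's a minimal sketch of that recipe, with hypothetical model expectations. It uses one standard form of the binned Poisson likelihood ratio against the saturated model, $2\sum_i [\nu_i - m_i + m_i\ln(m_i/\nu_i)]$, and takes the $p$-value from Monte Carlo rather than the asymptotic $\chi^2$.

```python
import numpy as np

# Binned Poisson likelihood ratio: observed counts m_i, model expectations nu_i.
def poisson_lr(m, nu):
    m, nu = np.asarray(m, float), np.asarray(nu, float)
    safe = np.where(m > 0, m, 1.0)                   # avoid log(0); 0*log(0) := 0
    term = np.where(m > 0, m * np.log(safe / nu), 0.0)
    return 2 * np.sum(nu - m + term)

rng = np.random.default_rng(5)
nu = np.array([2.1, 5.4, 9.8, 7.3, 3.0, 0.9])  # hypothetical model expectations
obs = rng.poisson(nu)                          # pretend these are the data

t_obs = poisson_lr(obs, nu)
# Low counts: get the null distribution by simulation instead of trusting chi^2.
sims = np.array([poisson_lr(rng.poisson(nu), nu) for _ in range(10000)])
print(t_obs, np.mean(sims >= t_obs))           # Monte Carlo p-value
```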
All of this is nothing more than following the "perennial philosophy" of frequentist statistics:
If you think something interesting happened, you should find out how often you'd expect that thing to happen by chance before you go proclaiming to the world that it's interesting.
If the $p$-value shows that random effects are almost surely not responsible for the difference between your model and your observation, then your model is probably wrong.
Best Answer
The biggest problem is that an averaged shifted histogram has positive dependence between adjacent bins, so a test derived under an independence assumption (aside from the negative dependence induced by conditioning on the total count, which is adjusted for) won't have the right distribution for its test statistic.
It's possible to adapt a test for such dependence, but the vanilla version of the test will be wrong.
[If you want to test for normality, doing it from a histogram isn't a particularly good way to do it. A Shapiro-Wilk or Shapiro-Francia test, an Anderson-Darling test, or perhaps a smooth test of the kind discussed in Rayner and Best's book Smooth Tests of Goodness of Fit would be better. The nice thing about a Shapiro-Francia test is that it's just based on the correlation in a normal scores plot (a Q-Q plot for normality), which also gives a visual assessment of the non-normality.]
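To illustrate the Shapiro-Francia idea, here's a sketch only (the Blom-type plotting positions are an approximation to the expected normal order statistics, and in practice you'd use a packaged implementation with proper critical values): the statistic is essentially the squared correlation between the sorted data and the expected normal scores.

```python
import numpy as np
from scipy import stats

# Shapiro-Francia-style statistic: squared correlation between the order
# statistics and approximate expected normal scores (Blom plotting positions).
def shapiro_francia_W(x):
    x = np.sort(np.asarray(x, float))
    n = len(x)
    m = stats.norm.ppf((np.arange(1, n + 1) - 0.375) / (n + 0.25))
    return np.corrcoef(x, m)[0, 1] ** 2

rng = np.random.default_rng(6)
print(shapiro_francia_W(rng.normal(size=200)))       # close to 1 for normal data
print(shapiro_francia_W(rng.exponential(size=200)))  # visibly smaller
```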
--
Edit - looking at your Q-Q plot - the data are very far from normal. No reasonable test would fail to reject normality at that sample size. A Lilliefors test, an Anderson-Darling test, a Shapiro-Wilk test, or a smooth test with a standard number of terms ($k=4$ or $k=6$) will all reject easily... you don't even need to test that.