Calculating Uncertainties in Histogram Bins – Methods for Experimental Data with Known Measurement Errors

I have a set of experimental data (with each data-point having its own measured uncertainty), and I wish to produce a histogram of it. The x values of the edges of each bin are already defined. The trick is that I need to have uncertainties for the value of each bin, since I am then going to fit a model-histogram to it. (The model is of a physical process, the outcome of which is best described by a histogram. The model will be fit using a non-linear least squares algorithm, and I want to weight each bin based on its uncertainty).

The uncertainties of each histogram bin need to depend on both the known uncertainties associated with each data-point within the bin, and also the number of data-points within the bin. This is where I am stuck – how can I calculate this?

Best Answer

It sounds like you want to calculate a standard error for the unobserved count in each bin (i.e. the count of the underlying values without their measurement error).

For each observation ($x_i^\text{obs}$, with associated standard deviation $\sigma_i$) you can calculate the probability that its true value lies in any given bin.

So the number of values actually in some specific bin, say bin $j$, is the sum of a collection of $\text{Bernoulli}(p_i(j))$ random variables, where $p_i(j)$ is the proportion of the area under a normal distribution $N(x_i,\sigma_i^2)$ (writing $x_i$ for $x_i^\text{obs}$) that falls within the boundaries of the $j$-th bin.

If the Bernoulli variables are independent, this would imply that the standard error of the total count in bin $j$ is the square root of the summed Bernoulli variances:

$$\sqrt{\sum_{i=1}^n p_i(j)(1-p_i(j))}$$


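with

$$p_i(j) = \int_{l_j}^{u_j} \frac{1}{\sqrt{2\pi}\sigma_i} e^{-\frac{(x_i-z)^2}{2\sigma_i^2}}\, dz$$

where $l_j$ and $u_j$ represent the lower and upper boundaries of bin $j$, and so $p_i(j)$ may be written as the difference of two normal cdf values, $p_i(j) = \Phi\left(\frac{u_j - x_i}{\sigma_i}\right) - \Phi\left(\frac{l_j - x_i}{\sigma_i}\right)$.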
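The calculation can be sketched in a few lines of Python. This is only a minimal illustration, assuming normally distributed measurement errors and independent observations as described here. The function names `bin_probabilities` and `bin_counts_and_errors` and the sample data are made up for the example. The expected count $\sum_i p_i(j)$ is returned as well, since it is the mean that goes with the per-bin standard error.

```python
from math import sqrt
from statistics import NormalDist


def bin_probabilities(x_obs, sigma, edges):
    """Return p[i][j]: probability that the true value behind
    observation i lies in bin j (difference of two normal cdf values)."""
    probs = []
    for x, s in zip(x_obs, sigma):
        # NormalDist.cdf requires sigma > 0; exact (error-free) points
        # would need separate handling.
        dist = NormalDist(mu=x, sigma=s)
        cdf_at_edges = [dist.cdf(e) for e in edges]
        probs.append(
            [hi - lo for lo, hi in zip(cdf_at_edges[:-1], cdf_at_edges[1:])]
        )
    return probs


def bin_counts_and_errors(x_obs, sigma, edges):
    """Return (expected count, standard error) lists, one entry per bin."""
    probs = bin_probabilities(x_obs, sigma, edges)
    n_bins = len(edges) - 1
    # Mean of a sum of Bernoulli variables: sum of p_i(j).
    expected = [sum(p[j] for p in probs) for j in range(n_bins)]
    # Standard error: square root of the summed Bernoulli variances.
    std_err = [
        sqrt(sum(p[j] * (1.0 - p[j]) for p in probs)) for j in range(n_bins)
    ]
    return expected, std_err


# Hypothetical data: four points with individual uncertainties, two bins.
x_obs = [0.2, 0.9, 1.1, 1.8]
sigma = [0.1, 0.2, 0.2, 0.1]
edges = [0.0, 1.0, 2.0]
expected, std_err = bin_counts_and_errors(x_obs, sigma, edges)
for j, (m, se) in enumerate(zip(expected, std_err)):
    print(f"bin {j}: expected count {m:.3f} +/- {se:.3f}")
```

Points far from any bin edge (relative to their $\sigma_i$) contribute almost nothing to the standard error, while points close to an edge contribute the most, up to a variance of 0.25 each.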

Under the assumption that the different observations' contributions to the count in a given bin are independent, the unobserved "true" count in a given bin would follow a Poisson-binomial distribution, but I don't think we need to use that for anything, and - while we can work out the correlation between bin counts - I don't think we need that if your interest is in the individual per-bin standard errors.
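For completeness, here is a sketch of how the covariance between bin counts would come out under the same independence assumption. The symbols $N_j$ (the true count in bin $j$) and $B_i(j)$ (the indicator that the true value behind observation $i$ lies in bin $j$) are introduced here purely for illustration. A single value cannot lie in two bins at once, so $B_i(j)B_i(k) = 0$ for $j \neq k$, which gives

$$\operatorname{Cov}(N_j, N_k) = \sum_{i=1}^n \operatorname{Cov}\big(B_i(j), B_i(k)\big) = -\sum_{i=1}^n p_i(j)\,p_i(k), \qquad j \neq k$$

This is never positive, and it is only appreciable for bins that share observations with non-negligible probability in both, which in practice usually means neighbouring bins.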