Solved – Error Bars for Histogram with Uncertain Data

density-estimation, error, histogram, uncertainty

Context

I have a set of data points $\{x_1, \dots, x_N \}$ along with their respective measurement uncertainties $\{\epsilon_1, \dots, \epsilon_N\}$ ($N \approx 100$). These data are essentially measured distances to occurrences of some astrophysical process, and I am trying to estimate the spatial distribution of these events without assuming any model (because I really don't have a reasonable model). To do that, I built a histogram of my data with bins of equal size $\{B_0, \dots, B_M\}$, and now I also want to put error bars on my histogram that take the measurement uncertainties into account. But after looking around for how to do this, I only got more confused.

(I don't have much experience with statistics, so the real problems may just be my lack of understanding in statistics.)

Histogram with no measurement uncertainty

First of all, I found that I don't seem to even understand what these error bars are supposed to mean. Let's first ignore the $\epsilon_i$'s and compute the error of a histogram of "perfect data". I have come across the following calculation in several different places:

Denote the number of data points that fall in bin $B_k$ by $N_k$. We estimate the probability of falling in this bin as $p_k = \frac{N_k}{N}$. Since $N_k$ can be thought of as a sum of $N$ independent Bernoulli variables $\mathrm{Ber}(p_k)$, its variance is $\sigma^2[N_k] = Np_k(1-p_k) = N_k(1-\frac{N_k}{N})$. When $N_k \ll N$ we can neglect the $\frac{N_k}{N}$ term, and we get the error bar $\sigma_k = \sqrt{N_k}$.
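This binomial picture is easy to check by simulation. The sketch below uses a toy uniform distribution and made-up bin edges (none of it comes from the question's data): draw many samples of size $N$, histogram each one, and compare the observed spread of each bin count with $\sqrt{Np_k(1-p_k)}$ and with the $\sqrt{N_k}$ approximation.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100                               # sample size, as in the question
edges = np.linspace(0.0, 1.0, 11)     # 10 equal-width toy bins

# Histogram many independent samples from the same (uniform) distribution
# to see how each bin count N_k fluctuates from sample to sample.
counts = np.array([np.histogram(rng.random(N), bins=edges)[0]
                   for _ in range(5000)])

empirical_sd = counts.std(axis=0)            # observed spread of N_k
p_k = counts.mean(axis=0) / N                # close to the true p_k = 0.1
binomial_sd = np.sqrt(N * p_k * (1 - p_k))   # Binomial(N, p_k) std
poisson_sd = np.sqrt(counts.mean(axis=0))    # the sqrt(N_k) approximation

print(empirical_sd.round(2))  # tracks binomial_sd; poisson_sd is a bit larger
```

Here $p_k = 0.1$ is small, so $\sqrt{N_k}$ is only slightly wider than the binomial value; for a bin holding most of the data the two would differ substantially.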

But I don't understand:

  1. I often see this referred to as "Poisson noise", but I am not sure I see where the underlying Poisson process generating this noise is.

  2. This also suggests that bins with zero counts have no error, which doesn't sound right to me. Indeed, I have come across this article discussing exactly what is wrong with assigning a Poisson error bar $\sigma_k = \sqrt{N_k}$. In particular, the author says:

If we observe N, that measurement has NO uncertainty: that is what we saw, with 100% probability. Instead, we should apply a paradigm shift, and insist that the uncertainty should be drawn around the model curve we want to compare our data points to, and not around the data points!

But that doesn't sound right either. While my measurements are deterministic numbers (ignoring measurement uncertainty), I am trying to estimate a distribution from a finite sample, so there still has to be uncertainty associated with my estimate. So what is the correct way to understand these issues?

  3. It has also been suggested that I use bootstrapping to estimate these error bars, but again I don't quite understand why it should work. If $N_k=0$ for my original data set, then no matter how I resample my data I will always have zero counts in $B_k$, so I am again forced to conclude that $p_k = 0$ with zero uncertainty. Intuitively, then, I don't see how bootstrapping my data can give me any new insight about my distribution estimate. It may just be that I don't understand how resampling methods work in general.

Histogram with measurement uncertainty

Coming back to my original problem: I did find some answers about how to incorporate measurement uncertainties, such as this answer. The method is basically to find the probability $q_i(B_k)$ that the $i$-th data point falls in bin $B_k$, assuming the $i$-th measurement is normally distributed as $\mathcal{N}(x_i, \epsilon_i^2)$:

$$ q_i(B_k) = \int_{B_k} \frac{1}{\sqrt{2\pi}\epsilon_i} e^{-\frac{(x-x_i)^2}{2\epsilon_i^2}} \ dx$$

And then use these $q_i(B_k)$ to construct the Bernoulli variance in $B_k$ as

$$ \sum_{i=1}^{N} q_i(B_k)(1 - q_i(B_k)) $$
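A minimal sketch of this recipe (the function name `smeared_histogram` and the example inputs are invented; the Gaussian integral over a bin is just a difference of normal CDFs at the bin edges):

```python
import numpy as np
from math import erf, sqrt

# Standard-normal CDF, vectorised, using only the standard library's erf.
_norm_cdf = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0))))

def smeared_histogram(x, eps, edges):
    """Expected bin counts and Bernoulli-sum variances when the i-th
    measurement is distributed as N(x[i], eps[i]^2)."""
    x = np.asarray(x, float)[:, None]
    eps = np.asarray(eps, float)[:, None]
    edges = np.asarray(edges, float)
    lo, hi = edges[:-1][None, :], edges[1:][None, :]
    # q[i, k] = probability that measurement i falls in bin k
    q = _norm_cdf((hi - x) / eps) - _norm_cdf((lo - x) / eps)
    expected = q.sum(axis=0)                 # sum_i q_i(B_k)
    variance = (q * (1.0 - q)).sum(axis=0)   # sum_i q_i(B_k)(1 - q_i(B_k))
    return expected, variance

# One measurement at 0 with unit error, split evenly across two wide bins:
exp_counts, var = smeared_histogram([0.0], [1.0], [-10.0, 0.0, 10.0])
```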

But my question is: where does the "Poisson noise" go in this method? The bin count $N_k$ doesn't even show up anymore, and this makes me feel like something is missing. Or maybe I have overlooked something.

So I guess what I really want is a complete treatment of error estimation for histograms, which I couldn't find anywhere.

Best Answer

I thought about it some more, and I have a couple of ideas.

(1) About measurement uncertainty: from what you said, it's big enough to take into account. I agree with the formula for qi -- it is just the mass of the distribution for x[i] which falls into B[k]. From that, it looks to me that the mean proportion of x which falls into B[k] (let's call it q(B[k])) is the average of those bits over all the data, i.e., q(B[k]) = (1/N)*sum(qi, i, 1, N). Then the height of histogram bar k is q(B[k]), and, treating it as a binomial-type proportion, its variance is q(B[k])*(1 - q(B[k]))/N.

So I disagree about the variance -- I think the summation over i should be inside q in variance = q*(1 - q), not outside.

It occurs to me that you'll want to ensure that the q(B[k]) sum to 1 -- maybe that's guaranteed by construction. (It is, as long as the bins cover essentially all the mass of every measurement distribution; Gaussian tails outside the outermost bin edges will leak a little.) In any event you'll want to verify that. EDIT: Also, as the measurement errors become smaller and smaller, you should find that the q(B[k]) converge to the simple n[k]/sum(n[k]) estimate.
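That small-error limit is easy to check numerically. The sketch below uses synthetic points deliberately placed at least 0.1 away from every bin edge (so the limit is unambiguous), and builds the normal CDF from the standard library's erf:

```python
import numpy as np
from math import erf, sqrt

_norm_cdf = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0))))

rng = np.random.default_rng(0)
edges = np.linspace(0.0, 10.0, 11)       # 10 unit-width bins
# 10 synthetic points per bin, kept away from the bin edges
x = np.concatenate([rng.uniform(k + 0.1, k + 0.9, 10) for k in range(10)])

def smeared_proportions(eps):
    lo, hi = edges[:-1][None, :], edges[1:][None, :]
    q = _norm_cdf((hi - x[:, None]) / eps) - _norm_cdf((lo - x[:, None]) / eps)
    return q.sum(axis=0) / x.size        # q(B[k]) as a proportion

hard = np.histogram(x, bins=edges)[0] / x.size   # n[k] / N
for eps in (1.0, 0.1, 1e-3):
    gap = np.abs(smeared_proportions(eps) - hard).max()
    print(f"eps = {eps}: max |q(B[k]) - n[k]/N| = {gap:.2e}")
```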

(2) About prior information about nonempty bins: I recall that adding a fixed number to the numerator and denominator of n[k]/n, i.e., (n[k] + m[k])/(n + sum(m[k])), is equivalent to assuming a prior over the bin proportion, with the prior mean being m[k]/sum(m[k]). As you can see, the larger m[k], the stronger the influence of the prior. (This business about prior counts is equivalent to assuming a conjugate prior for the bin proportion -- "conjugate prior beta binomial" is a topic you can look up; for all the bins jointly it is the Dirichlet-multinomial model.)
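A minimal sketch of that pseudo-count adjustment, with made-up bin counts (m = 1 per bin is the "add-one", uniform-prior choice):

```python
import numpy as np

counts = np.array([0, 3, 12, 30, 35, 15, 5, 0, 0, 0])  # hypothetical n[k]
N = counts.sum()
m = 1.0                     # pseudo-count per bin (uniform prior mean)

p_mle = counts / N                             # empty bins get exactly 0
p_post = (counts + m) / (N + m * counts.size)  # posterior-mean proportions

# Every bin is now strictly positive, and the proportions still sum to 1;
# increasing m pulls all bins harder toward the uniform prior mean 1/M.
```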

Since q(B[k]) is not just a proportion of counts, it's not immediately clear to me how to incorporate the prior count. Maybe you need (q(B[k]) + m[k])/Z where Z is whatever makes the adjusted proportions sum to 1.

However, I don't know how hard you should try to fix up the bin proportions. You were saying you don't have enough prior information to pick a parametric distribution -- if so, maybe you also don't have enough to make assumptions about bin proportions. That's a kind of higher-level question you can consider.

Good luck and have fun, it seems like an interesting problem.