Solved – Draw a histogram with normal distribution overlay


I was asked to draw a histogram of our data with a normal distribution overlay. I'm new to statistics and need help with this. Our data is an array of floating-point values, and the histogram should show their distribution. I wrote a small piece of code that does the following:

  1. Split all my values into buckets
  2. Find all values that happen to be inside each bucket
  3. Count the items in each bucket and divide that count by the total number of items and by the width of the column
  4. Show what I have calculated in (3) as histogram
  5. Calculate $\mu$ as $\text{avg}(\text{values})$
  6. Calculate $\sigma^2$ as $\text{avg}([(\text{each value} - \mu)^2])$
  7. Draw overlay with formula:
    $$\dfrac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\dfrac{(x - \mu)^2}{2\sigma^2}}$$
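
For anyone following along, the steps above can be sketched in Python roughly as follows. This is only a minimal sketch, not the code from the question: the function names (`histogram_density`, `normal_params`, `normal_pdf`) are made up for illustration, equal-width buckets are assumed, and the variance is the plain average of squared deviations, as in step 6.

```python
import math

def histogram_density(values, num_bins):
    """Steps 1-3: equal-width buckets, heights scaled so the total area is 1.

    Hypothetical helper; assumes the values are not all identical.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are identical; bucket width would be 0")
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp the index so the maximum value falls into the last bucket.
        idx = min(int((v - lo) / width), num_bins - 1)
        counts[idx] += 1
    n = len(values)
    # Divide by the number of items AND by the bucket width to get a density.
    densities = [c / (n * width) for c in counts]
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, densities

def normal_params(values):
    """Steps 5-6: mean and variance computed from the raw values."""
    mu = sum(values) / len(values)
    var = sum((v - mu) ** 2 for v in values) / len(values)
    return mu, var

def normal_pdf(x, mu, var):
    """Step 7: the normal density used for the overlay."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)
```

Because both the histogram heights and `normal_pdf` are densities, they share the same vertical scale and can be drawn on the same axes without further rescaling.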

But my result looks weird:
[Image: my result]

I expected the normal distribution overlay to be taller, so I have probably calculated something incorrectly. Am I right that it looks wrong, and if so, where is my mistake?

BTW, I'm calculating $\mu$ and $\sigma^2$ from my original values, not from their counts in the buckets. Is that correct?

Update: In case anybody tries to use the algorithm I described here:
I found a mistake. When I computed the height of each column in the histogram, I didn't divide by the width of the column, so I was not computing a density. I have fixed that in the description.

Funnily enough, this particular example is unaffected and remains correct, because its column width is always 1. So I just chose a bad example.
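
To illustrate the point of the update, here is a small sketch comparing column heights with and without the division by the column width. The helper names (`bar_heights`, `total_area`) and the counts are invented for illustration only. With a width of 1 the two versions coincide; with any other width, only the version that divides by the width has total area 1.

```python
def bar_heights(counts, width, divide_by_width):
    """Column heights from bucket counts (hypothetical helper)."""
    n = sum(counts)
    if divide_by_width:
        # Density: the heights times the width sum to 1.
        return [c / (n * width) for c in counts]
    # Relative frequency only: the heights sum to 1, but the area does not.
    return [c / n for c in counts]

def total_area(heights, width):
    """Area under the histogram for equal-width columns."""
    return sum(h * width for h in heights)

counts = [2, 5, 3]  # made-up bucket counts

# Width 1: dividing by the width changes nothing.
same = bar_heights(counts, 1.0, True) == bar_heights(counts, 1.0, False)

# Width 0.5: only the density version has area 1.
area_without = total_area(bar_heights(counts, 0.5, False), 0.5)
area_with = total_area(bar_heights(counts, 0.5, True), 0.5)
```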

Best Answer

That curve looks fine to me, in the sense that it looks like the best possible fit of a normal distribution to your data, though "least bad" fit would probably be a better way of describing it.

I suspect what is going on is that the large bin > 30 increased the variance, thus making the normal curve wider and flatter than your histogram.
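
A rough way to see this effect: the peak of a fitted normal curve is $1/\sqrt{2\pi\sigma^2}$, so anything that inflates the variance lowers the peak. The sketch below uses made-up numbers, not the data from the question, and the function name `fitted_peak_height` is hypothetical.

```python
import math
import statistics

def fitted_peak_height(values):
    """Height of the normal density at its mean, using the population variance."""
    var = statistics.pvariance(values)
    return 1.0 / math.sqrt(2 * math.pi * var)

# Made-up data: a tight cluster around 10 ...
core = [8, 9, 9, 10, 10, 10, 11, 11, 12]
# ... and the same cluster plus a few large values, like a big "> 30" bin.
with_tail = core + [35, 40]

peak_core = fitted_peak_height(core)
peak_with_tail = fitted_peak_height(with_tail)
# peak_with_tail is lower: the extra large values widen and flatten the curve.
```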
