[Math] The proper way to compare a discrete set of data to a continuous probability distribution

probability statistics

I have a set of data points that I am trying to approximate with a few different probability distributions, such as the Gaussian or Student's t distributions, to see which fits best.

The first step is plotting the distribution of the points themselves. As per my last question, I centered the points around zero and normalized the variance to 1. Now I would like to plot these points to compare them to continuous distributions. How does one do this?

I am currently using a histogram method in which I split up the points into $x$ buckets and plot the height of each bucket. Now I realize this doesn't make sense because the resulting distribution of points in the histogram is entirely dependent on the value I choose for $x$.

What is the proper way of doing this? There must be a fundamental concept I'm missing.

Best Answer

It sounds to me like your primary difficulty is in binning the data properly. Please pardon me if this is not your question.

You want the total area of the rectangles comprising the histogram to be 1. If you take rectangle height to be the number of data points in the corresponding bin, then the total area will not be 1. Instead, you will have

$$ \text{area}=(\text{bin width})\times(\text{total number of data points}). $$

The general principle for handling distributions of binned data properly - which even allows you to deal with variable-width bins - is that rectangle area, rather than rectangle height, should be proportional to the number of data points in the bin. For the fixed-width case, the height of each rectangle should be computed as

$$ \text{height}= \frac{\text{number of data points in bin}}{(\text{total number of data points})\times(\text{width of bin})}. $$
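As a rough illustration of the height formula, here is a minimal Python sketch. The function names `density_histogram` and `normal_pdf`, the synthetic Gaussian sample, and the choice of 30 bins are all assumptions made for the example, not part of the question. It assumes the data are not all identical, so the bin width is nonzero.

```python
import math
import random


def density_histogram(data, num_bins):
    """Return (edges, heights) for a fixed-width histogram with total area 1.

    Hypothetical helper: height = count / (n * bin_width).
    Assumes max(data) > min(data).
    """
    lo, hi = min(data), max(data)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for x in data:
        # Clamp the index so the maximum value lands in the last bin.
        idx = min(int((x - lo) / width), num_bins - 1)
        counts[idx] += 1
    n = len(data)
    heights = [c / (n * width) for c in counts]
    edges = [lo + i * width for i in range(num_bins + 1)]
    return edges, heights


def normal_pdf(x):
    """Standard normal density, used here as the candidate distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


# Synthetic stand-in for centered, unit-variance data (an assumption).
random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(5000)]

edges, heights = density_histogram(sample, 30)
bin_width = edges[1] - edges[0]

# Rectangle areas sum to 1, so heights are on the same scale as a pdf.
total_area = sum(h * bin_width for h in heights)

# Compare each bar with the candidate density at the bin center.
for left, h in zip(edges[:-1], heights):
    center = left + bin_width / 2
    print(f"{center:6.2f}  histogram={h:.3f}  normal={normal_pdf(center):.3f}")
```

Because the heights are densities rather than counts, the printed columns can be compared directly, and the bars can be drawn on the same axes as the curve of any candidate distribution.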

If you follow this procedure, then, within reason, the resulting distribution should be relatively insensitive to the number of bins you decide to use. Of course, if you use too many bins, most bins will be empty and you'll get a very spiky distribution, which won't be very illuminating. If you use too few bins, the distribution will be too coarse-grained. The Wikipedia page on histograms describes some commonly used rules, such as Sturges' formula, for deciding how many bins to use.
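For reference, Sturges' formula suggests $\lceil \log_2 n \rceil + 1$ bins for $n$ data points. A small sketch, in which the function name `sturges_bins` is a hypothetical choice for this example:

```python
import math


def sturges_bins(n):
    """Number of bins suggested by Sturges' formula: ceil(log2(n)) + 1."""
    if n < 1:
        raise ValueError("need at least one data point")
    return math.ceil(math.log2(n)) + 1


print(sturges_bins(5000))  # 14
```

This is only a starting point. It is usually worth trying a few bin counts around the suggested value and checking that the overall shape of the histogram stays about the same.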

For testing goodness of fit, you should follow the suggestions in Michael Chernick's answer.
