Data Visualization – Histogram and Distribution Fitting for Datasets with Unequal and Open-Ended Intervals

data visualization, density function, distributions, histogram

I have this income distribution data for various groups:
https://docs.google.com/spreadsheet/ccc?key=0Akwg3n_e05cCdEdtT0VZYU5keW5DVkNoNmpBWmdzeUE

As you can see, I have intervals/bins with varying widths. I also have an open-ended interval and the mean income in that open interval varies a lot between groups.

For pedagogical purposes, I'd like to be able to calculate how many people (and/or what percentage) fall in a given interval, given an equal bin width.

Example:
On an x-axis from 0-2000 with a bin width of 10, I'd like to be able to say how many people are in the 590-600 group.

1) Is this doable?
2) Do I need to fit a distribution to a histogram or how do I do it?
2b) What would you base this histogram / distribution on (mean, median, etc.)? (As mentioned, the mean in the open-ended interval varies a lot between groups, which is something I'd like to take into account, i.e. show truthfully in the graphic.)
3) Can you please explain this in simple mathematical operations based on the data given?

Best Answer

There are lots of possible data sets that could generate these summary bins, so it's impossible to be exact, but you can make reasonable guesses.

One way to get subinterval estimates is to create a function that gives the number of people at each income level. The easiest approach, and perhaps the best (it makes the simplest assumptions), is to connect known points and interpolate between them. You don't really have known points, so I used the points (x = median, y = intervalCount/intervalWidth), one per interval. There's not much difference between the mean and the median in this set, which suggests the data values are pretty well-behaved in each interval.

Once you have such a function, you can integrate it between any two points to get any subinterval counts.

Connected Line Plot

I left out the 0-0 interval because its value is literally off the chart, and the 1000+ interval because it has no real width.
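
The interpolate-then-integrate idea can be sketched in a few lines of plain Python. This is only an illustration: the `bins` table is made up (it is not the spreadsheet data), the names `density_at` and `people_between` are my own, and I assume inclusive integer bins so that each width is hi - lo + 1. Outside the first and last median the density is simply held flat, which is an arbitrary choice.

```python
from bisect import bisect_right

# Hypothetical bins (NOT the data from the question):
# (lo, hi, median income in the bin, number of people in the bin)
bins = [
    (1, 99, 55.0, 990),
    (100, 199, 150.0, 2000),
    (200, 399, 290.0, 3000),
    (400, 599, 480.0, 1600),
    (600, 999, 760.0, 800),
]

# Knots of the connected line: x = median, y = count / width.
# Bins are assumed inclusive, so the width is hi - lo + 1.
xs = [m for _, _, m, _ in bins]
ys = [n / (hi - lo + 1) for lo, hi, _, n in bins]

def density_at(x):
    """People per unit of income at x, linearly interpolated between knots.

    Outside the first/last knot the density is held flat (an assumption).
    """
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x) - 1
    t = (x - xs[i]) / (xs[i + 1] - xs[i])
    return ys[i] + t * (ys[i + 1] - ys[i])

def people_between(a, b):
    """Integrate the piecewise-linear density over [a, b].

    The range is split at every knot inside it, so the trapezoid rule
    is exact on each piece.
    """
    pts = [a] + [x for x in xs if a < x < b] + [b]
    return sum((density_at(p) + density_at(q)) / 2 * (q - p)
               for p, q in zip(pts, pts[1:]))

# Estimated head count in the 590-600 group for the made-up data.
print(people_between(590, 600))
```

Calling `people_between(start, start + 10)` for start = 0, 10, 20, ... gives the equal-width bins asked for in the question. Dividing each result by the total head count turns it into a share. Note that the interpolated counts will not add up exactly to the original interval totals, since the line only passes through one point per interval.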

Since the data is obviously not any traditional distribution, a local smoother is a decent way to smooth it out. Here's a spline smoother:

Spline Smoother Plot

It does better at the tail, but is perhaps too smooth at the beginning.

The 100-119 interval looks high in both populations. It could be due to a propensity for people to round up to 100 when answering the survey.

As far as truth in graphics goes, it is best to just plot the data that you have, which is the intervals. It might be useful to show the means/medians, but they only depart from the middle of the interval for the high ranges, which might be worth separate study.

Income bins

We can try to double our bin count by considering the medians. Theoretically, the median divides each interval into two intervals with equal population (two bars of equal area but possibly different heights). However, the breakdown is not so obvious due to possible ties and fractional medians. Here it is with interval widths of (median - lo) and (hi - median + 1); each full interval width is (hi - lo + 1).

Income Half Bins
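
For what it's worth, the half-bin arithmetic can be written out as below. The function name `half_bins` and the example numbers are invented for illustration (they are not from the spreadsheet), and ties or a median sitting exactly on the lower bound are not handled.

```python
def half_bins(lo, hi, median, count):
    """Split one inclusive bin [lo, hi] at its median into two bars.

    Each bar holds half the people, so the two bars have equal area
    but (usually) different heights. Returns (start, end, height)
    tuples, where height is people per unit of income.
    """
    half = count / 2
    left_width = median - lo          # lower half: lo up to the median
    right_width = hi - median + 1     # upper half: median up to hi, inclusive
    if left_width <= 0 or right_width <= 0:
        raise ValueError("median must lie strictly above lo and at or below hi")
    return [
        (lo, median, half / left_width),
        (median, hi + 1, half / right_width),
    ]

# Made-up bin: 800 people between 600 and 999 with a median of 760.
for start, end, height in half_bins(600, 999, 760, 800):
    print(start, end, round(height, 3))
```

A median below the midpoint of the interval gives a taller left bar, which is what you would expect in the upper ranges where incomes thin out.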