Solved – Calculate median for count data

Tags: median, r

I've got data that looks like this:

            Salary category  Count      Sex Year   Profession
1                   aa 0,00    842        M 2014   Arts (= doctor)
2       ab 0,01 / - 2500,00    454        M 2014   Arts
3    ac 2500,00 / - 5000,00    256        M 2014   Arts
4    ad 5000,00 / - 7500,00    218        M 2014   Arts
5   ae 7500,00 / - 10000,00    222        M 2014   Arts
6  af 10000,00 / - 12500,00    245        M 2014   Arts
7  ag 12500,00 / - 15000,00    266        M 2014   Arts
8  ah 15000,00 / - 17500,00    289        M 2014   Arts
9  ai 17500,00 / - 20000,00    250        M 2014   Arts
10 aj 20000,00 / - 22500,00    268        M 2014   Arts
11 ak 22500,00 / - 25000,00    277        M 2014   Arts
12 al 25000,00 / - 27500,00    344        M 2014   Arts
13 am 27500,00 / - 30000,00    473        M 2014   Arts
14 an 30000,00 / - 32500,00    502        M 2014   Arts
...

So I have a bunch of salary categories, sex, the year in which the salary was reported, and profession (doctor or veterinarian). I'm interested in differences in income. Given the way the data are arranged, it's a bit tricky to just fit a model. I'd like something that looks like this:

 Income ~ (sex + profession + year)^3

One way we came up with to obtain a dependent variable that is easier to work with is to take the midpoint of each interval (each salary category), multiply it by the count, add these values up for each combination of sex, year and profession, and divide by the total number of people in that combination. That way, we get a mean income per category (e.g. male vets in 2003).
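The calculation described above can be sketched as follows. This is only an illustration, written in Python rather than R; the function name `midpoint_means` and the record layout `(group_key, lower, upper, count)` are assumptions made for the example, not part of the original post.

```python
from collections import defaultdict

def midpoint_means(records):
    """Mean income per group, using bin midpoints weighted by counts.

    records: iterable of (group_key, lower, upper, count).
    """
    income = defaultdict(float)  # sum of midpoint * count per group
    people = defaultdict(int)    # head count per group
    for key, lower, upper, count in records:
        income[key] += (lower + upper) / 2 * count
        people[key] += count
    # Skip groups that contain no people to avoid dividing by zero.
    return {key: income[key] / people[key] for key in people if people[key] > 0}

# First rows of the table above. The table is truncated in the post, so the
# result is NOT the real mean for this group; it only shows the mechanics.
group = ("M", 2014, "Arts")
records = [
    (group, 0.00, 0.00, 842),
    (group, 0.01, 2500.00, 454),
    (group, 2500.00, 5000.00, 256),
    (group, 5000.00, 7500.00, 218),
]
print(midpoint_means(records))
```

Keying the totals on the (sex, year, profession) tuple means one pass over the rows yields every category at once.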

Question 1) Is this a valid approach? Are there better ways of going about this?

Second, instead of taking the mean for each category, we were advised to look at median values instead. The data are indeed skewed, so this might be a better approach. But getting a median out of a bunch of count data is pretty complex. Therefore:

Question 2) Suppose we use the midpoint of each salary interval, how would we go about calculating a median value for each category in R? It's straightforward in essence, but I'm having trouble whipping up code to do it automatically for each category.

Best Answer

Question 1) All approximations are 'valid' in some sense and 'invalid' in another sense. Rather than looking at validity, it helps to have an idea of what specific problem this approximation will cause you.

For example, suppose you knew that income was actually uniformly distributed within each group (that is, taking the 256 people in line 3, one earns 2,500, another 2,510, another 2,520, and so on). Then collapsing each group down to the midpoint will only barely affect the overall slope (because the midpoint is roughly equal to the mean), but will dramatically affect the estimate of $R^2$ and slope uncertainty.
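A small sketch of this point, in Python: the 256 incomes are hypothetical values spread evenly over the 2,500 to 5,000 bin (an assumption for illustration, not real data). Collapsing them to the midpoint leaves the mean untouched but removes all within-group variance, which is what inflates $R^2$ and understates the slope uncertainty.

```python
from statistics import mean, pvariance

# Hypothetical group: 256 incomes spread evenly over the 2,500-5,000 bin.
n, lower, upper = 256, 2500.0, 5000.0
step = (upper - lower) / n
incomes = [lower + (i + 0.5) * step for i in range(n)]

# The same group after collapsing every person to the bin midpoint.
collapsed = [(lower + upper) / 2] * n

print("means:    ", mean(incomes), mean(collapsed))
print("variances:", pvariance(incomes), pvariance(collapsed))
```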

However, if you have data that's skewed within each group (suppose the underlying income distribution is exponential, for example), then using the midpoint as a proxy of the mean will overestimate the actual mean for the group, shifting the data rightward. If it's exponential, the overestimation will be roughly the same for each group--and so the intercept is affected, but not the slope. If it's a distribution where the difference between midpoint and mean varies by region, then it could also affect the slope.
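To illustrate the exponential case: for an exponential density with rate $\lambda$ restricted to a bin $[a, a+w)$, the mean is $a + 1/\lambda - w/(e^{\lambda w} - 1)$, so the gap between midpoint and mean depends on the bin width $w$ but not on where the bin starts. The Python sketch below checks this numerically; the rate of 1/20000 is a made-up value, and the function names are hypothetical.

```python
import math

def midpoint_bias(width, rate):
    """Midpoint minus true mean for an exponential density truncated to one bin."""
    mean_offset = 1 / rate - width / math.expm1(rate * width)
    return width / 2 - mean_offset

def numeric_bin_mean(lower, width, rate, steps=20000):
    """Mean of the exponential density restricted to [lower, lower + width),
    computed with a simple midpoint-rule integration as a cross-check."""
    h = width / steps
    xs = [lower + (i + 0.5) * h for i in range(steps)]
    ws = [math.exp(-rate * x) for x in xs]
    return sum(x * w for x, w in zip(xs, ws)) / sum(ws)

rate = 1 / 20000   # assumed rate, for illustration only
width = 2500.0
for lower in (2500.0, 15000.0, 30000.0):
    midpoint = lower + width / 2
    print(lower, midpoint - numeric_bin_mean(lower, width, rate))
print("closed form:", midpoint_bias(width, rate))
```

With equal-width bins the printed gaps agree with one another and with the closed form, which is why only the intercept is shifted in this case.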

The second part of your question is unclear to me: how are you distinguishing between midpoint, mean, and median? It looks to me like you only have access to the first, and if you're estimating the median, you're likely assuming a distribution that you would be better off using directly.
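To make the distinction concrete, here is a minimal Python sketch of two ways a median could be read off binned counts. Both are assumptions about what happens inside a bin: the first reports the midpoint of the bin that contains the middle person, the second interpolates linearly, which amounts to assuming incomes are uniform within that bin. The function name `grouped_median` and the `(lower, upper, count)` layout are hypothetical.

```python
def grouped_median(bins, interpolate=True):
    """Median estimate from binned counts.

    bins: list of (lower, upper, count), sorted by lower bound.
    interpolate=False returns the midpoint of the median bin;
    interpolate=True assumes a uniform spread within that bin.
    """
    total = sum(count for _, _, count in bins)
    half = total / 2
    cumulative = 0
    for lower, upper, count in bins:
        if count > 0 and cumulative + count >= half:
            if not interpolate:
                return (lower + upper) / 2
            return lower + (half - cumulative) / count * (upper - lower)
        cumulative += count
    raise ValueError("bins contain no observations")

# Made-up counts, for illustration only.
bins = [(0, 10, 4), (10, 20, 4), (20, 30, 2)]
print(grouped_median(bins, interpolate=False), grouped_median(bins))
```

For one median per category, the same function can be applied to each group's bins, for example `{key: grouped_median(b) for key, b in bins_by_group.items()}`, where `bins_by_group` is a hypothetical dict keyed by (sex, year, profession).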

Which leads to:

The approach I would try is to come up with some underlying model. Maybe incomes are a mixture of a lognormal and a point mass at \$0. For any parameter vector for that distribution (here a triplet with $\mu$, $\theta$, and $p_0$), we can calculate the probability that a sample from that distribution will have the counts in the table. Find the MLE, and you're done.
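A rough sketch of that likelihood in Python, under stated assumptions: the answer's $\theta$ is taken here to be the lognormal scale parameter (called `sigma` in the code), the point mass $p_0$ is estimated by its closed-form MLE (the share of zero incomes), and a coarse grid search stands in for a proper optimizer. All function names are hypothetical, and the multinomial constant is dropped because it does not depend on the parameters.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def lognormal_cdf(x, mu, sigma):
    if x <= 0:
        return 0.0
    if math.isinf(x):
        return 1.0
    return normal_cdf((math.log(x) - mu) / sigma)

def log_likelihood(zero_count, bins, mu, sigma, p0):
    """Log-likelihood of binned counts under a lognormal plus a point mass at 0.

    bins: list of (lower, upper, count) for positive incomes;
    the last upper bound may be math.inf for an open-ended category.
    """
    ll = zero_count * math.log(p0) if zero_count else 0.0
    for lower, upper, count in bins:
        mass = lognormal_cdf(upper, mu, sigma) - lognormal_cdf(lower, mu, sigma)
        # Floor the probability to avoid log(0) for bins far in the tails.
        ll += count * math.log(max((1.0 - p0) * mass, 1e-300))
    return ll

def fit_by_grid(zero_count, bins, mu_grid, sigma_grid):
    """Return (mu, sigma, p0) maximising the log-likelihood over the grid."""
    total = zero_count + sum(count for _, _, count in bins)
    p0 = zero_count / total  # closed-form MLE of the point mass
    _, mu, sigma = max(
        (log_likelihood(zero_count, bins, m, s, p0), m, s)
        for m in mu_grid
        for s in sigma_grid
    )
    return mu, sigma, p0
```

Fitting the table in the question would need all of its rows, including the ones elided by "...", since the likelihood has to account for every category.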

But it looks like we want to estimate those parameters from the category labels--that is, we expect $\mu$ to depend on sex and year and so on. So then we can either fit a model to the MLE parameters (easy) or do a joint optimization for total likelihood (somewhat harder, but still doable).

Related Solutions

Solved – How to calculate median of distributed data

I first assume you cannot store all the chunks and simply compute the median from all values. If you can, but the values are unordered, I would recommend a selection algorithm to find the median; see Selection algorithm (Wikipedia).
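As an illustration of the selection approach, here is a minimal quickselect sketch in Python. It is one common selection algorithm, not necessarily the variant the answer had in mind, and the function names are chosen for the example.

```python
import random

def quickselect(values, k):
    """Return the k-th smallest element (0-based) without fully sorting."""
    values = list(values)
    while True:
        pivot = random.choice(values)
        below = [v for v in values if v < pivot]
        equal = [v for v in values if v == pivot]
        if k < len(below):
            values = below
        elif k < len(below) + len(equal):
            return pivot
        else:
            # The target lies above the pivot; shift k past the discarded part.
            k -= len(below) + len(equal)
            values = [v for v in values if v > pivot]

def median(values):
    values = list(values)
    n = len(values)
    if n == 0:
        raise ValueError("median of empty data")
    if n % 2:
        return quickselect(values, n // 2)
    # Even length: average the two middle elements.
    return (quickselect(values, n // 2 - 1) + quickselect(values, n // 2)) / 2

print(median([5, 1, 4, 2, 3]), median([4, 1, 3, 2]))
```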

I also assume that the chunks are not ordered relative to one another, i.e. that the smallest elements do not all arrive in one chunk, the next larger ones in the following chunk, and so on. In that case you would only need to find the median within the middle chunk.

I think you're looking for some kind of recursive estimator of the median, but finding a good one is hard. I would recommend keeping a frequency count that you update with each new chunk of data, so that you can read the median off these counts. Depending on the number of distinct values, this might become infeasible in terms of space, but with a suitable data structure it should work in most cases.
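The frequency-count idea could look roughly like this in Python. The class name `ChunkedMedian` is hypothetical, and the sketch assumes the values are hashable and few enough in distinct number that one counter per value fits in memory.

```python
from collections import Counter

class ChunkedMedian:
    """Exact median of data arriving in chunks, via per-value frequency counts."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def update(self, chunk):
        chunk = list(chunk)
        self.counts.update(chunk)
        self.total += len(chunk)

    def _value_at(self, rank):
        """Value at 0-based position `rank` in the sorted data."""
        seen = 0
        for value in sorted(self.counts):
            seen += self.counts[value]
            if seen > rank:
                return value
        raise IndexError("rank out of range")

    def median(self):
        if self.total == 0:
            raise ValueError("no data seen yet")
        mid = self.total // 2
        if self.total % 2:
            return self._value_at(mid)
        return (self._value_at(mid - 1) + self._value_at(mid)) / 2

tracker = ChunkedMedian()
tracker.update([5, 1, 9])
tracker.update([3, 7, 7])
print(tracker.median())
```

Memory grows with the number of distinct values rather than with the number of observations, which is the space trade-off mentioned above.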

Solved – How to calculate the median age of a population

The trick here is understanding how the age categories are grouped. The age group "39" is people aged 39 to less than 40 (i.e. 39 to 39.99999999). If you look at the cumulative total for the 38-year-old group, you will see the value is 16,641,790, as you've correctly noted. The cumulative total for the 39-year-old group, which is really the "39 to < 40" group, is 17,121,202. So there are 16,641,790 individuals younger than 39, and persons 16,641,791 through 17,121,202 are the 39 to less than 40 year olds: 479,412 people in total. By your faulty assumption of an equal distribution of ages, we can split this group into ten parts of $479{,}412 / 10 \approx 47{,}941$ people each and construct the following table, roughly, by repeatedly adding this amount to the cumulative figure for the 38-year-old group:

Age     Begin       End 
39.0    16,641,791  16,689,731
39.1    16,689,732  16,737,672
39.2    16,737,673  16,785,614
39.3    16,785,615  16,833,555
39.4    16,833,556  16,881,496
39.5    16,881,497  16,929,437
39.6    16,929,438  16,977,378
39.7    16,977,379  17,025,320
39.8    17,025,321  17,073,261
39.9    17,073,262  17,121,202

According to these calculations, the median would fall somewhere around 39.7. Of course, birthday months and ages are not uniformly distributed, which accounts for the 0.2 discrepancy we see between this "on the back of a napkin" calculation and the official statistic.
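The table above can be rebuilt with a few lines of Python. This is a sketch under the same even-spread assumption, using the two figures quoted in the answer; the function names are hypothetical. The rank of the median person is not given in the text, so `interpolated_age` takes it as an argument.

```python
below_39 = 16_641_790   # cumulative total through the 38-year-old group
in_group = 479_412      # people aged 39 to less than 40

def tenth_of_year_ranges(cumulative_below, group_count, parts=10):
    """Split a one-year age group into equal parts; return (begin, end) ranks."""
    rows = []
    for k in range(parts):
        begin = cumulative_below + round(k * group_count / parts) + 1
        end = cumulative_below + round((k + 1) * group_count / parts)
        rows.append((begin, end))
    return rows

def interpolated_age(rank, age, cumulative_below, group_count):
    """Age of the person with this 1-based rank, assuming an even spread."""
    return age + (rank - cumulative_below) / group_count

for k, (begin, end) in enumerate(tenth_of_year_ranges(below_39, in_group)):
    print(f"39.{k}  {begin:,}  {end:,}")
```

Given the rank of the middle person, `interpolated_age(rank, 39, below_39, in_group)` returns the interpolated median directly instead of a lookup in the table.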

I hope this helps.
