Solved – Calculate median for count data

medianr

I've got data that looks like this:

            Salary category  Count      Sex Year   Profession
1                   aa 0,00    842        M 2014   Arts (= doctor)
2       ab 0,01 / - 2500,00    454        M 2014   Arts
3    ac 2500,00 / - 5000,00    256        M 2014   Arts
4    ad 5000,00 / - 7500,00    218        M 2014   Arts
5   ae 7500,00 / - 10000,00    222        M 2014   Arts
6  af 10000,00 / - 12500,00    245        M 2014   Arts
7  ag 12500,00 / - 15000,00    266        M 2014   Arts
8  ah 15000,00 / - 17500,00    289        M 2014   Arts
9  ai 17500,00 / - 20000,00    250        M 2014   Arts
10 aj 20000,00 / - 22500,00    268        M 2014   Arts
11 ak 22500,00 / - 25000,00    277        M 2014   Arts
12 al 25000,00 / - 27500,00    344        M 2014   Arts
13 am 27500,00 / - 30000,00    473        M 2014   Arts
14 an 30000,00 / - 32500,00    502        M 2014   Arts
...

So I have a bunch of salary categories, sex, the year in which the salary was reported, and profession (doctor or veterinarian). I'm interested in differences in income. Given the way the data are arranged, it's a bit tricky to just fit a model. I'd like something that looks like this:

 Income ~ (sex + profession + year)^3

One way we thought up of reaching an dependent variable that was easier to work with, was to take the middle point of each interval (each salary category), multiply it by the count, adding all these values up for each category of sex, year and profession, and dividing it by the total number of people in the category. That way, we get a mean income per category (e.g. male vets in 2003).

Question 1) Is this a valid approach? Are there better ways of going about this?

Second, instead of taking the mean for each category, we were advised to look at median values instead. The data are indeed skewed, so this might be a better approach. But getting a median out of a bunch of count data is pretty complex. Therefore:

Question 2) Suppose we use the midpoint of each salary interval, how would we go about calculating a median value for each category in R? It's straightforward in essence, but I'm having trouble whipping up code to do it automatically for each category.

Best Answer

Question 1) All approximations are 'valid' in some sense and 'invalid' in another sense. Rather than looking at validity, it helps to have an idea of what specific problem this approximation will cause you.

For example, suppose you knew that income was actually uniformly distributed within each group (that is, taking the 256 people in line 3, one earns 2,500, another 2,510, another 2,520, and so on). Then collapsing each group down to the midpoint will only barely affect the overall slope (because the midpoint is roughly equal to the mean), but will dramatically affect the estimate of $R^2$ and slope uncertainty.

However, if you have data that's skewed within each group (suppose the underlying income distribution is exponential, for example), then using the midpoint as a proxy of the mean will overestimate the actual mean for the group, shifting the data rightward. If it's exponential, the overestimation will be roughly the same for each group--and so the intercept is affected, but not the slope. If it's a distribution where the difference between midpoint and mean varies by region, then it could also affect the slope.

Your second part of this question seems unclear to me--how are you distinguishing between midpoint, mean, and median? It looks to me like you only have access to the first, and if you're estimating the median, you're likely using a distribution that you're better off using directly.

Which leads to:


The approach I would try is to come up with some underlying model. Maybe incomes are a mixture of a lognormal and a point mass at \$0. For any parameter vector for that distribution (here a triplet with $\mu$, $\theta$, and $p_0$), we can calculate the probability that a sample from that distribution will have the counts in the table. Find the MLE, and you're done.

But it looks like we want to estimate those parameters from the category labels--that is, we expect $\mu$ to depend on sex and year and so on. So then we can either fit a model to the MLE parameters (easy) or do a joint optimization for total likelihood (somewhat harder, but still doable).

Related Question