How to find the median (or an approximate median) of n chunks of values. The values in each chunk can be assumed to be sorted if that helps. You may also assume a histogram exists for each chunk.
Solved – How to calculate median of distributed data
Related Solutions
A weighted median, as defined in https://en.wikipedia.org/wiki/Weighted_median, is easy to calculate. Let your observations be $Y_1, Y_2, \dots, Y_n$ with corresponding weights $w_1, w_2, \dots, w_n$. Order the observations; let the order statistics be $Y_{(1)} \le Y_{(2)} \le \dots \le Y_{(n)}$ with corresponding weights $w_{(1)}, w_{(2)}, \dots, w_{(n)}$. Then form the cumulative weights $$ W(r) = \sum_{i=1}^r w_{(i)} $$ Now find the index $\hat{r}$ such that $$ W(\hat{r}-1)/W(n) \lt 0.5 \le W(\hat{r})/W(n) $$ and take $Y_{(\hat{r})}$ as the weighted median, or interpolate between $Y_{(\hat{r}-1)}$ and $Y_{(\hat{r})}$.
As noted in comments, this is what you do when computing the median from a histogram.
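As a concrete sketch of the procedure above (function name is my own), here it is in Python; for the histogram case, pass the bin midpoints as values and the bin counts as weights:

```python
from typing import Sequence

def weighted_median(values: Sequence[float], weights: Sequence[float]) -> float:
    """Sort the observations, accumulate weights, and return Y_(r_hat),
    the first order statistic whose cumulative weight fraction reaches 0.5."""
    pairs = sorted(zip(values, weights))
    total = sum(w for _, w in pairs)
    cum = 0.0
    for y, w in pairs:
        cum += w
        if cum / total >= 0.5:
            return y
    return pairs[-1][0]  # fallback; unreachable for positive weights
```

This returns $Y_{(\hat{r})}$ itself rather than interpolating; the interpolating variant would blend the two straddling order statistics.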
Question 1) All approximations are 'valid' in some sense and 'invalid' in another sense. Rather than looking at validity, it helps to have an idea of what specific problem this approximation will cause you.
For example, suppose you knew that income was actually uniformly distributed within each group (that is, taking the 256 people in line 3, one earns 2,500, another 2,510, another 2,520, and so on). Then collapsing each group down to the midpoint will only barely affect the overall slope (because the midpoint is roughly equal to the mean), but will dramatically affect the estimate of $R^2$ and slope uncertainty.
However, if the data are skewed within each group (suppose the underlying income distribution is exponential, for example), then using the midpoint as a proxy for the mean will overestimate the actual mean of each group, shifting the data rightward. If it's exponential, the overestimation will be roughly the same for each group--so the intercept is affected, but not the slope. If it's a distribution where the difference between midpoint and mean varies by region, then the slope can be affected as well.
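A quick numerical check of that claim, with a hypothetical exponential income distribution (the rate and bin edges below are made up for illustration): by memorylessness, the within-bin mean of an exponential sits below the bin midpoint by the same amount in every equal-width bin, which is why only the intercept shifts.

```python
import math

def exp_bin_mean(a: float, b: float, lam: float) -> float:
    """E[X | a < X < b] for X ~ Exponential(rate=lam).
    By memorylessness this equals a + E[Y | Y < b - a] with Y ~ Exp(lam)."""
    c = b - a
    return a + 1.0 / lam - c * math.exp(-lam * c) / (1.0 - math.exp(-lam * c))

lam = 1.0 / 30_000  # hypothetical overall mean income of 30,000
for a, b in [(0, 10_000), (10_000, 20_000), (20_000, 30_000)]:
    mid = (a + b) / 2
    m = exp_bin_mean(a, b, lam)
    print(f"[{a}, {b}): midpoint={mid:.0f}  within-bin mean={m:.0f}  gap={mid - m:.0f}")
```

Every bin prints the same positive gap, so replacing means by midpoints shifts all points right by a constant.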
The second part of your question seems unclear to me--how are you distinguishing between midpoint, mean, and median? It looks to me like you only have access to the first, and if you're estimating the median, you're likely assuming a distribution that you would be better off using directly.
Which leads to:
The approach I would try is to come up with some underlying model. Maybe incomes are a mixture of a lognormal and a point mass at \$0. For any parameter vector for that distribution (here a triplet with $\mu$, $\theta$, and $p_0$), we can calculate the probability that a sample from that distribution will have the counts in the table. Find the MLE, and you're done.
But it looks like we want to estimate those parameters from the category labels--that is, we expect $\mu$ to depend on sex and year and so on. So then we can either fit a model to the MLE parameters (easy) or do a joint optimization for total likelihood (somewhat harder, but still doable).
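One way the single-distribution MLE could look, as a sketch with invented counts, parametrizing the lognormal by $(\mu, \sigma)$ (so $\sigma$ plays the role of $\theta$ above) and giving the point mass $p_0$ at \$0 its own multinomial cell:

```python
import numpy as np
from scipy import stats, optimize

# Hypothetical binned data: a count of exact zeros, plus counts falling in
# (edges[i], edges[i+1]] for the lognormal component. All numbers invented.
n_zero = 40
edges = np.array([0.0, 10_000, 20_000, 40_000, 80_000, np.inf])
counts = np.array([55, 130, 160, 90, 25])

def neg_log_lik(params):
    """Multinomial negative log-likelihood of the binned counts under a
    mixture: point mass p0 at zero, lognormal(mu, sigma) otherwise."""
    mu, log_sigma, logit_p0 = params              # unconstrained parametrization
    sigma = np.exp(log_sigma)
    p0 = 1.0 / (1.0 + np.exp(-logit_p0))
    cdf = stats.lognorm.cdf(edges, s=sigma, scale=np.exp(mu))
    bin_probs = (1.0 - p0) * np.diff(cdf)         # lognormal mass per bin
    return -(n_zero * np.log(p0) + np.sum(counts * np.log(bin_probs)))

res = optimize.minimize(neg_log_lik, x0=[np.log(25_000), 0.0, 0.0],
                        method="Nelder-Mead")
mu_hat = res.x[0]
sigma_hat = np.exp(res.x[1])
p0_hat = 1.0 / (1.0 + np.exp(-res.x[2]))
```

Because the likelihood factorizes, the MLE of $p_0$ is just the fraction of exact zeros; fitting per-category parameters would repeat this per cell or fold the covariates into $\mu$.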
Best Answer
First, I assume you do not have the option of storing all chunks and simply computing the median over all values at once. If you do, but the values are unordered, I would recommend a selection algorithm to find the median; see Selection algorithm (Wikipedia).
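A minimal sketch of that selection-algorithm route (quickselect, expected linear time; function names are my own):

```python
import random

def quickselect(a, k):
    """Return the k-th smallest (0-based) element of a, expected O(n):
    pick a random pivot, partition, and recurse into the side holding rank k."""
    a = list(a)
    while True:
        pivot = random.choice(a)
        lo = [x for x in a if x < pivot]
        hi = [x for x in a if x > pivot]
        n_eq = len(a) - len(lo) - len(hi)
        if k < len(lo):
            a = lo
        elif k < len(lo) + n_eq:
            return pivot
        else:
            k -= len(lo) + n_eq
            a = hi

def median_unordered(values):
    """Median of an unordered collection via selection, no full sort."""
    n = len(values)
    if n % 2:
        return quickselect(values, n // 2)
    return (quickselect(values, n // 2 - 1) + quickselect(values, n // 2)) / 2
```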
I also assume the chunks don't arrive in some sequential order, with the smallest elements in one chunk, the larger ones in the next, and so on. In that case you would only need to find the median within the middle chunk.
I think you're looking for some kind of recursive estimator of the median, but finding a good estimator of that kind is hard. I would instead recommend maintaining some kind of frequency count that you update with each new chunk of data, which lets you recover the median from the counts at any time. Depending on the number of possible values this may become infeasible in terms of space, but with a suitable data structure it should work for most cases.
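A minimal sketch of that frequency-count idea, assuming the values come from a modest discrete range (e.g. integers or rounded readings); class and method names are my own:

```python
from collections import Counter

class StreamingMedian:
    """Accumulate value frequencies chunk by chunk; query the median on demand.
    Exact for discrete values; memory grows with the number of DISTINCT values,
    not with the total count."""

    def __init__(self):
        self.counts = Counter()
        self.n = 0

    def add_chunk(self, chunk):
        self.counts.update(chunk)
        self.n += len(chunk)

    def _kth(self, k):
        """0-based k-th smallest value, by walking the sorted counts."""
        cum = 0
        for value in sorted(self.counts):
            cum += self.counts[value]
            if cum > k:
                return value

    def median(self):
        if self.n % 2:
            return self._kth(self.n // 2)
        return (self._kth(self.n // 2 - 1) + self._kth(self.n // 2)) / 2
```

For continuous or very high-cardinality data, the same structure works on rounded or binned values, at which point it reduces to the histogram/weighted-median approach above.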