Solved – How to calculate median of distributed data

median, quantiles

How can I find the median (or an approximate median) of n chunks of values? The values in each chunk can be assumed to be sorted if that helps. In addition, a histogram may be available for each chunk.

Best Answer

I first assume that you cannot store all the chunks and simply compute the median from all the values. If you can, but the values are unordered, I would recommend a selection algorithm to find the median; see Selection algorithm (Wikipedia).
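For the case where everything does fit in memory, here is a minimal sketch of a selection algorithm (quickselect with a random pivot). The function names `quickselect` and `median_of_chunks` are my own for illustration, and I assume the usual convention of averaging the two middle values when the total count is even.

```python
import random


def quickselect(values, k):
    """Return the k-th smallest element (0-based) without fully sorting."""
    items = list(values)
    while True:
        pivot = random.choice(items)
        lows = [x for x in items if x < pivot]
        pivots = [x for x in items if x == pivot]
        if k < len(lows):
            # The target lies among the elements smaller than the pivot.
            items = lows
        elif k < len(lows) + len(pivots):
            # The target is the pivot itself (or a duplicate of it).
            return pivot
        else:
            # Skip everything <= pivot and continue in the larger elements.
            k -= len(lows) + len(pivots)
            items = [x for x in items if x > pivot]


def median_of_chunks(chunks):
    """Median of all values pooled from the chunks (hypothetical helper)."""
    pooled = [x for chunk in chunks for x in chunk]
    n = len(pooled)
    if n == 0:
        raise ValueError("no values to take a median of")
    if n % 2 == 1:
        return quickselect(pooled, n // 2)
    # Even count: average the two middle order statistics.
    return (quickselect(pooled, n // 2 - 1) + quickselect(pooled, n // 2)) / 2
```

Quickselect runs in linear time on average, so it avoids the cost of sorting the pooled data, but it still needs all values at once.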

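I also assume that the chunks are not arranged sequentially, with the smallest elements in one chunk, the next larger ones in the following chunk, and so on. In that case you would only need to look for the median in the middle chunk (more precisely, in the chunk that contains the overall middle rank, which is the middle chunk when the chunks are of equal size).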
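To illustrate that sequential case, here is a rough sketch that only uses the chunk sizes to locate the chunk holding the middle rank. It assumes each chunk is sorted and that every value in one chunk is no larger than every value in the next; the name `median_from_ordered_chunks` is hypothetical.

```python
def median_from_ordered_chunks(chunks):
    """Median when chunks are sorted and arranged in increasing order.

    Assumption: max(chunks[i]) <= min(chunks[i + 1]) for all i.
    """
    total = sum(len(chunk) for chunk in chunks)
    if total == 0:
        raise ValueError("no values to take a median of")

    def value_at(rank):
        # Walk over the chunks, subtracting sizes until the rank falls inside one.
        for chunk in chunks:
            if rank < len(chunk):
                return chunk[rank]
            rank -= len(chunk)

    # For an odd total both ranks coincide; for an even total they are
    # the two middle values, which are averaged.
    lower = value_at((total - 1) // 2)
    upper = value_at(total // 2)
    return (lower + upper) / 2
```

With chunks of unequal size the middle rank need not fall in the middle chunk, which is why the sketch counts sizes instead of picking the chunk by position.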

I think you're looking for some kind of recursive estimator of the median, but finding a good one is hard. I would recommend keeping a frequency count that you update with each new chunk of data, so that you can read the median off the counts. Depending on the number of distinct values, this might become infeasible in terms of space, but with a suitable data structure it should work in most cases.
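A minimal sketch of the frequency-count idea, assuming the values are hashable and that the number of distinct values is small enough to keep in memory. The class name `RunningMedian` and its methods are hypothetical, chosen only for this example.

```python
from collections import Counter


class RunningMedian:
    """Keeps one count per distinct value; chunks can be discarded after update()."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def update(self, chunk):
        # Fold one chunk into the running frequency counts.
        chunk = list(chunk)
        self.counts.update(chunk)
        self.total += len(chunk)

    def _value_at(self, rank):
        # Accumulate counts in increasing value order until the rank is covered.
        seen = 0
        for value in sorted(self.counts):
            seen += self.counts[value]
            if rank < seen:
                return value

    def median(self):
        if self.total == 0:
            raise ValueError("no values seen yet")
        # Average the two middle values; they coincide when the total is odd.
        lower = self._value_at((self.total - 1) // 2)
        upper = self._value_at(self.total // 2)
        return (lower + upper) / 2
```

Usage would be one `update` call per incoming chunk followed by `median()` whenever an answer is needed. The result is exact as long as every distinct value gets its own count. If the values are continuous, one option (my assumption, not something stated above) is to count bins instead of raw values, which bounds the memory but makes the median approximate to within a bin width.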