Are there techniques to merge two cumulative distribution functions?

cumulative distribution function, quantiles

I'm trying to estimate quantiles from cumulative distribution functions. Given N CDFs, are there techniques that can be used to merge them into a single CDF from which a quantile can be estimated?

Thanks everyone for looking at the question. Here is a more concrete example: imagine a web service served by N server instances. Each server constructs an empirical CDF of its request response times. I would now like to aggregate or merge these CDFs to find, say, the 95th percentile response time of the whole web service. The only data available from each server is its empirical CDF and the number of samples used to construct it.

Best Answer

If I understand you correctly, your problem can be solved with a quantile-estimation data structure: keep one per server instance, and choose a structure that can be 'merged' (or 'reduced') across all N instances to produce aggregate estimates for the entire cluster.

I believe the TDigest data structure should be able to help you: https://github.com/tdunning/t-digest

Here is the description:

A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means. The t-digest algorithm is also very parallel friendly making it useful in map-reduce and parallel streaming applications.

This way, you can calculate estimates for each server and also combine the per-server digests into estimates for the entire cluster, without reprocessing all of the cluster's raw data.

More details on the underlying technique are available at the link. As far as I know, t-digest uses a variant of one-dimensional k-means clustering and is related to the q-digest algorithm.
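Separately from t-digest, if all you have is each server's empirical CDF and its sample count (as in the question), the pooled CDF is simply the sample-weighted average of the individual CDFs: F(x) = sum_i n_i * F_i(x) / sum_i n_i. Below is a minimal sketch of that idea in plain Python. It is not the t-digest algorithm, and the function names (ecdf_value, merge_ecdfs, quantile) and the (x, F(x)) list representation are hypothetical choices made for illustration. It assumes each CDF is a right-continuous step function with a jump at every stored point; if the CDFs are binned, the resulting quantile is only accurate up to the bin resolution.

```python
from bisect import bisect_right

def ecdf_value(points, x):
    """Evaluate a step CDF given as sorted (x, F(x)) pairs (hypothetical format)."""
    xs = [p[0] for p in points]
    i = bisect_right(xs, x)          # number of stored points with value <= x
    return points[i - 1][1] if i > 0 else 0.0

def merge_ecdfs(cdfs):
    """Merge a list of (points, n_samples) into one CDF, weighting by n_samples."""
    total = sum(n for _, n in cdfs)
    # The merged CDF can only jump where one of the inputs jumps.
    grid = sorted({x for points, _ in cdfs for x, _ in points})
    return [
        (x, sum(n * ecdf_value(points, x) for points, n in cdfs) / total)
        for x in grid
    ]

def quantile(points, q):
    """Return the smallest stored x with F(x) >= q."""
    for x, f in points:
        if f >= q - 1e-12:           # small tolerance for floating-point error
            return x
    return points[-1][0]

# Illustrative data only: two servers with 4 and 2 response-time samples (ms).
server_a = ([(100, 0.25), (120, 0.5), (150, 0.75), (200, 1.0)], 4)
server_b = ([(110, 0.5), (300, 1.0)], 2)

merged = merge_ecdfs([server_a, server_b])
p95 = quantile(merged, 0.95)
```

With full empirical CDFs this reproduces the CDF of the pooled samples exactly, so no accuracy is lost in the merge itself. A sketch structure such as t-digest trades some accuracy for a bounded memory footprint per server.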
