Solved – Empirical PDF from Empirical CDF

Tags: cumulative-distribution-function, density-function, empirical-likelihood, histogram

Suppose I do an experiment $N$ times and get a vector $X$ of results. Let $C_X(y)$ be the empirical cumulative distribution function of $X$. Suppose $X$ is sorted so that $x_1 \leq x_2 \leq \cdots \leq x_N$. Approximately,
$$C_X(y)=0 \textrm{ if } y < x_1$$
$$C_X(y)=1 \textrm{ if } y \geq x_N$$
$$C_X(y)=\frac{i+\frac{y-x_i}{x_{i+1}-x_i}}{N} \textrm{ if } x_i \leq y \leq x_{i+1} $$
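For concreteness, here is a small Python sketch of this interpolated empirical CDF. The function name `empirical_cdf`, the use of NumPy's `np.interp`, and the normal test sample are my own choices for illustration, not part of the question:

```python
import numpy as np

def empirical_cdf(x, y):
    """Interpolated empirical CDF C_X(y) for a sample x, per the formulas above."""
    x = np.sort(np.asarray(x, dtype=float))
    n = len(x)
    # np.interp connects the points (x_i, i/N) with straight lines, which matches
    # the interpolation formula; left/right handle y below x_1 and above x_N.
    return np.interp(y, x, np.arange(1, n + 1) / n, left=0.0, right=1.0)

# Example: CDF of a standard normal sample evaluated at a few points
sample = np.random.default_rng(0).normal(size=1000)
print(empirical_cdf(sample, [-2.0, 0.0, 2.0]))
```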

Question: What is the most efficient way to compute the corresponding empirical PDF of $X$? Just interpolate through the histogram?

Best Answer

One of two things:

1) Use fixed histogram bucket sizes and count the number of points that fall in each bucket. In other words, break the range of $x$ into n equal intervals; the count for each interval is the number of times your CDF takes a 'step' up within that interval. Caveat: you will need to normalize at the end so that all buckets sum to 100% probability. (Both approaches are sketched in code after this list.)

2) Just take the difference between each pair of consecutive CDF points (the change in height between them), divide by $\delta x_i$ to get the slope of the CDF over that stretch of the $x$ axis, and connect the points of the PDF plot with lines of those slopes. Essentially, you are using the numerical approximation to the derivative of the CDF, which is the PDF. Warning: think very carefully about whether your construction accidentally shifts the distribution left or right by something like $\delta x_i/2$ at each point. In other words, centering each segment will be important to get right.
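Here is a minimal sketch of both approaches in Python. The standard-normal test sample, the 30-bucket choice, and the NumPy helpers are my own assumptions, not part of the answer:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = np.sort(rng.normal(size=1000))
n = len(sample)

# Method 1: fixed-width buckets; density=True normalizes so the histogram
# integrates to 1 (the normalization caveat above).
pdf_hist, edges = np.histogram(sample, bins=30, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])   # bucket centers, e.g. for plotting

# Method 2: numerical derivative of the interpolated empirical CDF.
# Between consecutive sorted points the CDF rises by 1/n, so the slope there
# is 1 / (n * dx); attach it to the midpoint of each segment so the estimate
# is not shifted by dx/2 (ties in the sample would give dx == 0 and need
# special handling).
dx = np.diff(sample)
pdf_slope = 1.0 / (n * dx)
midpoints = 0.5 * (sample[:-1] + sample[1:])
```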

If you have a good number of points, method 1 will be a lot less error-prone. For example, with 1000 points you can usually get a good discrete histogram representation of something like a normal distribution using 20-50 buckets, and on that representation you can easily compute numerical statistics (mean, moments). Since that is usually what you want, it does the job.
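For instance, once the sample is reduced to normalized bucket weights at the bucket centers, the mean and central moments follow directly. A minimal sketch, with the bin count and the normal test sample as assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(loc=2.0, scale=0.5, size=1000)

# Histogram representation: normalized bucket weights at bucket centers.
counts, edges = np.histogram(sample, bins=30)
weights = counts / counts.sum()                 # weights sum to 1
centers = 0.5 * (edges[:-1] + edges[1:])

mean = np.sum(weights * centers)                # approximate mean
var = np.sum(weights * (centers - mean) ** 2)   # approximate second central moment
print(mean, var)
```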

I sense your desire for something that looks more like a continuous function, which method 2 would give you, but I would warn you away from that unless you have a small number of data points. You will find that: (1) it is hard to represent (e.g., in a spreadsheet or as a data structure); (2) it is hard to work with even given a good representation; and (3) it takes a lot of thought to get right.

I do a lot of numerical work with unknown distributions, and method 1 is surprisingly accurate most of the time (again, given enough points).
