Solved – How to estimate probability density function (pdf) from empirical cumulative distribution function (ecdf)

density functionempirical-cumulative-distr-fnsurvival

The context is survival analysis, where I have an empirical survival function in the form of a step function, which is just one minus the ecdf. Is there a standard way to get an estimate of the pdf (in curves or histograms)?

Best Answer

This is an outline rather than a complete answer. There are two main issues: (a) finding the data values used to make the ECDF plot, and (b) using a histogram and KDE methods to estimate the PDF.

Data for an example: Here is a simple demo in R for a sample with $n=100$ unique values from $\mathsf{Norm}(\mu=50,\, \sigma = 7):$

set.seed(918);  x = runif(25, 50, 7);  sort(x)
set.seed(1234);  x = rnorm(100, 50, 7);  sort(x)
  [1] 33.58012 34.73972 37.35778 38.59635 39.86257 40.26509 40.39389 40.61305 41.23610 41.55054
 [11] 41.63830 41.82667 41.89534 42.02975 42.05774 42.23777 42.24877 42.51950 42.83441 42.89527
 [21] 43.01129 43.03962 43.22040 43.44836 43.62163 43.76974 44.01245 44.13980 44.56622 44.58653
 [31] 44.93493 45.03392 45.14396 45.31256 45.92547 45.97682 46.04884 46.17358 46.33320 46.42293
 [41] 46.49119 46.52205 46.53092 46.56520 46.65965 46.67085 46.73872 46.81172 46.91616 47.18088
 [51] 47.43433 47.78059 47.93994 48.03564 48.65884 48.75547 48.78349 48.81004 48.86383 48.92621
 [61] 49.22800 49.62789 49.75668 49.89403 49.94677 50.04825 50.45121 50.93862 51.78637 51.80783
 [71] 51.94200 52.35531 52.48885 53.00387 53.21713 53.54239 53.84998 53.94139 54.02329 54.53801
 [81] 54.59612 54.88326 55.95163 56.14743 56.71646 56.81042 57.01059 57.59109 57.71608 59.30695
 [91] 59.57479 60.14647 61.24137 61.53472 61.94175 62.43959 64.49190 64.84782 66.91085 67.84294

The ECDF plot is shown below:

plot.ecdf(x)

Finding data values. The fundamental idea is to reclaim the individual data values ('knots') at which the ECDF has jumps. If the ECDF was made using R, the knots can be found as follows:

Fn = ecdf(x);  xr = knots(Fn);  xr
  [1] 33.58012 34.73972 37.35778 38.59635 39.86257 40.26509 40.39389 40.61305 41.23610 41.55054
 [11] 41.63830 41.82667 41.89534 42.02975 42.05774 42.23777 42.24877 42.51950 42.83441 42.89527
 [21] 43.01129 43.03962 43.22040 43.44836 43.62163 43.76974 44.01245 44.13980 44.56622 44.58653 
..
 [91] 59.57479 60.14647 61.24137 61.53472 61.94175 62.43959 64.49190 64.84782 66.91085 67.84294

If the information for the ECDF is not already in R, presumably you can find or estimate the individual values from the information you do have about the ECDF. If there are ties, a knot may represent multiple observations and the value should be repeated the appropriate number of times. [If you search online with key words such as 'values of empirical distribution function' or 'finding ECDF values', you will find a variety of methods for reclaiming the original data values.]

Density estimation: Once the individual values are reclaimed or estimated, you can make a histogram on a density scale (so that the sum of the areas of the bars is unity), and use 'kernel density estimation' (KDE) to 'smooth' the histogram.

hist(xr, prob=T, col="skyblue2")
lines(density(xr), type="l", col="red", lwd=2)  # type is 'ell', not 'one'
curve(dnorm(x, 50, 7), add=T, lty="dotted")

The red curve in plot below shows the default KDE in R (at the default 'window' value). You can look at R documentation and the Wikipedia article for more information on 'kernel density estimation'. [I have found the cited publications of Bernard Silverman to be especially helpful.] The dotted line shows the density of the normal distribution from which the data were sampled. [Generally speaking, for very large samples the two density curves would be in better agreement.]

The figure below was made using a sample of 1000 from a normal distribution.

set.seed(918);  y = rnorm(1000, 50, 7)
hist(y, prob=T, col="skyblue2")
lines(density(y), type="l", col="red", lwd=2)
curve(dnorm(x, 50, 7), add=T, lty="dotted")

Related Solutions

R Density Function – How to Find Probability Density Function from Density Function

?density points out that it uses approx to do linear interpolation already; ?approx points out that approxfun generates a suitable function:

x <- log(rgamma(150,5))
df <- approxfun(density(x))
plot(density(x))
xnew <- c(0.45,1.84,2.3)
points(xnew,df(xnew),col=2)

enter image description here

By use of integrate starting from an appropriate distance below the minimum in the sample (a multiple - say 4 or 5, perhaps - of the bandwidth used in df would generally do for an appropriate distance), one can obtain a good approximation of the cdf corresponding to df.

Solved – Empirical PDF from Empirical CDF

One of two things:

1) make fixed histogram bucket sizes and then count the number of points you get that occur in each bucket. In other words, break up the range of $x$ into n equal intervals, and then the count for each interval is the number of times your CDF has a 'step' up in that interval, for each interval. Caveat: you will need to normalize, when done, so that all buckets add to 100% probability.

2) Just take the differences between each pair of CDF points (thus the change in height between them), divide by $\delta x_i$ to get the slope of the CDF at that point along the $x$ axis, and use lines of those slopes to connect the points of a PDF plot. Essentially, you are taking and using the numerical approximation to the derivative to the CDF, which is the PDF. Warning: you will need to think through very carefully if how you do this does not, accidentally, shift the distribution up or down by something like $\delta x_i/2$ at each point. In other words, centering each segment will be important to get right.

If you have a good number of points, method 1 will be a lot less error-prone - e.g., with 1000 points you can probably get a good discrete histogram representation to something like a normal distribution with 20-50 buckets which you can do numerical statistics on easily (mean, moments). Since that is usually what you want, it does the job.

I sense your desire to do something that looks more like a continuous function, which method 2 would get, but I would warn you away from that, unless you have a small number of data points. You will find that: (1) it is going to be hard to represent somehow (i.e., on a spreadsheet or as a data structure); (2) it will be hard to work with even a good representation, and (3) it will take a lot of thought to get right.

I do a lot of numerical methods with unknown distributions and method one is surprisingly accurate most of the time (again, with enough points).

Best Answer

Related Solutions

R Density Function – How to Find Probability Density Function from Density Function

Solved – Empirical PDF from Empirical CDF

Related Question