Solved – obtaining empirical CDF of a given data

Tags: empirical-cumulative-distr-fn, python

I have a dataset of a variable $x$ whose values lie between 0 and 6. I would like a function that gives the empirical CDF of $x$. Since $x$ does not follow a specific distribution (such as a Gaussian), I need to rely on the data values to build this function. Using the following code, I can plot the empirical CDF:

import matplotlib.pyplot as plt
import seaborn as sns

max_diam = 6
# distplot and the bw argument are deprecated in newer seaborn releases
ax = sns.distplot(x, hist_kws=dict(cumulative=True), kde_kws=dict(cumulative=True)).set(xlim=(0, max_diam))
ax = sns.kdeplot(x, bw=.1, cumulative=True).set(xlim=(0, max_diam), ylim=(0, 1.0))
plt.show()

Now I would like to find the function that kdeplot uses to plot the CDF. I tried polynomial regression, but the fit is poor: there is only a single point beyond 4.9 (at 6.0), so high-degree polynomials overfit and low-degree ones underfit.

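import numpy as np
import matplotlib.pyplot as plt
import sklearn.linear_model as skl
from sklearn.preprocessing import PolynomialFeatures

def ecdf(data):
    # Compute the ECDF: sorted values and their cumulative proportions
    x = np.sort(data)
    n = x.size
    y = np.arange(1, n+1) / n
    return x, y

x, y = ecdf(x)
degree = 7
lw = 2

plt.scatter(x=x, y=y, s=10)
plt.xlabel('x', fontsize=16)
plt.ylabel('y', fontsize=16)

# Expand x into polynomial terms and fit an ordinary least-squares model
polynomial_features = PolynomialFeatures(degree)
x_poly = polynomial_features.fit_transform(x.reshape(-1, 1))

model = skl.LinearRegression()
model.fit(x_poly, y)

# x_plot was not defined in the original post; an evenly spaced grid
# over the data range is assumed here
x_plot = np.linspace(x.min(), x.max(), 200)
# Reuse the fitted transformer (transform, not fit_transform) for new points
x_test = polynomial_features.transform(x_plot.reshape(-1, 1))
y_test = model.predict(x_test)
plt.plot(x_plot, y_test, color='yellowgreen', linewidth=lw, label="degree %d" % degree)

plt.show()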

So my questions are:

Is there a way to get the function that kdeplot is using for plotting the orange line?
Is there a way to have a better regression that is accurate and does not over/underfit?

Best Answer

The concept of the empirical CDF (ECDF) of a sample is very simple. First, the value of the ECDF below the minimum observation is $0$ and its value above the maximum observation is $1.$ Second, sort the data from smallest to largest. If there are $n$ observations (all distinct), then the ECDF jumps up by $1/n$ at each observation. If there are ties, the jump is $d/n$ for $d$ values tied at the same value.
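The description above translates almost directly into code. The following is a minimal Python sketch using only the standard library; the names `make_ecdf` and `F` and the sample values are illustrative assumptions, not part of any library or of the question's data.

```python
from bisect import bisect_right

def make_ecdf(data):
    """Return a function F where F(t) is the fraction of observations <= t."""
    xs = sorted(data)
    n = len(xs)

    def F(t):
        # bisect_right counts how many sorted values are <= t,
        # so d tied observations produce a single jump of size d/n.
        return bisect_right(xs, t) / n

    return F

# Illustrative sample with a tie at 5
F = make_ecdf([2, 5, 5, 7])
print(F(1), F(2), F(5), F(7), F(10))  # 0.0 0.25 0.75 1.0 1.0
```

The value is 0 below the minimum and 1 at and above the maximum, and the jump at the tied value 5 is 2/4 rather than 1/4.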

In R, the function ecdf does the work. (You might want to read the R documentation for ecdf.) For moderate and large sample sizes the ECDF is often a good approximation of the CDF of the population from which the data are randomly sampled (the population CDF is drawn in red by the code below).

Examples (in R):

set.seed(813)
x = runif(50, 0, 10);  plot(ecdf(x));  rug(x)
  curve(punif(x, 0, 10), add=T, col="red", n=10001)

set.seed(2019)
x = rpois(10, 3); plot(ecdf(x))
curve(ppois(x, 3), add=T, col="red", n = 10001)

set.seed(1066)
x = rexp(5000, 1/10);  plot(ecdf(x))
curve(pexp(x, 1/10), add=T, col="red")

Note: Q-Q plots (with theoretical and sample quantiles) often amount to ECDF plots with scales suitably distorted so that the population CDF is a straight line.

Addendum per @whuber Comment:

For a small dataset from a gamma distribution, we begin by showing a histogram of the data along with the true density function (left) and an ECDF of the data along with the true CDF (right). For illustration, I chose a small sample so that there will be a clear distinction between exact curves (blue) and estimated ones (red).

set.seed(814)
x = rgamma(100, 10, .2)
par(mfrow=c(1,2))
 hist(x, prob=T, ylim=c(0,.03))
  curve(dgamma(x, 10, .2), add=T, col="blue")
 plot(ecdf(x), pch=".")
  curve(pgamma(x, 10, .2), add=T, col="blue")
par(mfrow=c(1,1))

If the true population distribution is not known, its density function can be estimated by a kernel density estimator (KDE). We use the default KDE in R. The output is two vectors: x-values and y-values for plotting. These vectors are summarized below, and the first six entries in each vector are shown.

density(x)

Call:
        density.default(x = x)

Data: x (100 obs.);     Bandwidth 'bw' = 5.494

       x                 y            
 Min.   :  2.599   Min.   :9.031e-06  
 1st Qu.: 32.251   1st Qu.:9.730e-04  
 Median : 61.902   Median :4.177e-03  
 Mean   : 61.902   Mean   :8.423e-03  
 3rd Qu.: 91.554   3rd Qu.:1.602e-02  
 Max.   :121.205   Max.   :2.527e-02  
head(density(x)$x)
[1] 2.599014 2.831120 3.063227 3.295333 3.527439 3.759546
head(density(x)$y)
[1] 9.030655e-06 1.029092e-05 1.171087e-05 1.327874e-05 1.500377e-05 1.701109e-05

The points in the x-vector are evenly spaced. The points in the y-vector are scaled so that the area under the KDE curve is (almost exactly) 1. The KDE vectors can be used to estimate the CDF. Plotting points are x.k = density(x)$x and y.k = cumsum(density(x)$y)/sum(density(x)$y). The code below plots the histogram of x along with the KDE, and the ECDF along with the CDF as estimated via the KDE.

x.k = density(x)$x
y.k = cumsum(density(x)$y)/sum(density(x)$y)
par(mfrow=c(1,2))
 hist(x, prob=T)
  lines(density(x), col="red")
 plot(ecdf(x), pch=".")
  lines(x.k, y.k, col="red")
par(mfrow=c(1,1))
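For readers working in Python, as in the question, a smoothed CDF can also be written in closed form instead of summing KDE values on a grid: the CDF of a Gaussian kernel density estimate is the average of normal CDFs centered at the observations. The sketch below uses only the standard library. The function name `kde_cdf`, the bandwidth argument `h`, and the sample values are illustrative assumptions, and this is not claimed to be what seaborn's kdeplot computes internally.

```python
from math import erf, sqrt

def kde_cdf(data, h):
    """CDF of a Gaussian kernel density estimate with a given bandwidth h."""
    xs = list(data)
    n = len(xs)

    def Phi(z):
        # Standard normal CDF expressed through the error function.
        return 0.5 * (1.0 + erf(z / sqrt(2.0)))

    def F(t):
        # Average of normal CDFs, one centered at each observation.
        return sum(Phi((t - x) / h) for x in xs) / n

    return F

sample = [1.0, 2.0, 2.5, 4.0, 6.0]       # illustrative values, not the question's data
F = kde_cdf(sample, h=0.5)               # the bandwidth is an arbitrary choice here
grid = [i * 0.5 for i in range(-4, 21)]  # -2.0, -1.5, ..., 10.0
values = [F(t) for t in grid]            # smooth, nondecreasing, between 0 and 1
```

Unlike a polynomial regression on the ECDF, this estimate is nondecreasing by construction and tends to 0 and 1 in the tails; how closely it follows the ECDF depends on the bandwidth.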

Related Solutions

Solved – What inferential method produces the empirical CDF

In An Introduction to the Bootstrap, Efron and Tibshirani find it useful to characterize the empirical cumulative distribution function (ecdf) as the nonparametric maximum likelihood estimate of the "underlying population" $F$.

Given data $x_1, x_2, \ldots, x_n$, the likelihood function (by definition) is the product of the probabilities

$$L(F) = \prod_{i=1}^n {\Pr}_F(x_i).$$

E&T claim this is maximized by the ecdf. Since they leave it as an exercise, let's work out the solution here. It's not completely trivial, because we have to account for the possibility of duplicates among the data. Let's take care with the notation, then. Let $x_1, \ldots, x_m$ be the distinct data values, with $x_i$ appearing $k_i \ge 1$ times in the dataset. (Thus, $x_{m+1}, \ldots, x_n$ are all duplicates of the first $m$ values.) The ecdf is the discrete distribution that assigns probability $k_i/n$ to $x_i$ for $1 \le i \le m$.

For any distribution $F$, the likelihood $L(F)$ has $k_i$ terms equal to $p_i = {\Pr}_F(x_i)$ for each $i$. It therefore is completely determined by the vector $p=(p_1, p_2, \ldots, p_m)$ and can be computed as

$$L(F) = L(p) = \prod_{i=1}^m p_i^{k_i}.$$

Since the likelihood for the ecdf is nonzero, the maximum likelihood will be nonzero. Therefore, for any distribution $\hat F$ that maximizes the likelihood, $p_i = {\Pr}_{\hat F}(x_i)$ must be nonzero for all the data. The Axiom of Total Probability asserts the sum of the $p_i$ is at most $1$. This reduces the problem to a constrained optimization:

$$\text{Maximize } L(p) = \prod_{i=1}^m p_i^{k_i}$$

subject to

$$p_i \gt 0,\ i=1, 2, \ldots, m;\quad \sum_{i=1}^m p_i \le 1.$$

This can be solved in many ways. Perhaps the most direct is to use a Lagrange multiplier $\lambda$ to optimize $\log L$, which produces the critical equations

$$\left(\frac{p_1}{k_1}, \frac{p_2}{k_2}, \ldots, \frac{p_m}{k_m}\right) = \lambda\left(1, 1, \ldots, 1\right)$$

with unique solution $$\hat p_i = \frac{k_i}{k_1+\cdots+k_m} = \frac{k_i}{n},$$

precisely the ecdf, QED.
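The result can be checked numerically. The following Python sketch (standard library only; the data values and the helper name `log_likelihood` are illustrative assumptions) compares the log-likelihood at $\hat p_i = k_i/n$ with the log-likelihood at many randomly drawn probability vectors on the same distinct values. A random search is only a sanity check of the derivation, not a proof.

```python
import math
import random
from collections import Counter

def log_likelihood(p, k):
    """log L(p) = sum_i k_i * log(p_i) for probabilities p and multiplicities k."""
    return sum(ki * math.log(pi) for pi, ki in zip(p, k))

data = [3, 1, 4, 1, 5, 1, 3]       # illustrative data containing ties
k = list(Counter(data).values())   # multiplicities k_i of the distinct values
n = len(data)
p_hat = [ki / n for ki in k]       # probabilities assigned by the ecdf

rng = random.Random(0)
best = log_likelihood(p_hat, k)
never_beaten = True
for _ in range(10_000):
    # Draw a random probability vector with strictly positive entries.
    w = [rng.random() + 1e-12 for _ in k]
    total = sum(w)
    p = [wi / total for wi in w]
    if log_likelihood(p, k) > best + 1e-12:
        never_beaten = False

print(never_beaten)
```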

Why is this point of view important? Here are E&T:

As a result, [any] functional statistic $t(\hat F)$ is the nonparametric maximum likelihood estimate of the parameter $t(F)$. In this sense, the nonparametric bootstrap carries out nonparametric maximum likelihood inference.

[Section 21.7, p. 310]

Some words of explanation: "as a result" follows from the (easily proven) fact that the MLE (maximum likelihood estimate) of any function of a parameter is that function of the MLE of the parameter. A "functional statistic" (or "plug-in" statistic) is one that depends only on the distribution function. As an example of this distinction, E&T point out that the usual unbiased variance estimator $s^2 = \sum (x_i-\bar x)^2/(n-1)$ is not a functional statistic: if you were to duplicate every observation (doubling the dataset to $2n$ values), the ecdf would not change, but $s^2$ would be multiplied by $2(n-1)/(2n-1)$, which does change it (albeit only slightly). Functional statistics are crucial to understanding and analyzing the Bootstrap.
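The variance example is easy to reproduce. In this Python sketch (standard library only; the data values and the helper `ecdf_at` are illustrative assumptions), every observation is duplicated: the ecdf is unchanged at each point checked, while the unbiased variance shrinks by the factor $2(n-1)/(2n-1)$.

```python
from bisect import bisect_right
from statistics import variance

def ecdf_at(data, t):
    """Fraction of the observations that are <= t."""
    xs = sorted(data)
    return bisect_right(xs, t) / len(xs)

data = [1.0, 2.0, 4.0, 7.0]   # illustrative values
doubled = data + data         # every observation appears twice
n = len(data)

# The ecdf agrees at every point checked, including the data values themselves.
check_points = [0, 1, 1.5, 2, 4, 5, 7, 8]
same_ecdf = all(ecdf_at(data, t) == ecdf_at(doubled, t) for t in check_points)

# statistics.variance uses the n-1 denominator, i.e. the unbiased s^2.
ratio = variance(doubled) / variance(data)
expected = 2 * (n - 1) / (2 * n - 1)

print(same_ecdf, ratio, expected)
```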

Reference

Bradley Efron and Robert J. Tibshirani, An Introduction to the Bootstrap. Chapman & Hall, 1993.

Solved – Integrating an empirical CDF

Let the sorted data be $x_1 \le x_2 \le \cdots \le x_n$. To understand the empirical CDF $G$, consider one of the values of the $x_i$--let's call it $\gamma$--and suppose that some number $k$ of the $x_i$ are less than $\gamma$ and $t \ge 1$ of the $x_i$ are equal to $\gamma$. Pick an interval $[\alpha, \beta]$ in which, of all the possible data values, only $\gamma$ appears. Then, by definition, within this interval $G$ has the constant value $k/n$ for numbers less than $\gamma$ and jumps to the constant value $(k+t)/n$ for numbers greater than $\gamma$.

[Figure: ECDF]

Consider the contribution to $\int_0^b x h(x) dx$ from the interval $[\alpha,\beta]$. Although $h$ is not a function--it is a point measure of size $t/n$ at $\gamma$--the integral is defined by means of integration by parts to convert it into an honest-to-goodness integral. Let's do this over the interval $[\alpha,\beta]$:

$$\int_\alpha^\beta x h(x) dx = \left(x G(x)\right)\vert_\alpha^\beta - \int_\alpha^\beta G(x) dx = \left(\beta G(\beta) - \alpha G(\alpha)\right) -\int_\alpha^\beta G(x) dx. $$

The new integrand, although it is discontinuous at $\gamma$, is integrable. Its value is easily found by breaking the domain of integration into the parts preceding and following the jump in $G$:

$$\int_\alpha^\beta G(x)dx = \int_\alpha^\gamma G(\alpha) dx + \int_\gamma^\beta G(\beta) dx = (\gamma-\alpha)G(\alpha) + (\beta-\gamma)G(\beta).$$

Substituting this into the foregoing and recalling $G(\alpha)=k/n, G(\beta)=(k+t)/n$ yields

$$\int_\alpha^\beta x h(x) dx = \left(\beta G(\beta) - \alpha G(\alpha)\right) - \left((\gamma-\alpha)G(\alpha) + (\beta-\gamma)G(\beta)\right) = \gamma\frac{t}{n}.$$

In other words, this integral multiplies the location (along the $X$ axis) of each jump by the size of that jump. The size of the jump is

$$\frac{t}{n} = \frac{1}{n} + \cdots + \frac{1}{n}$$

with one term for each of the data values that equals $\gamma$. Adding the contributions from all such jumps of $G$ shows that

$$\int_0^b x h(x) dx = \sum_{i:\, 0 \le x_i \le b} \left(x_i\frac{1}{n}\right) = \frac{1}{n}\sum_{x_i\le b}x_i.$$
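This identity can be verified numerically by carrying out the integration by parts over the whole interval $[0,b]$. The Python sketch below (standard library only) assumes all observations are positive, so that $G(0)=0$; the function names and the data values are illustrative assumptions. It integrates the step function $G$ exactly, piece by piece, and compares $bG(b) - \int_0^b G(x)dx$ with the direct partial sum.

```python
from bisect import bisect_right

def partial_mean_by_parts(data, b):
    """b*G(b) minus the exact integral of the step function G over [0, b].

    Assumes all observations are positive, so that G(0) = 0.
    """
    xs = sorted(data)
    n = len(xs)

    def G(t):
        return bisect_right(xs, t) / n

    # G is constant between consecutive breakpoints, so integrate piecewise.
    cuts = [0.0] + [x for x in xs if x < b] + [b]
    integral = sum(G(lo) * (hi - lo) for lo, hi in zip(cuts, cuts[1:]))
    return b * G(b) - integral

def partial_mean_direct(data, b):
    """(1/n) times the sum of the observations not exceeding b."""
    return sum(x for x in data if x <= b) / len(data)

data = [0.5, 1.25, 1.25, 3.0, 4.5]   # illustrative positive values with a tie
b = 3.0
by_parts = partial_mean_by_parts(data, b)
direct = partial_mean_direct(data, b)
print(by_parts, direct)              # both approximately 1.2
```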

We might call this a "partial mean," seeing that it equals $1/n$ times a partial sum. (Please note that it is not an expectation. It can be related to the expectation of a version of the underlying distribution that has been truncated to the interval $[0,b]$: you must replace the $1/n$ factor by $1/m$ where $m$ is the number of data values within $[0,b]$.)

Given $k$, you wish to find $b$ for which $\frac{1}{n}\sum_{x_i\le b}x_i = k.$ Because the partial sums are a finite set of values, usually there is no solution: you will need to settle for the best approximation, which can be found by bracketing $k$ between two partial means, if possible. That is, upon finding $j$ such that

$$\frac{1}{n}\sum_{i=1}^{j-1} x_i \le k \lt \frac{1}{n}\sum_{i=1}^j x_i,$$

you will have narrowed $b$ to the interval $[x_{j-1}, x_j)$. You can do no better than that using the ECDF. (By fitting some continuous distribution to the ECDF you can interpolate to find an exact value of $b$, but its accuracy will depend on the accuracy of the fit.)
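The bracketing step can be sketched in Python as follows (standard library only). The function name `bracket_upper_limit`, its return convention, and the data values are illustrative assumptions rather than anything prescribed above.

```python
from itertools import accumulate

def bracket_upper_limit(data, k):
    """Return (lower, upper) data values bracketing b, or None if no bracket exists.

    Finds the first index j whose partial mean exceeds k.
    """
    xs = sorted(data)
    n = len(xs)
    partial_means = [s / n for s in accumulate(xs)]
    for j, pm in enumerate(partial_means):
        if pm > k:
            # When j == 0 there is no data value below the bracket.
            lower = xs[j - 1] if j > 0 else None
            return lower, xs[j]
    return None  # k is at least the full mean, so no partial mean exceeds it

data = [0.2, 0.4, 0.9, 1.5, 3.0]          # illustrative values
print(bracket_upper_limit(data, 0.25))    # (0.4, 0.9)
```

Here the partial means are approximately 0.04, 0.12, 0.30, 0.60 and 1.20, so the target 0.25 falls between the second and third, and $b$ is narrowed to the interval from 0.4 to 0.9.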

R performs the partial sum calculation with cumsum and finds where it crosses any specified value using the which family of searches, as in:

set.seed(17)
k <- 0.1
var1 <- round(rgamma(10, 1), 2)
x <- sort(var1)
x.partial <- cumsum(x) / length(x)
i <- which.max(x.partial > k)
cat("Upper limit lies between", x[i-1], "and", x[i])

The output in this example of data drawn iid from an Exponential distribution is

Upper limit lies between 0.39 and 0.57

The true value, solving $0.1 = \int_0^b x \exp(-x)dx,$ is $0.531812$. It lies within the reported interval, which suggests the code is correct. (Simulations with much larger datasets continue to support this conclusion.)
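The quoted true value can be reproduced from the closed form $\int_0^b x \exp(-x)dx = 1 - (1+b)\exp(-b)$, which increases with $b$, so a simple bisection finds the root. This is a minimal Python sketch using only the standard library; the function names, the search interval and the tolerance are illustrative assumptions.

```python
from math import exp

def partial_expectation(b):
    """Closed form of the integral of x*exp(-x) from 0 to b."""
    return 1.0 - (1.0 + b) * exp(-b)

def solve_upper_limit(k, lo=0.0, hi=50.0, tol=1e-10):
    """Bisection for b with partial_expectation(b) = k, assuming 0 < k < 1."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if partial_expectation(mid) < k:
            lo = mid   # the root lies to the right of mid
        else:
            hi = mid   # the root lies at or to the left of mid
    return 0.5 * (lo + hi)

b = solve_upper_limit(0.1)
print(b)
```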

Here is a plot of the empirical CDF $G$ for these data, with the estimated values of the upper limit shown as vertical dashed gray lines:

[Figure: ECDF]
