Solved – Integrating an empirical CDF

empirical-cumulative-distr-fnintegralr

I have an empirical distribution $G(x)$. I calculate it as follows

    x <- seq(0, 1000, 0.1)
    g <- ecdf(var1)
    G <- g(x)

I denote $h(x) = dG/dx$, i.e., $h$ is the pdf while $G$ is the cdf.

I now want to solve an equation for the upper limit of integration (say, $a$), such that the expected value of $x$ is some $k$.

That is, integrating from $0$ to $b$, I should have $\int xh(x)dx = k$. I want to solve for $b$.

Integrating by parts, I can rewrite the equation as

$bG(b) – \int_0^b G(x)dx = k$, where the integral is from $0$ to $b$ ——- (1)

I think I can calculate the integral as follows

    intgrl <- function(b) {
        z <- seq(0, b, 0.01)
        G <- g(z)
        return(mean(G))
     }

But when I try to use this function with

    library(rootSolve)
    root <- uniroot.All(fun, c(0, 1000))

where fun is eq (1), I get the following error

    Error in seq.default(0, b, by = 0.01) : 'to' must be of length 1

I think the issue is that my function intgrl is evaluated at a numeric value, while uniroot.All is passing the interval c(0,1000)

How should I solve for $b$ in this situation in R?

Best Answer

Let the sorted data be $x_1 \le x_2 \le \cdots \le x_n$. To understand the empirical CDF $G$, consider one of the values of the $x_i$--let's call it $\gamma$--and suppose that some number $k$ of the $x_i$ are less than $\gamma$ and $t \ge 1$ of the $x_i$ are equal to $\gamma$. Pick an interval $[\alpha, \beta]$ in which, of all the possible data values, only $\gamma$ appears. Then, by definition, within this interval $G$ has the constant value $k/n$ for numbers less than $\gamma$ and jumps to the constant value $(k+t)/n$ for numbers greater than $\gamma$.

ECDF

Consider the contribution to $\int_0^b x h(x) dx$ from the interval $[\alpha,\beta]$. Although $h$ is not a function--it is a point measure of size $t/n$ at $\gamma$--the integral is defined by means of integration by parts to convert it into an honest-to-goodness integral. Let's do this over the interval $[\alpha,\beta]$:

$$\int_\alpha^\beta x h(x) dx = \left(x G(x)\right)\vert_\alpha^\beta - \int_\alpha^\beta G(x) dx = \left(\beta G(\beta) - \alpha G(\alpha)\right) -\int_\alpha^\beta G(x) dx. $$

The new integrand, although it is discontinuous at $\gamma$, is integrable. Its value is easily found by breaking the domain of integration into the parts preceding and following the jump in $G$:

$$\int_\alpha^\beta G(x)dx = \int_\alpha^\gamma G(\alpha) dx + \int_\gamma^\beta G(\beta) dx = (\gamma-\alpha)G(\alpha) + (\beta-\gamma)G(\beta).$$

Substituting this into the foregoing and recalling $G(\alpha)=k/n, G(\beta)=(k+t)/n$ yields

$$\int_\alpha^\beta x h(x) dx = \left(\beta G(\beta) - \alpha G(\alpha)\right) - \left((\gamma-\alpha)G(\alpha) + (\beta-\gamma)G(\beta)\right) = \gamma\frac{t}{n}.$$

In other words, this integral multiplies the location (along the $X$ axis) of each jump by the size of that jump. The size of the jump is

$$\frac{t}{n} = \frac{1}{n} + \cdots + \frac{1}{n}$$

with one term for each of the data values that equals $\gamma$. Adding the contributions from all such jumps of $G$ shows that

$$\int_0^b x h(x) dx = \sum_{i:\, 0 \le x_i \le b} \left(x_i\frac{1}{n}\right) = \frac{1}{n}\sum_{x_i\le b}x_i.$$

We might call this a "partial mean," seeing that it equals $1/n$ times a partial sum. (Please note that it is not an expectation. It can be related to the expectation of a version of the underlying distribution that has been truncated to the interval $[0,b]$: you must replace the $1/n$ factor by $1/m$ where $m$ is the number of data values within $[0,b]$.)

Given $k$, you wish to find $b$ for which $\frac{1}{n}\sum_{x_i\le b}x_i = k.$ Because the partial sums are a finite set of values, usually there is no solution: you will need to settle for the best approximation, which can be found by bracketing $k$ between two partial means, if possible. That is, upon finding $j$ such that

$$\frac{1}{n}\sum_{i=1}^{j-1} x_i \le k \lt \frac{1}{n}\sum_{i=1}^j x_i,$$

you will have narrowed $b$ to the interval $[x_{j-1}, x_j)$. You can do no better than that using the ECDF. (By fitting some continuous distribution to the ECDF you can interpolate to find an exact value of $b$, but its accuracy will depend on the accuracy of the fit.)

R performs the partial sum calculation with cumsum and finds where it crosses any specified value using the which family of searches, as in:

set.seed(17)
k <- 0.1
var1 <- round(rgamma(10, 1), 2)
x <- sort(var1)
x.partial <- cumsum(x) / length(x)
i <- which.max(x.partial > k)
cat("Upper limit lies between", x[i-1], "and", x[i])

The output in this example of data drawn iid from an Exponential distribution is

Upper limit lies between 0.39 and 0.57

The true value, solving $0.1 = \int_0^b x \exp(-x)dx,$ is $0.531812$. Its closeness to the reported results suggests this code is accurate and correct. (Simulations with much larger datasets continue to support this conclusion).

Here is a plot of the empirical CDF $G$ for these data, with the estimated values of the upper limit shown as vertical dashed gray lines:

Figure of ECDF

Related Solutions

Solved – Integrate with eCDF quickly in R

Defining the empirical distribution function $$ \hat{F}_n(t)=\frac{1}{n}\sum_{i=1}^n I_{[x_i,\infty)}(t) \, , $$ it follows that $$ \int_{-\infty}^\infty g(t)\,d\hat{F}_n(t) = \frac{1}{n} \sum_{i=1}^n g(x_i) \, . $$ Hence, you don't need to use integrate() to solve this problem. This kind of R code

x <- rnorm(10^6)
g <- function(t) exp(t) # say
mean(g(x))

should be super fast because it is vectorized.

Solved – Importance Sampling to evaluate integral in R

When running the code provided for the second function I get

> c(mean(Y),var(Y))
[1] 3.2981238 0.5203621
> integrate(f,0.01,1)
3.19264 with absolute error < 1.1e-06

which means that the true value of the integral is close to 3.2, not to 0.70.

If you want to integrate f from 0.3 to 8, then the importance function must be

> w <- function(x) dunif(x,0.3,8)/dnorm(x,0.5,0.25)

and I changed the handling of the NA's in the problem by

> f <- function(x) (x>0)*(1+sinh(2*x)*log(abs(x)))^(-1)

which leads to

> Y <- w(X)*f(X)
> integrate(f,0.3,8)
2.77512 with absolute error < 2.4e-05
> integrate(f,0.3,8)$val/(8-.3)
[1] 0.3604052

which shows a good fit even though the Normal importance function N(0.5,0.5) may be too concentrated, i.e., it has too small a variance to cover (0.3,8) with some reasonable probability. If one changes the scale of the Normal,

> X <- rnorm(1e5,0.5,2.25)
> Y <- w(X)*f(X)
> c(mean(Y),var(Y))
[1] 0.3664046 1.0230197
> integrate(f,0.3,8)$va/(8-.3)
[1] 0.3604052

which shows a fairly good fit. When using a smaller standard deviation like $\sigma=0.1$, I find a poorer fit

> c(mean(Y),var(Y))
[1]   0.3370849 484.6246147

even when compared with $\sigma=1$:

> c(mean(Y),var(Y))
[1] 0.3606815 0.3232931

To make sense of those variations, I ran the experiment for a range of values of $\sigma$, from $0.1$ to $4$, with $10^6$ normal simulations, and got the following graph (with a log scale on the first axis)

which shows the improvement brought by large enough $\sigma$'s. And an optimum around $\sigma=0.74$.

Here is the relevant R code

w <- function(x) dunif(x,0.3,8)/dnorm(x,0.5,sigma) 
sigs=varz=meanz=seq(.1,4,le=50)
for (i in 1:50){
sigma=sigs[i];X=0.5+sigma*X0;Y=w(X)*f(X)
varz[i]=var(Y);meanz[i]=mean(Y)}

Best Answer

Related Solutions

Solved – Integrate with eCDF quickly in R

Solved – Importance Sampling to evaluate integral in R

Related Question