Probability Density – Is Integrating Over an Interval of Probability Density Curve the Cumulative Probability?

cumulative distribution functiondensity functionprobabilityr

set.seed(42)
a <- rnorm(200) # generate random data
hist(a, probability = T) # histogram
d <- density(a) # calculate density
polygon(d, density = 10, col = "deeppink") # plot density
abline(v=0, col="blue") # draw boundaries
abline(v=0.1, col="blue")

If I want to know the probability of the variable between 2 points (since it is considered continuous), e.g. 0 and 0.1, the way to calculate it is:

integrate(approxfun(d), lower = 0, upper = 0.1)

Since the probability density function is the derivative of the cumulative distribution function, when I apply integrate(approxfun(d), lower = 0, upper = 0.1), I am reversing the derivation. Does this mean that the result I get is cumulative probability?

Best Answer

For easier reading, I have combined three extensive Comments (now deleted) into an Answer:

You don't have the true PDF $f(x)$ from density in R. From the code, we know $X$ is standard normal, so the exact value of $p=P(0<X<1)$ could be found from in R as $0.3413447.$

diff(pnorm(c(0,1)))
[1] 0.3413447

However, I suppose you want to get $p$ from your $n=200$ observations a. The most direct way to do that is to find the proportion of values of a in (0,1):

set.seed(42)
a = rnorm(200) # generate random data

mean((a > 0) & (a < 1)) 
[1] 0.33

Alternatively, if you know data are normal, then you could estimate $μ,σ$ from data and use R's normal pdf function to get $0.3429.$

mu = mean(a);  sd = sd(a)
diff(pnorm(c(0,1), mu, sd)) 
[1] 0.3428732

I wouldn't expect a density estimator to do very much better.

The output of density in R is a sequence of 512 x-values and 512 y-values that can be used to plot the estimated PDF (enclosing unit area).

density(a)

Call:
        density.default(x = a)

Data: a (200 obs.);     Bandwidth 'bw' = 0.2895

       x                 y            
 Min.   :-3.8616   Min.   :0.0000781  
 1st Qu.:-2.0036   1st Qu.:0.0124148  
 Median :-0.1456   Median :0.0685148  
 Mean   :-0.1456   Mean   :0.1344194  
 3rd Qu.: 1.7124   3rd Qu.:0.2425435  
 Max.   : 3.5704   Max.   :0.4123541 

hist(a, prob=T, col="skyblue2")
 lines(density(a), col="brown", lwd=2)
 rug(a)

The figure below shows a histogram of a along with the density estimator. Tick marks along the horizontal axis show locations of the $n=200$ observations. [Sometimes density estimators are informally called 'smoothed histograms', but they are based on individual data points without reference to the binning of any histogram. The density estimator used here is the default estimator from density in R; variations are available via parameters not used here.]

You might try to use this output to estimate $p,$ as follows, to get $p \approx 0.337867.$

xx = density(a)$x; yy = density(a)$y
sum(yy[xx > 0 & xx < 1])/sum(yy)
[1] 0.337867

This method does have the advantage of not needing to know the population family of distributions (e.g., normal).

Addendum, showing results for a much larger sample: $n=10\,000.$

set.seed(2022)

# counting points
A = rnorm(10000)
mean((A > 0) & (A < 1))
[1] 0.3396

# assuming normality 
mu = mean(A);  sd = sd(A)
diff(pnorm(c(0,1), mu, sd))
[1] 0.3407714

# density estimation 
xx = density(A)$x;  yy = density(A)$y
sum(yy[xx > 0 & xx < 1])/sum(yy)
[1] 0.3373269

Related Solutions

Solved – Calculate tail probabilities from density() call in R

I would take the same approach as @Flounderer, but exploit another feature of R's density() function; namely the from and to arguments, which restrict the density estimation to the region enclosed by the two arguments. This results in the same density estimates as running the function without from and/or to, but by restricting the range of the density estimate to the region of interest, we focus all of the n evaluation points on the region of interest.

set.seed(1)
x <-rnorm(1000)
hist(x,freq=FALSE)
lines( dens <- density(x) )
lines( dens2 <- density(x, from = 1, n = 1024), col = "red", lwd = 2)

This produces

enter image description here

The red line is to illustrate that the density estimates in dens and dens2 are the same for the region of interest.

Then you can follow the approach @Flounderer used to evaluate the tail probability:

> with(dens2, sum(y * diff(x)[1]))
[1] 0.1680759

The advantage of this approach is to expend the n observations at which density() evaluates the KDE all on the region of interest. The larger n the higher the resolution that you have in evaluating the tail probability.

Note from ?density that given the FFT used in the implementation, having n as a multiple of 2 is advantageous.

Solved – Calculating (stepwise) cumulative probability from density in R

Your question is not totally clear for me, but there is a simple answer to your main question. Kernel density estimate is in fact a mixture distribution

$$ f(x) = \sum_{i=1}^n p_i \, f_i(x; x_i, h) $$

where $p_i$ are mixing proportions, all equal to $1/n$ and $f_i$ distributions are your kernels, each parametrized by mean $x_i$ and bandwidth $h$. Cumulative distribution function of a mixture distribution is

$$ F(x) = \sum_{i=1}^n p_i \, F_i(x; x_i, h) $$

so with Gaussian kernel it is simply

$$ F(x) = \frac{1}{n} \sum_{i=1}^n \Phi(\tfrac{x-x_i}{h}) $$

where $\Phi$ is a standard normal cdf.

You do not need to use density function for it, instead you can simply use the direct algorithm:

# your data
x <- c(rnorm(50), rnorm(120, 1.5, 0.5), rnorm(50, -1, 0.1))
# points you want to evaluate your cdf on
grid <- seq(-6, 6, by = 0.01)
# bandwidth
h <- 0.1

p <- numeric(length(grid))
for (i in 1:length(x))
  p <- p + pnorm(grid, x[i], h)
p <- p/length(x)

This can be achieved in R also with more compact code:

apply(vapply(1:length(x), function(i) pnorm(grid, x[i], h)/length(x), numeric(length(grid))), 1, sum)

Below you can see the cdf of such distribution plotted against empirical cumulative distribution function of the simulated data (as above).

Best Answer

Related Solutions

Solved – Calculate tail probabilities from density() call in R

Solved – Calculating (stepwise) cumulative probability from density in R

Related Question