Probability Density – Is Integrating Over an Interval of Probability Density Curve the Cumulative Probability?

cumulative distribution function, density function, probability, r

set.seed(42)
a <- rnorm(200) # generate random data
hist(a, probability = TRUE) # histogram on the density scale
d <- density(a) # calculate density
polygon(d, density = 10, col = "deeppink") # plot density
abline(v=0, col="blue") # draw boundaries
abline(v=0.1, col="blue")

Since the variable is continuous, I want the probability that it falls between two points, e.g. 0 and 0.1. My way of calculating it is:

integrate(approxfun(d), lower = 0, upper = 0.1)

Since the probability density function is the derivative of the cumulative distribution function, applying integrate(approxfun(d), lower = 0, upper = 0.1) reverses the differentiation. Does this mean that the result I get is a cumulative probability?

Best Answer

For easier reading, I have combined three extensive Comments (now deleted) into an Answer:

The output of density in R is only an estimate of the PDF, not the true PDF $f(x)$. From the code, we know $X$ is standard normal, so the exact value of $p=P(0<X<1)$ can be found in R as $0.3413447.$

diff(pnorm(c(0,1)))
[1] 0.3413447
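
To address the question directly: integrating a PDF over an interval gives the probability of that interval, which is a difference of two cumulative probabilities rather than a single one. In the sketch below, $F$ denotes the CDF of $X$ and $\Phi$ the standard normal CDF (symbols not used in the original post); the last line corresponds to the call diff(pnorm(c(0,1))) above.

```latex
\begin{align*}
P(a < X < b) &= \int_a^b f(x)\,dx = F(b) - F(a), \\
F(b) &= P(X \le b) = \int_{-\infty}^{b} f(x)\,dx, \\
P(0 < X < 1) &= \Phi(1) - \Phi(0) \approx 0.8413447 - 0.5 = 0.3413447.
\end{align*}
```

So the integral from 0 to 1 is a cumulative probability only in the sense that it equals $F(1) - F(0)$; the cumulative probability $F(1)$ itself requires integrating from $-\infty$.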

However, I suppose you want to estimate $p$ from your $n=200$ observations a. The most direct way to do that is to find the proportion of values of a in $(0,1)$:

set.seed(42)
a = rnorm(200) # generate random data

mean((a > 0) & (a < 1)) 
[1] 0.33
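
The same proportion can be read off the empirical CDF, which makes the link to cumulative probability explicit. This is a minimal sketch; the names Fn, p_ecdf, and p_count are only illustrative, and it assumes that no observation equals 0 or 1 exactly (which holds with probability 1 for continuous data).

```r
set.seed(42)
a <- rnorm(200)                 # same data as above

Fn <- ecdf(a)                   # empirical CDF: Fn(t) = proportion of a <= t
p_ecdf  <- Fn(1) - Fn(0)        # proportion of observations in (0, 1]
p_count <- mean(a > 0 & a < 1)  # direct count of observations in (0, 1)

# The two should agree up to floating-point error
all.equal(p_ecdf, p_count)
```

Here the interval probability is again a difference of two cumulative proportions, Fn(1) and Fn(0).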

Alternatively, if you know the data are normal, you could estimate $\mu$ and $\sigma$ from the data and use R's normal CDF function pnorm to get $0.3429.$

mu = mean(a);  sigma = sd(a)   # estimated parameters; avoids masking sd()
diff(pnorm(c(0,1), mu, sigma))
[1] 0.3428732

I wouldn't expect a density estimator to do very much better.

The output of density in R is a sequence of 512 x-values and 512 y-values that can be used to plot the estimated PDF (enclosing unit area).

density(a)

Call:
        density.default(x = a)

Data: a (200 obs.);     Bandwidth 'bw' = 0.2895

       x                 y            
 Min.   :-3.8616   Min.   :0.0000781  
 1st Qu.:-2.0036   1st Qu.:0.0124148  
 Median :-0.1456   Median :0.0685148  
 Mean   :-0.1456   Mean   :0.1344194  
 3rd Qu.: 1.7124   3rd Qu.:0.2425435  
 Max.   : 3.5704   Max.   :0.4123541 
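
As a quick check on the description above, the grid length and the enclosed area can be computed directly. This is a minimal sketch, with d, dx, and area as illustrative names; it relies on density returning an equally spaced grid, which is its default behavior.

```r
set.seed(42)
a <- rnorm(200)
d <- density(a)          # default settings: Gaussian kernel, 512 grid points

length(d$x)              # number of x-values on the grid
dx <- diff(d$x)[1]       # spacing between adjacent grid points
area <- sum(d$y) * dx    # Riemann-sum approximation to the total area
area                     # should be close to 1
```

The area is only approximately 1 because the estimate is evaluated on a finite grid.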

hist(a, prob=TRUE, col="skyblue2")
lines(density(a), col="brown", lwd=2)
rug(a)

The figure below shows a histogram of a along with the density estimator. Tick marks along the horizontal axis show locations of the $n=200$ observations. [Sometimes density estimators are informally called 'smoothed histograms', but they are based on individual data points without reference to the binning of any histogram. The density estimator used here is the default estimator from density in R; variations are available via parameters not used here.]

[Figure: histogram of a with the density estimate overlaid and tick marks for the observations]

You might use this output to estimate $p$ as the fraction of the summed y-values whose x-values lie in $(0,1)$, which gives $p \approx 0.337867.$

xx = density(a)$x; yy = density(a)$y
sum(yy[xx > 0 & xx < 1])/sum(yy)
[1] 0.337867

This method does have the advantage of not needing to know the population family of distributions (e.g., normal).
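
The approach from the question, integrate applied to an interpolated version of the density estimate, can be compared with the ratio of sums above. A minimal sketch follows; the names d, f_hat, p_int, and p_sum are illustrative. Both values estimate $P(0<X<1)$ from the same kernel density estimate, so they should be close but need not agree exactly.

```r
set.seed(42)
a <- rnorm(200)
d <- density(a)

# Linear interpolation of the estimated PDF between the grid points
f_hat <- approxfun(d)

# Numerical integral of the interpolated estimate over (0, 1)
p_int <- integrate(f_hat, lower = 0, upper = 1)$value

# Ratio of sums over the grid, as in the answer
p_sum <- sum(d$y[d$x > 0 & d$x < 1]) / sum(d$y)

c(integrate = p_int, ratio = p_sum)
```

Either way, the result is the estimated probability of the interval, not the estimated cumulative probability at its upper endpoint.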


Addendum, showing results for a much larger sample: $n=10\,000.$

set.seed(2022)

# counting points
A = rnorm(10000)
mean((A > 0) & (A < 1))
[1] 0.3396

# assuming normality 
mu = mean(A);  sigma = sd(A)   # estimated parameters; avoids masking sd()
diff(pnorm(c(0,1), mu, sigma))
[1] 0.3407714

# density estimation 
xx = density(A)$x;  yy = density(A)$y
sum(yy[xx > 0 & xx < 1])/sum(yy)
[1] 0.3373269