Calculate tail probabilities from density() call in R

density-estimation, r

This question concerns how to implement the following problem in R.

x = rnorm(1000)
hist(x,freq=FALSE)
lines(density(x))

How would you calculate the upper (or lower) tail probability for a given cutoff (e.g. +1) given the density estimate above? NOTE: the following solution isn't good enough. I need the calculation based on the smoothed density curve, i.e. an integral of the curve, not the empirical histogram.

sum(x>1)/length(x)

Also, please do not suggest using any of the standard pnorm functions, because they are only correct if the underlying distribution is correctly specified. Thanks!

Best Answer

I would take the same approach as @Flounderer, but exploit another feature of R's density() function; namely the from and to arguments, which restrict the density estimation to the region enclosed by the two arguments. This results in the same density estimates as running the function without from and/or to, but by restricting the range of the density estimate to the region of interest, we focus all of the n evaluation points on the region of interest.

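set.seed(1)
x <- rnorm(1000)
hist(x, freq = FALSE)
# Full density estimate over the default range
lines(dens <- density(x))
# Same estimate restricted to x >= 1, with all 1024 evaluation points in the tail
lines(dens2 <- density(x, from = 1, n = 1024), col = "red", lwd = 2)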

This produces a histogram of x with the full density estimate (dens) overlaid and the restricted estimate (dens2) drawn on top of it in red.

The red line is to illustrate that the density estimates in dens and dens2 are the same for the region of interest.
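The visual agreement can also be checked numerically. The following is a minimal sketch, not part of the original answer: it interpolates dens onto the grid of dens2 with approx() and reports the largest absolute difference. The names dens_on_tail and max_abs_diff are illustrative only.

```r
set.seed(1)
x <- rnorm(1000)

dens  <- density(x)                       # default range, default n = 512
dens2 <- density(x, from = 1, n = 1024)   # upper tail only

# Linearly interpolate the full estimate onto the tail grid;
# rule = 2 avoids NA if a grid end point falls marginally outside dens$x.
dens_on_tail <- approx(dens$x, dens$y, xout = dens2$x, rule = 2)$y

# Largest pointwise disagreement between the two estimates
max_abs_diff <- max(abs(dens_on_tail - dens2$y))
print(max_abs_diff)
```

A value that is tiny compared with the height of the density in the tail suggests that the two curves differ only by binning and interpolation error.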

Then you can follow the approach @Flounderer used to evaluate the tail probability:

> with(dens2, sum(y * diff(x)[1]))
[1] 0.1680759
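As a variation, the same integral can be approximated with the trapezoidal rule, and the lower tail obtained as the complement. This is only a sketch under stated assumptions: the variable names are hypothetical, the trapezoidal rule is swapped in for the Riemann sum used above, and the complement relies on the kernel density estimate integrating to 1, ignoring the negligible mass beyond the last grid point.

```r
set.seed(1)
x <- rnorm(1000)

# Restrict the evaluation grid to the upper tail, as in the answer
dens2 <- density(x, from = 1, n = 1024)

# Riemann sum from the answer: constant grid spacing times the summed heights
step <- diff(dens2$x)[1]
p_riemann <- sum(dens2$y * step)

# Trapezoidal rule: average adjacent heights, so the two end points get half weight
p_trapezoid <- sum(diff(dens2$x) * (head(dens2$y, -1) + tail(dens2$y, -1)) / 2)

# Lower tail P(X <= 1) under the KDE, taken as the complement of the upper tail
p_lower <- 1 - p_trapezoid

print(c(riemann = p_riemann, trapezoid = p_trapezoid, lower = p_lower))
```

The two upper-tail values should differ by roughly half a grid step times the density height at the cutoff, which shrinks as n grows.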

The advantage of restricting the range with from is that all n points at which density() evaluates the KDE fall in the region of interest. The larger n is, the finer the grid and the higher the resolution of the numerical integration that gives the tail probability.

Note from ?density that, because the implementation uses the FFT, it almost always makes sense to specify n as a power of 2.
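To see how the choice of n affects the result, the calculation can be repeated over several powers of 2. This is a rough sketch rather than part of the original answer; tail_prob is a hypothetical helper that wraps the density() call and the Riemann sum used above.

```r
set.seed(1)
x <- rnorm(1000)

# Upper tail probability above `cutoff` from a KDE evaluated on n grid points
tail_prob <- function(x, cutoff, n) {
  d <- density(x, from = cutoff, n = n)
  sum(d$y * diff(d$x)[1])
}

n_grid <- 2^(7:12)   # 128, 256, ..., 4096
probs <- sapply(n_grid, function(n) tail_prob(x, cutoff = 1, n = n))

print(data.frame(n = n_grid, upper_tail = probs))
```

If the estimates settle down as n increases, the remaining variation gives a feel for the discretisation error at a given grid size.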
