Solved – Calculate tail probabilities from density() call in R

density-estimation, r

This question concerns how to implement the following problem in R.

x = rnorm(1000)
hist(x,freq=FALSE)
lines(density(x))

How would you calculate the upper (or lower) tail probability for a given cutoff (e.g. +1) given the density estimate above? NOTE: the following solution isn't good enough. I need the calculation based on the smoothed density curve, i.e. an integral of the curve, not the empirical histogram.

sum(x>1)/length(x)

Also, please do not suggest any of the standard pnorm functions, because they are only correct if the underlying distribution is correctly specified. Thanks!

Best Answer

I would take the same approach as @Flounderer, but exploit another feature of R's density() function; namely the from and to arguments, which restrict the density estimation to the region enclosed by the two arguments. This results in the same density estimates as running the function without from and/or to, but by restricting the range of the density estimate to the region of interest, we focus all of the n evaluation points on the region of interest.

set.seed(1)
x <- rnorm(1000)
hist(x,freq=FALSE)
lines( dens <- density(x) )
lines( dens2 <- density(x, from = 1, n = 1024), col = "red", lwd = 2)

This produces a histogram of x with both density estimates overlaid.

The red line is to illustrate that the density estimates in dens and dens2 are the same for the region of interest.
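
The agreement can also be checked numerically. The following is a minimal sketch, not part of the original answer: it interpolates dens onto the grid used by dens2 and reports the largest gap. The name max_gap is purely illustrative, and small differences are to be expected because the two calls evaluate the KDE on different grids.

```r
set.seed(1)
x <- rnorm(1000)

dens  <- density(x)                      # full range, default n = 512
dens2 <- density(x, from = 1, n = 1024)  # upper tail only

# Interpolate the full-range estimate at the grid points of dens2
# (rule = 2 avoids NA if an endpoint falls marginally outside dens$x)
y_full <- approx(dens$x, dens$y, xout = dens2$x, rule = 2)$y

# Largest absolute difference between the two estimates on the tail
max_gap <- max(abs(y_full - dens2$y))
max_gap
```

A value close to zero indicates that restricting the range changes where the curve is evaluated, not the curve itself.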

Then you can follow the approach @Flounderer used to evaluate the tail probability:

> with(dens2, sum(y * diff(x)[1]))
[1] 0.1680759

The advantage of this approach is that all n points at which density() evaluates the KDE are spent on the region of interest. The larger n is, the higher the resolution you have when evaluating the tail probability.

Note from ?density that, because an FFT is used in the implementation, it is advantageous to make n a power of 2.
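
As a sanity check on the numerical integration, here is a hedged sketch comparing three ways of getting the same tail area. It assumes the default Gaussian kernel, in which case the KDE is an equal-weight mixture of normals centred on the data with standard deviation bw, so its tail area has a closed form. Here pnorm is used only as the kernel's CDF, not as a model for the data. The variable names (cutoff, p_riemann, p_trap, p_exact) are illustrative.

```r
set.seed(1)
x <- rnorm(1000)
cutoff <- 1

# KDE restricted to the upper tail; n is a power of 2
dens2 <- density(x, from = cutoff, n = 1024)

# Riemann sum on the equally spaced grid, as in the answer above
p_riemann <- with(dens2, sum(y * diff(x)[1]))

# Trapezoidal rule on the same grid
p_trap <- with(dens2, sum(diff(x) * (head(y, -1) + tail(y, -1)) / 2))

# Closed form for a Gaussian kernel: average of the kernel tail areas
p_exact <- mean(pnorm(cutoff, mean = x, sd = dens2$bw, lower.tail = FALSE))

c(riemann = p_riemann, trapezoid = p_trap, exact = p_exact)
```

The Riemann sum counts one grid cell more than the trapezoidal rule, so it comes out slightly larger; the gap shrinks as n grows. The closed form applies only to the Gaussian kernel, whereas the grid-based sums work for any kernel accepted by density().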

Related Solutions

Solved – Test if two samples follow the same distribution with Chi Squared in R

Given some set of cutpoints, the two-sample case becomes a chi-squared test of homogeneity of proportions (and this in turn is functionally identical to a test of independence in a $2\times k$ table).

How do I go about binning two samples using the same intervals?

Choose some set of bins (if possible without reference to the data, though in practice that may be difficult to accomplish unless you know beforehand roughly what the distribution is going to be).
For each sample, count the data in those bins.

In R you could use the cut function to set up the bins and the table function to do the counting, though that is far from the only choice. If you really want hist to choose your bins, combine the two samples into one to identify your cut-offs, then go back and do the counts for the individual samples. This may leave you with some small expected counts, but if you work with just the marginal distribution you can at least combine bins without looking at how the individual counts would have split up.

A worked example:

set.seed(7687120)                # make sure we look at the same numbers
x=rgamma(40,6,1/6)               # generate some x,y data 
y=rgamma(30,9,1/5)               # from different distributions
xy=c(x,y)                        # combine into one sample
hist(xy)                         # default hist bins not really suitable 
summary(xy)
hist(xy,breaks=seq(15,105,15))   # some small category counts at the top end
bks=c(15,30,45,60,105)           # -- push together everything above 60

table(cut(xy,breaks=bks))        # marginal totals look reasonable to me 
xc=table(cut(x,breaks=bks))      # calc. individual counts in table for x
yc=table(cut(y,breaks=bks))      # corresponding counts for y
rbind(xc,yc)                     # what the table looks like
chisq.test(rbind(xc,yc))         # testing the result

Solved – get probabilities from kernel density estimation pdf

The problem here is that your question is contradictory. You are using a KDE with a continuous kernel, which means that you are estimating using a continuous distribution. For a continuous distribution, the probability of any single outcome is zero, so we usually measure by the probability density instead. However, you say that you want the probability of the point, not its density. You also make it clear that you want the probability of the individual point, not the probability of a neighbourhood containing that point.

Under these requirements, the estimated probability of the outcome is zero. This is not helpful, which is why we measure outcomes in a continuous distribution by their probability density instead of their probability.
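
To make this concrete, here is a small R sketch, assuming a Gaussian-kernel KDE as produced by density() with its defaults. The data, the point of interest and the helper interval_prob are hypothetical. It shows the probability of a neighbourhood around a point shrinking towards zero as the neighbourhood narrows, while the density at the point stays positive.

```r
set.seed(1)
x  <- rnorm(1000)
bw <- density(x)$bw   # default Gaussian kernel; bw is the kernel's standard deviation
point <- 0.5          # hypothetical outcome of interest

# KDE probability of the interval [point - h, point + h]:
# the average of the Gaussian kernel areas over that interval
interval_prob <- function(h) {
  mean(pnorm(point + h, mean = x, sd = bw) - pnorm(point - h, mean = x, sd = bw))
}

half_widths <- c(0.5, 0.1, 0.01, 0.001)
probs <- sapply(half_widths, interval_prob)

# KDE density at the point itself
dens_at_point <- mean(dnorm(point, mean = x, sd = bw))

data.frame(half_width = half_widths,
           probability = probs,
           prob_over_width = probs / (2 * half_widths),
           density = dens_at_point)
```

The probability column goes to zero with the interval width, while probability divided by width settles near the density, which is why the density is the useful quantity for a single point.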
