Solved – Confidence intervals of the quantile of the Kernel density

confidence intervalkernel-smoothingquantiles

I have estimated the Kernel density for a set of values, for example

v<-rgamma(15,shape=400,scale=4.3)
z<-density(v)

Now, I construct a function that is the Kernel density of v, I have followed the next post:

Jarle Tufto (https://stats.stackexchange.com/users/77222/jarle-tufto), Quantile of kernel density estimator, URL (version: 2018-04-17): https://stats.stackexchange.com/q/341028

Now I am trying to construct a confidence interval for the quantile with respect to the KDE, is possible to do that? do you have any example or explanation?

Best Answer

If you need to come up with confidence intervals for some kind of complicated estimator, then the most simple and generic approach is to use bootstrap. Same as with constructing confidence intervals for the kernel densities themselves, just re-sample your data many times, estimate the quantity of interest on the data and then calculate the empirical quantiles from this simulated data to get the interval of interest.

Sidenote: As noticed by me and Nick Cox in comments to your previous question, using kernel density to calculate the quantiles, especially the extreme ones, is generally a bad idea. To convince yourself, look at the plot of (very poor) kernel density estimate with rectangular kernel and accompanying cumulative distribution generated using it. It was estimated given five data points (red dots below). Notice that if you used it to compute the quantiles, then for the small quantiles, the function will return zeros for everything smaller then the lowest observed value minus the bandwidth you used for the kernel. The estimates would completely depend on the chosen bandwidth and the observed data and would tell you almost nothing about the values outside of the range of the observed values.

The above problems actually can be nicely illustrated when you calculate the bootstrap confidence intervals for the estimated distribution. As you can see below, the most uncertain values on the extremes, get the narrowest intervals, where the intervals outside the range of the data have zero width. So the estimate for the distribution would be "certain" that anything below 0.2 have zero probability of occurrence, while being completely wrong about it (the data was actually sampled from the standard uniform distribution).

Related Solutions

Simulation – How to Simulate from Kernel Density Estimate (Empirical PDF)?

Here's an algorithm to sample from an arbitrary mixture $f(x) = \frac1N \sum_{i=1}^N f_i(x)$:

Pick a mixture component $i$ uniformly at random.
Sample from $f_i$.

It should be clear that this produces an exact sample.

A Gaussian kernel density estimate is a mixture $\frac1N \sum_{i=1}^N \mathcal{N}(x; x_i, h^2)$. So you can take a sample of size $N$ by picking a bunch of $x_i$s and adding normal noise with zero mean and variance $h^2$ to it.

Your code snippet is selecting a bunch of $x_i$s, but then it's doing something slightly different:

changing $x_i$ to $ \hat\mu + \frac{x_i - \hat\mu}{\sqrt{1 + h^2 / \hat\sigma^2}} $
adding zero-mean normal noise with variance $\frac{h^2}{1 + h^2/\hat\sigma^2} = \frac{1}{\frac{1}{h^2} + \frac{1}{\hat\sigma^2}}$, the harmonic mean of $h^2$ and $\sigma^2$.

We can see that the expected value of a sample according to this procedure is $$ \frac1N \sum_{i=1}^N \frac{x_i}{\sqrt{1 + h^2/\hat\sigma^2}} + \hat\mu - \frac{1}{\sqrt{1 + h^2 /\hat\sigma^2}} \hat\mu = \hat\mu $$ since $\hat\mu = \frac1N \sum_{i=1}^N x_i$.

I don't think the sampling disribution is the same, though.

Solved – Compute confidence interval for univariate Kernel Density Estimator

Here is a simple bootstrap approach to estimate (pointwise!) confidence bands around a kernel density estimate

$$\widehat f(x)=\frac{1}{nh}\sum^n_{i=1}k\left(\frac{x_i-x}{h}\right).$$ Basically, you just resample a bootstrap sample xstar from your original sample x ($=x_1,\ldots,x_n$), compute a new kernel density estimate dstar, do that B times and then compute quantiles from the resulting distribution at each evaluation point in the vector xax (all the different $x$ at which you desire a density estimate, I take evalpoints$=100$ equispaced points from $-3$ to $3$ here) of the kernel density estimates.

n <- 100
evalpoints <- 100
B <- 2500

x <- rnorm(n) # generate some data to play with
xax <- seq(-3,3,length.out = evalpoints)

d <- density(x,n=evalpoints,from=-3,to=3)

estimates <- matrix(NA,nrow=evalpoints,ncol=B)
for (b in 1:B){
  xstar <- sample(x,replace=TRUE)
  dstar <- density(xstar,n=evalpoints,from=-3,to=3)
  estimates[,b] <- dstar$y
}

ConfidenceBands <- apply(estimates, 1, quantile, probs = c(0.025, 0.975))
plot(d,lwd=2,col="purple",ylim=c(0,max(ConfidenceBands[2,]*1.01)),main="Pointwise bootstrap confidence bands")
lines(xax,dnorm(xax),col="gold",lwd=2) # the "truth" which we know in this simulation
xshade <- c(xax,rev(xax))
yshade <- c(ConfidenceBands[2,],rev(ConfidenceBands[1,]))
polygon(xshade,yshade, border = NA,col=adjustcolor("lightblue", 0.4))

Result:

Suggestions for references for nonparametric estimation can be found here: Book for introductory nonparametric econometrics/statistics

Best Answer

Related Solutions

Simulation – How to Simulate from Kernel Density Estimate (Empirical PDF)?

Solved – Compute confidence interval for univariate Kernel Density Estimator

Related Question