Solved – How to find 95% credible interval

bayesiancredible-intervaldescriptive statisticshighest-density-region

I am trying to compute the 95% credible interval of the following posterior distribution. I could not find the function in R for it but is the approach below correct?

x <- seq(0.4,12,0.4)
px <-  c(0,0, 0, 0, 0, 0, 0.0002, 0.0037, 0.018, 0.06, 0.22 ,0.43, 0.64,0.7579, 0.7870, 0.72, 0.555, 0.37, 0.24, 0.11, 0.07, 0.02, 0.009, 0.005, 0.0001, 0,0.0002, 0, 0, 0)
plot(x,px, type="l")
mm <- sum(x*px)/sum(px)
var <- (sum((x)^2*px)/sum(px)) - (mm^2)
cat("95% credible interval: ", round(mm -1.96*sqrt(var),3), "-", round(mm + 1.96*sqrt(var),3),"\n")

Best Answer

As noted by Henry, you are assuming normal distribution and it's perfectly ok if your data follows normal distribution, but will be incorrect if you cannot assume normal distribution for it. Below I describe two different approaches that you could use for unknown distribution given only datapoints x and accompanying density estimates px.

The first thing to consider is what exactly do you want to summarize using your intervals. For example, you could be interested in intervals obtained using quantiles, but you could also be interested in highest density region (see here, or here) of your distribution. While this should not make much (if any) difference in simple cases like symmetric, unimodal distributions, this will make a difference for more "complicated" distributions. Generally, quantiles will give you interval containing probability mass concentrated around the median (the middle $100\alpha\%$ of your distribution), while highest density region is a region around the modes of the distribution. This will be more clear if you compare the two plots on the picture below -- quantiles "cut" the distribution vertically, while highest density region "cuts" it horizontally.

Quantiles vs HDR intervals

Next thing to consider is how to deal with the fact that you have incomplete information about the distribution (assuming that we are talking about continuous distribution, you have only a bunch of points rather then a function). What you could do about it is to take the values "as is", or use some kind of interpolation, or smoothing, to obtain the "in between" values.

One approach would be to use linear interpolation (see ?approxfun in R), or alternatively something more smooth like splines (see ?splinefun in R). If you choose such approach you have to remember that interpolation algorithms have no domain knowledge about your data and can return invalid results like values below zero etc.

# grid of points
xx <- seq(min(x), max(x), by = 0.001)

# interpolate function from the sample
fx <- splinefun(x, px) # interpolating function
pxx <- pmax(0, fx(xx)) # normalize so prob >0

Second approach that you could consider is to use kernel density/mixture distribution to approximate your distribution using the data you have. The tricky part in here is to decide about optimal bandwidth.

# density of kernel density/mixture distribution
dmix <- function(x, m, s, w) {
  k <- length(m)
  rowSums(vapply(1:k, function(j) w[j]*dnorm(x, m[j], s[j]), numeric(length(x))))
}

# approximate function using kernel density/mixture distribution
pxx <- dmix(xx, x, rep(0.4, length.out = length(x)), px) # bandwidth 0.4 chosen arbitrary

Next, you are going to find the intervals of interest. You can either proceed numerically, or by simulation.

1a) Sampling to obtain quantile intervals

# sample from the "empirical" distribution
samp <- sample(xx, 1e5, replace = TRUE, prob = pxx)

# or sample from kernel density
idx <- sample.int(length(x), 1e5, replace = TRUE, prob = px)
samp <- rnorm(1e5, x[idx], 0.4) # this is arbitrary sd

# and take sample quantiles
quantile(samp, c(0.05, 0.975)) 

1b) Sampling to obtain highest density region

samp <- sample(pxx, 1e5, replace = TRUE, prob = pxx) # sample probabilities
crit <- quantile(samp, 0.05) # boundary for the lower 5% of probability mass

# values from the 95% highest density region
xx[pxx >= crit]

2a) Find quantiles numerically

cpxx <- cumsum(pxx) / sum(pxx)
xx[which(cpxx >= 0.025)[1]]   # lower boundary
xx[which(cpxx >= 0.975)[1]-1] # upper boundary

2b) Find highest density region numerically

const <- sum(pxx)
spxx <- sort(pxx, decreasing = TRUE) / const
crit <- spxx[which(cumsum(spxx) >= 0.95)[1]] * const

As you can see on the plots below, in case of unimodal, symmetric distribution both methods return the same interval.

Two kinds of intervals

Of course, you could also try to find $100\alpha\%$ interval around some central value such that $\Pr(X \in \mu \pm \zeta) \ge \alpha$ and use some kind of optimization to find appropriate $\zeta$, but the two approaches described above seem to be used more commonly and are more intuitive.

Related Question