I am trying to compute the 95% credible interval of the following posterior distribution. I could not find the function in R for it but is the approach below correct?
x <- seq(0.4,12,0.4)
px <- c(0,0, 0, 0, 0, 0, 0.0002, 0.0037, 0.018, 0.06, 0.22 ,0.43, 0.64,0.7579, 0.7870, 0.72, 0.555, 0.37, 0.24, 0.11, 0.07, 0.02, 0.009, 0.005, 0.0001, 0,0.0002, 0, 0, 0)
plot(x,px, type="l")
mm <- sum(x*px)/sum(px)
var <- (sum((x)^2*px)/sum(px)) - (mm^2)
cat("95% credible interval: ", round(mm -1.96*sqrt(var),3), "-", round(mm + 1.96*sqrt(var),3),"\n")
Best Answer
As noted by Henry, you are assuming normal distribution and it's perfectly ok if your data follows normal distribution, but will be incorrect if you cannot assume normal distribution for it. Below I describe two different approaches that you could use for unknown distribution given only datapoints
x
and accompanying density estimatespx
.The first thing to consider is what exactly do you want to summarize using your intervals. For example, you could be interested in intervals obtained using quantiles, but you could also be interested in highest density region (see here, or here) of your distribution. While this should not make much (if any) difference in simple cases like symmetric, unimodal distributions, this will make a difference for more "complicated" distributions. Generally, quantiles will give you interval containing probability mass concentrated around the median (the middle $100\alpha\%$ of your distribution), while highest density region is a region around the modes of the distribution. This will be more clear if you compare the two plots on the picture below -- quantiles "cut" the distribution vertically, while highest density region "cuts" it horizontally.
Next thing to consider is how to deal with the fact that you have incomplete information about the distribution (assuming that we are talking about continuous distribution, you have only a bunch of points rather then a function). What you could do about it is to take the values "as is", or use some kind of interpolation, or smoothing, to obtain the "in between" values.
One approach would be to use linear interpolation (see
?approxfun
in R), or alternatively something more smooth like splines (see?splinefun
in R). If you choose such approach you have to remember that interpolation algorithms have no domain knowledge about your data and can return invalid results like values below zero etc.Second approach that you could consider is to use kernel density/mixture distribution to approximate your distribution using the data you have. The tricky part in here is to decide about optimal bandwidth.
Next, you are going to find the intervals of interest. You can either proceed numerically, or by simulation.
1a) Sampling to obtain quantile intervals
1b) Sampling to obtain highest density region
2a) Find quantiles numerically
2b) Find highest density region numerically
As you can see on the plots below, in case of unimodal, symmetric distribution both methods return the same interval.
Of course, you could also try to find $100\alpha\%$ interval around some central value such that $\Pr(X \in \mu \pm \zeta) \ge \alpha$ and use some kind of optimization to find appropriate $\zeta$, but the two approaches described above seem to be used more commonly and are more intuitive.