I am trying to do a "round trip" to verify that my code works with regards to recovering parameters of normal distributions.
I'm using this document regarding the normal-gamma distribution:
http://www.cs.ubc.ca/~murphyk/Teaching/CS340-Fall07/reading/NG.pdf
The basic flow is to generate 100,000 mean and variance parameters from the normal-gamma with prior parameters given. Then for each mean-variance parameter pair, randomly generate 100 samples. Next, using the prior parameters and the data, develop 95% credible intervals for each of the 100,000 mean-variance pairs. And finally create 95% credible intervals for the posterior predictive distribution and compare these to a randomly generated observation.
What's interesting is that the mean and variance CIs never seem to contain the actual parameter 95% of the time, but the posteriror predictive interval seems fine. Is this expected? Here is the R code:
rm(list = ls(all = TRUE))
set.seed(115)
no_of_examples = 100000
examples = seq(1:no_of_examples)
#priors
m_0 = 0
k_0 = 6
a_0 = 10
b_0 = 840
means = rnorm(no_of_examples, m_0, sqrt(k_0))
vars = 1/rgamma(no_of_examples, a_0, b_0)
samples = 100
data = matrix(NA, no_of_examples, samples)
for (example in examples){
data[example, ] = rnorm(samples,means[example], sqrt(vars[example]))
}
# now recover parameters
m_n = rep(NA, no_of_examples)
k_n = rep(NA, no_of_examples)
a_n = rep(NA, no_of_examples)
b_n = rep(NA, no_of_examples)
for (example in examples){
n = samples
x_bar = mean(data[example,])
sample_mean = rep(mean(data[example,]),samples)
m_n[example] = (k_0 * m_0 + n * x_bar)/(k_0 + n)
k_n[example] = (k_0 + n)
a_n[example] = a_0 + n/2
b_n[example] = b_0 + 1/2*(sum((data[example,]-sample_mean)^2))+(k_0 * n * (sample_mean[1] - m_0)^2)/(2 * (k_0 + n))
}
# check posterior of means
conf = matrix(NA,no_of_examples,2)
in_range = rep(NA,no_of_examples)
for (example in examples){
conf[example,] = qt(c(0.025,0.975),2*a_n[example]) * sqrt(b_n[example]/(a_n[example]*k_n[example])) + m_n[example]
}
in_range = ifelse(means > conf[,1],ifelse(means < conf[,2],1,0), 0)
print("% of means within 95% CI:")
print(sum(in_range)/length(in_range))
# check posterior of variances
ci_v = matrix(NA, no_of_examples, 2)
for (example in examples){
ci_v[example,] = (1/qgamma(c(0.975,0.025), a_n[example], b_n[example]))
}
in_range_v = ifelse(vars > ci_v[,1],ifelse(vars < ci_v[,2],1,0), 0)
print("% of variance within 95% CI:")
print(sum(in_range_v)/length(in_range_v))
# check posterior predictive distribution
y_p = rep(NA, no_of_examples)
ci_p = matrix(NA, no_of_examples, 2)
for(example in examples){
# generate an observation
y_p[example] = rnorm(1, means[example], sqrt(vars[example]))
# find 95% CI for posterior predictive
ci_p[example,] = qt(c(0.025,0.975),2*a_n[example]) * sqrt((b_n[example]*(k_n[example] + 1))/(a_n[example]*k_n[example])) + m_n[example]
}
in_range_p = ifelse(y_p > ci_p[,1],ifelse(y_p < ci_p[,2],1,0), 0)
print("% of observations within posterior predictive CI:")
print(sum(in_range_p)/length(in_range_p))
My results are:
[1] "% of means within 95% CI:"
[1] 0.94122
[1] "% of variance within 95% CI:"
[1] 0.93993
[1] "% of observations within posterior predictive CI:"
[1] 0.95123
Is there an error in my code? Or am I completely misunderstanding credible intervals?
Update: I edited a small error in the b_n update, and got a little bit closer on the variance CI, but it still seems to be consistently below 95% (when I change the seed). The percentage of means and variances falling within the 95% credible interval both seem to be right around 94%.
Best Answer
Fixed it. Edited code now works properly, and my new results are:
Hopefully this code will help someone else.