The Influence of Bayesian Priors in RJAGS for Categorical Variables

Tags: bayesian, jags, r, regression

I am a novice statistician, new to Bayesian tools, and I am learning how to fit regressions with rjags. While experimenting, I came across a situation I do not fully understand, and I would appreciate some help from the community.

Here is the example:

library(data.table)
library(ggplot2)
library(rjags)
library(magrittr)

# True overall (between-group) trend
global_slope <- 1
global_intercept <- 1

Npoints_per_group <- 50
N_groups <- 10

# Within-group slopes ("pentes" is French for "slopes"): negative on average,
# opposite in sign to the global trend, which creates the Simpson's paradox
pentes <- rnorm(N_groups, -1, .5)

# Group centers lie on the global trend line
centers_x <- seq(0, 10, length = N_groups)
center_y <- global_slope*centers_x + global_intercept

group_spread <- 2
group_names <- sample(LETTERS, N_groups)

# One cloud of points per group, each with its own slope around its center
df <- lapply(1:N_groups, function(i){
  x <- seq(centers_x[i] - group_spread/2, centers_x[i] + group_spread/2, length = Npoints_per_group)
  y <- pentes[i]*(x - centers_x[i]) + center_y[i] + rnorm(Npoints_per_group)
  data.table(x = x, y = y, ID = group_names[i])
}) %>% rbindlist()

This creates a dataset with a classic Simpson's paradox: within each group the trend is negative, while the overall trend across groups is positive:

ggplot(df,aes(x,y,color = as.factor(ID)))+
  geom_point()

[Scatter plot of y against x, colored by ID: each group shows a negative trend along a positive overall trend]

I could use a mixed-effects regression, but as a first step I just wanted to perform a regression adjusted for the categorical variable ID:

lm(y ~ x + ID,data = df) %>% summary()

Call:
lm(formula = y ~ x + ID, data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.1638 -0.7172 -0.0060  0.6544  2.9827 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.32405    0.16884  19.687   <2e-16 ***
x           -0.99256    0.07785 -12.750   <2e-16 ***
IDG         17.79467    0.72174  24.655   <2e-16 ***
IDH          8.88795    0.40220  22.098   <2e-16 ***
IDN         15.26829    0.63928  23.884   <2e-16 ***
IDS         -2.05924    0.22256  -9.252   <2e-16 ***
IDT         11.07778    0.47865  23.144   <2e-16 ***
IDU          2.11935    0.22256   9.522   <2e-16 ***
IDW         13.28410    0.55804  23.805   <2e-16 ***
IDY          4.20289    0.26829  15.665   <2e-16 ***
IDZ          6.61491    0.33074  20.000   <2e-16 ***
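
(For reference, the mixed-effects version mentioned above could look like the sketch below; this assumes the lme4 package and is not part of the original question.)

library(lme4)

# Random intercept per group: the mixed-effects analogue of adjusting for ID
lmer(y ~ x + (1 | ID), data = df) %>% summary()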

Now I want to do the same with rjags. I do the following:

model_code <- 
"model
{
  # Likelihood: JAGS parameterizes dnorm by precision, hence sigma^-2
  for (i in 1:n) {
    y[i] ~ dnorm(mu[i], sigma^-2)
    mu[i] <- alpha + beta * x[i] + bID[ID[i]]
  }

  # Group effects: the first level is fixed to 0 as the reference,
  # mirroring the dummy coding used by lm
  bID[1] <- 0
  for(j in 2:Nid){
    bID[j] ~ dunif(-5,5)
  }

  # Priors
  alpha ~ dnorm(0, 100^-2)
  beta ~ dnorm(0, 100^-2)
  sigma ~ dunif(0, 10)
}
"


df[, ID_num := as.numeric(as.factor(ID))]
model_data <- list(n = nrow(df),
                   Nid = df[, uniqueN(ID)],
                   ID = df$ID_num,
                   y = df$y,
                   x = df$x)

model <- jags.model(textConnection(model_code),
                    data = model_data,
                    n.chains = 2,
                    n.adapt = 1000)
update(model, 1000)

samples <- coda.samples(model,
                        variable.names = c("alpha", "beta", "sigma", "bID"),
                        n.iter = 1000)

summary(samples)

This does not give the expected results:

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

            Mean      SD  Naive SE Time-series SE
alpha    2.38553 0.20216 0.0045205       0.023001
bID[1]   0.00000 0.00000 0.0000000       0.000000
bID[2]   4.94015 0.05910 0.0013216       0.003490
bID[3]   2.45937 0.22747 0.0050864       0.009988
bID[4]   4.42102 0.24729 0.0055295       0.012415
bID[5]  -1.11753 0.27499 0.0061490       0.022423
bID[6]   3.17433 0.23565 0.0052692       0.011053
bID[7]   0.09581 0.24132 0.0053960       0.014685
bID[8]   3.90678 0.25029 0.0055967       0.012346
bID[9]   0.71809 0.23663 0.0052912       0.012584
bID[10]  1.66045 0.23142 0.0051747       0.011332
beta     0.33399 0.02955 0.0006607       0.003277
sigma    1.31688 0.04148 0.0009275       0.001146

I realized that this is because the prior I gave for the categorical variable ID is too narrow. If I replace

bID[j] ~ dunif(-5,5)

by

bID[j] ~ dunif(-100,100)

in the JAGS model, then I find results similar to those of the frequentist lm:

1. Empirical mean and standard deviation for each variable,
   plus standard error of the mean:

          Mean      SD  Naive SE Time-series SE
alpha    3.292 0.16957 0.0037918      0.0194214
bID[1]   0.000 0.00000 0.0000000      0.0000000
bID[2]  17.622 0.66571 0.0148857      0.2021913
bID[3]   8.805 0.35826 0.0080109      0.0892269
bID[4]  15.123 0.57542 0.0128667      0.1739097
bID[5]  -2.025 0.22548 0.0050420      0.0194497
bID[6]  10.969 0.43419 0.0097089      0.1219498
bID[7]   2.107 0.21089 0.0047156      0.0170001
bID[8]  13.156 0.51186 0.0114454      0.1450026
bID[9]   4.165 0.24613 0.0055036      0.0381072
bID[10]  6.545 0.29735 0.0066490      0.0548702
beta    -0.972 0.07282 0.0016284      0.0193806
sigma    1.027 0.03268 0.0007307      0.0009942

So the question:

I was expecting that even if the priors are wrong or too narrow (my bID[j] ~ dunif(-5,5)), the model would eventually converge to the expected solution given enough iterations, since at each step the posterior is based on the data. But it seems it does not.
The MCMC seems to have converged (to the wrong result), and increasing the burn-in does not alter the result.
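
(For reference, the convergence claim can be checked with the coda package, which rjags loads; a minimal sketch on the samples object above, excluding bID[1] since it is fixed to 0:)

# Visual check of mixing across the two chains
traceplot(samples)

# Gelman-Rubin diagnostic: values near 1 indicate the chains agree
gelman.diag(samples[, c("alpha", "beta", "sigma")])

# Effective sample sizes, to gauge autocorrelation
effectiveSize(samples[, c("alpha", "beta", "sigma")])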

Why is that? Do the priors constrain the possible results no matter how many iterations are run? Or is there some kind of local minimum that gets reached before the true result?

Best Answer

If you set a uniform distribution between -5 and 5 as the prior for the ID coefficients, this means the coefficients are assumed to lie in the interval $[-5,5]$; all other values are impossible. They cannot become possible during the MCMC run, because you have declared them a priori impossible. Since the frequentist analysis estimates several of these coefficients outside that interval, the Bayesian analysis will necessarily deviate quite a bit from the frequentist one.
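
To spell out the mechanism (this step is added for illustration): by Bayes' theorem the posterior is proportional to the likelihood times the prior, so wherever the prior density is zero, the posterior is zero as well, no matter what the data say:

$$p(b_j \mid y) \propto p(y \mid b_j)\, p(b_j), \qquad p(b_j) = 0 \text{ for } |b_j| > 5 \implies p(b_j \mid y) = 0 \text{ for } |b_j| > 5.$$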

The way you generated the data implies (as the frequentist analysis confirms) that the differences between the group means are so large that $[-5,5]$ is not a wide enough range to contain them. In other words, this choice of prior actively prevents the Bayesian analysis from finding the "correct" coefficient values. This is no longer the case with a uniform on $[-100,100]$.
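
You can check this directly from the frequentist fit (a minimal sketch):

# Range of the ID coefficients, dropping the intercept and the x slope
range(coef(lm(y ~ x + ID, data = df))[-(1:2)])
# roughly -2.1 to 17.8 in the run shown above: well outside [-5, 5]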

By the way, double-check the priors for alpha and beta as well. JAGS parameterizes dnorm by precision, so dnorm(0, 100^-2) corresponds to a standard deviation of 100, which is fairly vague here, and unlike the uniform a normal prior at least does not make any value totally impossible. In a proper Bayesian analysis you need to choose the priors so that they allow the data enough influence to put the parameters anywhere they could realistically occur. Over-precise priors are not good.
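
One common way to keep a prior proper but weak is a wide normal on the group effects, for example (a sketch; the precision value 0.001 is an illustrative choice, not from the original answer):

# In the JAGS model block: dnorm takes (mean, precision), so precision
# 0.001 corresponds to sd ~ 31.6, wide relative to the effects seen in lm
bID[1] <- 0
for(j in 2:Nid){
  bID[j] ~ dnorm(0, 0.001)
}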
