Solved – Why use a beta distribution on the Bernoulli parameter for hierarchical logistic regression

bayesian, logistic, multilevel-analysis, regression

I'm currently reading Kruschke's excellent "Doing Bayesian Data Analysis" book. However, the chapter on hierarchical logistic regression (Chapter 20) is somewhat confusing.

Figure 20.2 describes a hierarchical logistic regression in which the Bernoulli parameter is defined as a linear function of the coefficients passed through a sigmoid function. This seems to be the way hierarchical logistic regression is posed in most of the examples I've seen in other sources online as well. For instance – http://polisci2.ucsd.edu/cfariss/code/SIMlogit02.bug

However, when the predictors are nominal, he adds a layer in the hierarchy – the Bernoulli parameter is now drawn from a beta distribution (Figure 20.5) with parameters determined by $\mu$ and $\kappa$, where $\mu$ is the sigmoid transformation of the linear function of the coefficients and $\kappa$ has a gamma prior.

This seems reasonable and analogous to the coin-flipping example from chapter 9, but I don't see what having nominal predictors has to do with adding a beta distribution. Why wouldn't one do this in the case of metric predictors and why was the beta distribution added for the nominal predictors?

EDIT: Clarification on the models I'm referring to. First, a logistic regression model with metric predictors (no beta prior). This is similar to other examples of hierarchical logistic regression, such as the BUGS example linked above:

$$
y_i \sim \operatorname{Bernoulli}(\mu_i) \\
\mu_i = \operatorname{sig}(\beta_0 + \sum_j \beta_j x_{ji} ) \\
\beta_0 \sim N(M_0, T_0) \\
\beta_j \sim N(M_\beta, T_\beta) \\
$$
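For concreteness, here is a minimal sketch of this metric-predictor model in PyMC. The simulated data, the hyperparameter values, and the variable names are placeholders of my own, not Kruschke's code:

```python
import numpy as np
import pymc as pm

# Hypothetical data: n observations, p metric predictors (all values made up)
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
true_beta = np.array([1.0, -0.5, 0.25])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + X @ true_beta))))

with pm.Model() as metric_model:
    # Normal priors on the intercept and coefficients (M_0, T_0, M_beta, T_beta
    # replaced by arbitrary illustrative values)
    beta0 = pm.Normal("beta0", mu=0.0, sigma=2.0)
    beta = pm.Normal("beta", mu=0.0, sigma=2.0, shape=p)

    # Bernoulli parameter: sigmoid of the linear predictor
    mu = pm.math.invlogit(beta0 + pm.math.dot(X, beta))

    # Likelihood: y_i ~ Bernoulli(mu_i)
    pm.Bernoulli("y", p=mu, observed=y)

    idata = pm.sample(1000, tune=1000, chains=2)
```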

Then the example with nominal predictors. Here is where I don't quite understand the role of the "lower" level of the hierarchy (incorporating the logistic outcome into a beta prior for a binomial) and why it should be different from the metric example.

$$
z_i \sim \operatorname{Bin}(\theta_i, N) \\
\theta_i \sim \operatorname{Beta}(a_i, b_i) \\
a_i = \mu_i \kappa \\
b_i = (1- \mu_i) \kappa \\
\kappa \sim \Gamma(S_\kappa, R_\kappa) \\
\mu_i = \operatorname{sig}(\beta_0 + \sum_j \beta_j x_{ji} ) \\
\beta_0 \sim N(M_0, T_0) \\
\beta_j \sim N(0, \tau_\beta) \\
\tau_\beta = 1/\sigma_{\beta}^2 \\
\sigma_{\beta}^2 \sim \operatorname{folded\ t} (T_t, DF)
$$
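And a corresponding PyMC sketch of the nominal-predictor version, with the extra beta/kappa layer sitting between the logistic regression and the data. Again, the cell counts and hyperparameter values are made up, and the folded-t prior is written as a half-Student-t (a folded t centered at zero):

```python
import numpy as np
import pymc as pm

# Hypothetical data: J cells of a nominal predictor, N trials per cell,
# z successes per cell (all numbers made up)
J, N = 4, 50
X = np.eye(J)                       # one indicator column per cell
z = np.array([10, 22, 30, 41])

with pm.Model() as nominal_model:
    beta0 = pm.Normal("beta0", mu=0.0, sigma=2.0)
    # Folded-t prior on the scale of the cell deflections, written here as a
    # half-Student-t; nu and sigma are arbitrary placeholders
    sigma_beta = pm.HalfStudentT("sigma_beta", nu=4, sigma=1.0)
    beta = pm.Normal("beta", mu=0.0, sigma=sigma_beta, shape=J)

    # Concentration parameter kappa with a gamma prior
    # (S_kappa, R_kappa replaced by arbitrary values)
    kappa = pm.Gamma("kappa", alpha=1.0, beta=0.1)

    # Cell-level mean from the logistic regression...
    mu = pm.math.invlogit(beta0 + pm.math.dot(X, beta))
    # ...and the extra beta layer between the regression and the data
    theta = pm.Beta("theta", alpha=mu * kappa, beta=(1.0 - mu) * kappa, shape=J)

    # Likelihood: z_i ~ Binomial(N, theta_i)
    pm.Binomial("z", n=N, p=theta, observed=z)

    idata = pm.sample(1000, tune=1000, chains=2)
```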

Best Answer

The two models you compare have many extraneous features, and I think you can restate your question more clearly in the context of the following two simplified models:

Model 1:

\begin{align} y_i | \mu_i &\sim \operatorname{Bern}( \mu_i ) \\ \mu_i &\sim \pi(\mu_i) \end{align}

Model 2:

\begin{align} y_i | \theta_i & \sim \operatorname{Bern}( \theta_i ) \\ \theta_i | \mu_i,\kappa &\sim \operatorname{Beta}\big( \mu_i\kappa, (1-\mu_i)\kappa \big) \\ \mu_i&\sim \pi(\mu_i) \end{align}

Your questions are: (1) what role is played by the beta distribution, and, relatedly, (2) how (if at all) does Model 2 differ from Model 1?

On the surface these appear to be quite different models, but in fact the marginal posterior distributions of $\mu_i$ in both models are identical. The posterior distribution of $\mu_i$ in Model 1 is \begin{gather} p(\mu_i|y_i) \propto \mu_i^{y_i}(1-\mu_i)^{1-y_i}\pi(\mu_i), \end{gather} whereas the marginal posterior distribution of $\mu_i$ in Model 2, integrating out $\theta_i$, is \begin{align} p(\mu_i|y_i,\kappa) &\propto \int^1_0 \frac{\theta_i^{y_i + \mu_i\kappa - 1}(1-\theta_i)^{\kappa(1-\mu_i)-y_i}}{B\big(\kappa\mu_i,\kappa(1-\mu_i)\big)}\, d\theta_i \,\pi(\mu_i) \\ &\propto \frac{B\big(y_i+\mu_i\kappa,\,1-y_i+\kappa(1-\mu_i)\big)}{B\big(\kappa\mu_i,\kappa(1-\mu_i)\big)}\,\pi(\mu_i) \\ &\propto \mu_i^{y_i}(1-\mu_i)^{1-y_i}\, \pi(\mu_i). \end{align}

Thus any advantage gained from using Model 2 is computational. Overparameterizing a hierarchical model, as with the addition of $\theta_i$ in Model 2, can sometimes improve the efficiency of the sampling procedure, for example by introducing conditionally conjugate relationships between groups of parameters (see Jack Tanner's answer) or by breaking correlations among the parameters of interest (google "parameter expansion").
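As a quick numerical check of the algebra above, you can simulate both models for a single observation and compare the retained draws of $\mu$. A small numpy sketch, assuming a Uniform(0,1) prior on $\mu$ and $\kappa$ fixed at 5:

```python
import numpy as np

# Assumed setup for illustration: a single observation y = 1, a Uniform(0,1)
# prior on mu, and kappa fixed at 5. Under both models the posterior of mu
# should then be Beta(2, 1), with mean 2/3.
rng = np.random.default_rng(42)
n_draws, kappa, y_obs = 1_000_000, 5.0, 1

# Model 1: y ~ Bernoulli(mu). Rejection-sample p(mu | y) by accepting each
# prior draw with probability mu^y (1 - mu)^(1 - y).
mu1 = rng.uniform(size=n_draws)
keep1 = rng.uniform(size=n_draws) < mu1**y_obs * (1 - mu1) ** (1 - y_obs)

# Model 2: theta ~ Beta(mu * kappa, (1 - mu) * kappa), y ~ Bernoulli(theta).
# Accepting with probability theta^y (1 - theta)^(1 - y) marginalises theta out.
mu2 = rng.uniform(size=n_draws)
theta = rng.beta(mu2 * kappa, (1 - mu2) * kappa)
keep2 = rng.uniform(size=n_draws) < theta**y_obs * (1 - theta) ** (1 - y_obs)

# The retained draws of mu should have the same distribution under both models
print(mu1[keep1].mean(), mu2[keep2].mean())   # both approximately 2/3
```

Both sets of retained draws approximate the same Beta(2, 1) posterior, which is exactly the equivalence derived above.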
