Solved – Why use a beta distribution on the Bernoulli parameter for hierarchical logistic regression

bayesian, logistic, multilevel-analysis, regression

I'm currently reading Kruschke's excellent "Doing Bayesian Data Analysis" book. However, the chapter on hierarchical logistic regression (Chapter 20) is somewhat confusing.

Figure 20.2 describes a hierarchical logistic regression in which the Bernoulli parameter is defined as a linear function of the coefficients passed through a sigmoid function. This seems to be the way hierarchical logistic regression is posed in most of the examples I've seen in other sources online as well. For instance – http://polisci2.ucsd.edu/cfariss/code/SIMlogit02.bug

However, when the predictors are nominal, he adds a layer in the hierarchy – the Bernoulli parameter is now drawn from a beta distribution (Figure 20.5) with parameters determined by $\mu$ and $\kappa$, where $\mu$ is the sigmoid transformation of the linear function of the coefficients and $\kappa$ has a gamma prior.

This seems reasonable and analogous to the coin-flipping example from chapter 9, but I don't see what having nominal predictors has to do with adding a beta distribution. Why wouldn't one do this in the case of metric predictors and why was the beta distribution added for the nominal predictors?

EDIT: Clarification on the models I'm referring to. First, a logistic regression model with metric predictors (no beta prior). This is similar to other examples of hierarchical logistic regression, such as the BUGS example linked above:

$$
y_i \sim \operatorname{Bernoulli}(\mu_i) \\
\mu_i = \operatorname{sig}(\beta_0 + \sum_j \beta_j x_{ji} ) \\
\beta_0 \sim N(M_0, T_0) \\
\beta_j \sim N(M_\beta, T_\beta) \\
$$
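For concreteness, here is a minimal sketch of this metric-predictor model in PyMC. The simulated data, the hyperparameter values, and the variable names are placeholders of my own, not Kruschke's code:

```python
import numpy as np
import pymc as pm

# Hypothetical data: n observations, p metric predictors (all values made up)
rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
true_beta = np.array([1.0, -0.5, 0.25])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(0.5 + X @ true_beta))))

with pm.Model() as metric_model:
    # Normal priors on the intercept and coefficients (M_0, T_0, M_beta, T_beta
    # replaced by arbitrary illustrative values)
    beta0 = pm.Normal("beta0", mu=0.0, sigma=2.0)
    beta = pm.Normal("beta", mu=0.0, sigma=2.0, shape=p)

    # Bernoulli parameter: sigmoid of the linear predictor
    mu = pm.math.invlogit(beta0 + pm.math.dot(X, beta))

    # Likelihood: y_i ~ Bernoulli(mu_i)
    pm.Bernoulli("y", p=mu, observed=y)

    idata = pm.sample(1000, tune=1000, chains=2)
```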

Then the example with nominal predictors. Here is where I don't quite understand the role of the "lower" level of the hierarchy (incorporating the logistic outcome into a beta prior for a binomial) and why it should be different from the metric example.

$$
z_i \sim \operatorname{Bin}(\theta_i, N) \\
\theta_i \sim \operatorname{Beta}(a_i, b_i) \\
a_i = \mu_i \kappa \\
b_i = (1- \mu_i) \kappa \\
\kappa \sim \Gamma(S_\kappa, R_\kappa) \\
\mu_i = \operatorname{sig}(\beta_0 + \sum_j \beta_j x_{ji} ) \\
\beta_0 \sim N(M_0, T_0) \\
\beta_j \sim N(0, \tau_\beta) \\
\tau_\beta = 1/\sigma_{\beta}^2 \\
\sigma_{\beta}^2 \sim \operatorname{folded\ t} (T_t, DF)
$$
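And a corresponding PyMC sketch of the nominal-predictor version, with the extra beta/kappa layer sitting between the logistic regression and the data. Again, the cell counts and hyperparameter values are made up, and the folded-t prior is written as a half-Student-t (a folded t centered at zero):

```python
import numpy as np
import pymc as pm

# Hypothetical data: J cells of a nominal predictor, N trials per cell,
# z successes per cell (all numbers made up)
J, N = 4, 50
X = np.eye(J)                       # one indicator column per cell
z = np.array([10, 22, 30, 41])

with pm.Model() as nominal_model:
    beta0 = pm.Normal("beta0", mu=0.0, sigma=2.0)
    # Folded-t prior on the scale of the cell deflections, written here as a
    # half-Student-t; nu and sigma are arbitrary placeholders
    sigma_beta = pm.HalfStudentT("sigma_beta", nu=4, sigma=1.0)
    beta = pm.Normal("beta", mu=0.0, sigma=sigma_beta, shape=J)

    # Concentration parameter kappa with a gamma prior
    # (S_kappa, R_kappa replaced by arbitrary values)
    kappa = pm.Gamma("kappa", alpha=1.0, beta=0.1)

    # Cell-level mean from the logistic regression...
    mu = pm.math.invlogit(beta0 + pm.math.dot(X, beta))
    # ...and the extra beta layer between the regression and the data
    theta = pm.Beta("theta", alpha=mu * kappa, beta=(1.0 - mu) * kappa, shape=J)

    # Likelihood: z_i ~ Binomial(N, theta_i)
    pm.Binomial("z", n=N, p=theta, observed=z)

    idata = pm.sample(1000, tune=1000, chains=2)
```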

Best Answer

The two models you compare have many extraneous features, and I think you can restate your question more clearly in the context of the following two simplified models:

Model 1:

\begin{align} y_i | \mu_i &\sim \operatorname{Bern}( \mu_i ) \\ \mu_i &\sim \pi(\mu_i) \end{align}

Model 2:

\begin{align} y_i | \theta_i & \sim \operatorname{Bern}( \theta_i ) \\ \theta_i | \mu_i,\kappa &\sim \operatorname{Beta}\big( \mu_i\kappa, (1-\mu_i)\kappa \big) \\ \mu_i&\sim \pi(\mu_i) \end{align}

Your questions are: (1) what role is played by the beta distribution, and, relatedly, (2) how (if at all) does Model 2 differ from Model 1?

On the surface these appear to be quite different models, but in fact the marginal posterior distributions of $\mu_i$ in both models are identical. The posterior distribution of $\mu_i$ in Model 1 is \begin{gather} p(\mu_i|y_i) \propto \mu_i^{y_i}(1-\mu_i)^{1-y_i}\pi(\mu_i), \end{gather} whereas the marginal posterior distribution of $\mu_i$ in Model 2, integrating out $\theta_i$, is \begin{align} p(\mu_i|y_i,\kappa) &\propto \int^1_0 \frac{\theta_i^{y_i + \mu_i\kappa - 1}(1-\theta_i)^{\kappa(1-\mu_i)-y_i}}{B\big(\kappa\mu_i,\kappa(1-\mu_i)\big)}\, d\theta_i \,\pi(\mu_i) \\ &\propto \frac{B\big(y_i+\mu_i\kappa,\,1-y_i+\kappa(1-\mu_i)\big)}{B\big(\kappa\mu_i,\kappa(1-\mu_i)\big)}\,\pi(\mu_i) \\ &\propto \mu_i^{y_i}(1-\mu_i)^{1-y_i}\, \pi(\mu_i). \end{align}

Thus any advantage gained from using Model 2 is computational. Overparameterizing a hierarchical model, as with the addition of $\theta_i$ in Model 2, can sometimes improve the efficiency of the sampling procedure, for example by introducing conditionally conjugate relationships between groups of parameters (see Jack Tanner's answer) or by breaking correlations among the parameters of interest (google "parameter expansion").
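As a quick numerical check of the algebra above, you can simulate both models for a single observation and compare the retained draws of $\mu$. A small numpy sketch, assuming a Uniform(0,1) prior on $\mu$ and $\kappa$ fixed at 5:

```python
import numpy as np

# Assumed setup for illustration: a single observation y = 1, a Uniform(0,1)
# prior on mu, and kappa fixed at 5. Under both models the posterior of mu
# should then be Beta(2, 1), with mean 2/3.
rng = np.random.default_rng(42)
n_draws, kappa, y_obs = 1_000_000, 5.0, 1

# Model 1: y ~ Bernoulli(mu). Rejection-sample p(mu | y) by accepting each
# prior draw with probability mu^y (1 - mu)^(1 - y).
mu1 = rng.uniform(size=n_draws)
keep1 = rng.uniform(size=n_draws) < mu1**y_obs * (1 - mu1) ** (1 - y_obs)

# Model 2: theta ~ Beta(mu * kappa, (1 - mu) * kappa), y ~ Bernoulli(theta).
# Accepting with probability theta^y (1 - theta)^(1 - y) marginalises theta out.
mu2 = rng.uniform(size=n_draws)
theta = rng.beta(mu2 * kappa, (1 - mu2) * kappa)
keep2 = rng.uniform(size=n_draws) < theta**y_obs * (1 - theta) ** (1 - y_obs)

# The retained draws of mu should have the same distribution under both models
print(mu1[keep1].mean(), mu2[keep2].mean())   # both approximately 2/3
```

Both sets of retained draws approximate the same Beta(2, 1) posterior, which is exactly the equivalence derived above.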
