Bayesian – Conducting Bayesian Logistic Regression for Multiple Categorical Variables One by One

bayesian, categorical-data, logistic, many-categories

My main background knowledge about Bayesian analysis comes from Doing Bayesian Data Analysis by John K. Kruschke.

I have a dataset with observations y (success, fail) and several categorical variables A (treatment), B (age, in 5 groups), C, D, E, etc.

First, I want to answer the question of whether the treatment has a nonzero effect on the success rate. So I fit the model

$f = \beta_0 + \beta_1[\text{A's Index}]$

$y \sim \text{bernoulli}(\text{inverse_logit}(f))$,

then I make group comparisons among the levels of A (A0, A1, A2) to check whether the difference between each pair of groups is nonzero.
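As a concrete illustration of this first step, here is a minimal numpy-only sketch (no PPL required): it simulates hypothetical data for three treatment groups, fits the Bernoulli-logit model with a random-walk Metropolis sampler, and then makes a group comparison on the log-odds scale. The group sizes, true effects, prior scale, and step size are all made-up assumptions; the model is parameterized with one log-odds parameter per group, which is an equivalent reparameterization of $\beta_0 + \beta_1[\text{A's Index}]$.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Simulated data (hypothetical): 3 treatment groups A0, A1, A2 ---
n = 900
A = rng.integers(0, 3, size=n)              # group index per observation
true_f = np.array([-0.5, 0.5, 1.0])[A]      # assumed true log-odds per group
y = rng.binomial(1, 1 / (1 + np.exp(-true_f)))

def log_post(beta):
    """Log posterior: Bernoulli-logit likelihood + Normal(0, 2) prior per group."""
    f = beta[A]
    ll = np.sum(y * f - np.logaddexp(0.0, f))   # stable log p(y | f)
    lp = np.sum(-0.5 * (beta / 2.0) ** 2)       # Normal(0, sd=2) prior
    return ll + lp

# --- Random-walk Metropolis over the 3 group parameters ---
draws = 4000
beta = np.zeros(3)
samples = np.empty((draws, 3))
cur = log_post(beta)
for i in range(draws):
    prop = beta + rng.normal(0, 0.15, size=3)
    new = log_post(prop)
    if np.log(rng.uniform()) < new - cur:       # Metropolis accept/reject
        beta, cur = prop, new
    samples[i] = beta

post = samples[1000:]                 # drop burn-in
diff_10 = post[:, 1] - post[:, 0]     # A1 vs A0 difference, log-odds scale
print("Pr(A1 > A0) =", (diff_10 > 0).mean())
```

The group comparison is just a function of the posterior draws: any pairwise difference (A1 vs A0, A2 vs A1, ...) can be summarized the same way.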

Then I add another variable B and the interaction of A and B and fit

$f = \beta_0 + \beta_1[\text{A's Index}] + \beta_2[\text{B's Index}] + \beta_{A \times B}[\text{A's Index}, \text{B's Index}]$

$y \sim \text{bernoulli}(\text{inverse_logit}(f))$,

and I analyze the group differences for A, B, and the interaction term.
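The indexing in the interaction model can be written directly in code: each observation picks up one main-effect coefficient per variable plus one cell of a 2-D interaction array. The sketch below only constructs the linear predictor and success probabilities; the coefficient values and group sizes are hypothetical, and the identifiability/coding choice (corner vs. sum-to-zero constraints) is left open, as Kruschke discusses.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: 3 treatment groups (A), 5 age groups (B)
A = rng.integers(0, 3, size=10)
B = rng.integers(0, 5, size=10)

b0 = -0.2
bA = np.array([0.0, 0.4, 0.8])           # main effect of A (corner coding here)
bB = rng.normal(0, 0.3, size=5)          # main effect of B
bAB = rng.normal(0, 0.2, size=(3, 5))    # one cell per A x B combination

# Linear predictor: f = b0 + bA[A] + bB[B] + bAB[A, B], indexed per observation
f = b0 + bA[A] + bB[B] + bAB[A, B]
p = 1 / (1 + np.exp(-f))                 # inverse-logit
print(p.round(3))
```

In a sampler, `bA`, `bB`, and `bAB` would be parameters with priors rather than fixed arrays, but the indexing of the linear predictor is identical.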

Then I put all the categorical variables together, along with the interaction term $\beta_{A\times B}$, to fit a full model. I find that these variables show a significant between-group difference, which I then claim as a contribution to the success rate.

I want to know whether this is Bayesian at all. Since I have many categorical variables, I also read this article. It reviews several methods for variable selection; however, there still does not seem to be a widely accepted way to do variable selection for categorical variables. My main task is to answer the question about A correctly. After that, I want to find out whether the other variables also contribute to the success rate or to its improvement. Is there a standard way of doing this? Could anyone share a reference?

Best Answer

Whether the "difference is significant or not" is not part of Bayesian logic. It would be good to edit that part of the post.

It is not clear at all why you entertained more than one model. It's better to formulate a complete model, then to assess evidence for nonzero or large effects on parameters in that one model. And logistic model assumptions (here, additivity on the log odds scale) cannot be simultaneously true for a model and a sub-model.

So formulate a subject-matter-driven model, choose priors, and draw inferences from that model. From the multivariate posterior parameter samples you can easily make marginal assessments by concentrating on a single parameter at a time, or a derived parameter (combination of other parameters). For example you can compute $\Pr(\theta > 0)$ or $\Pr(|\theta| > 0.2)$.
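These marginal summaries are simple Monte Carlo averages over the posterior draws. As a sketch (the draws below are a stand-in normal sample, not real MCMC output, and the 0.2 threshold is the arbitrary example value from above):

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical posterior draws for one parameter theta (e.g. a log-odds-ratio);
# in practice these would come from your sampler for the full model.
theta = rng.normal(0.3, 0.15, size=8000)

pr_pos = (theta > 0).mean()              # Pr(theta > 0)
pr_large = (np.abs(theta) > 0.2).mean()  # Pr(|theta| > 0.2)
ci = np.quantile(theta, [0.025, 0.975])  # 95% credible interval
print(pr_pos, pr_large, ci)
```

A derived parameter works the same way: form it draw-by-draw (e.g. `theta = post[:, 1] - post[:, 0]`) and apply the same summaries.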
