Solved – How to summarize credible intervals for a medical audience

bayesian, credible-interval, medicine, stan, statistical-significance

With Stan and the frontend packages rstanarm or brms I can easily analyze data the Bayesian way, as I did before with mixed models such as lme. While I have most of the books and articles by Kruschke, Gelman, Wagenmakers, etc. on my desk, these don't tell me how to summarize results for a medical audience. I am torn between the Scylla of the Bayesians' wrath and the Charybdis of medical reviewers ("we want significances, not that diffuse stuff").

An example: Gastric frequency (1/min) is measured in three groups; healthy controls are the reference. There are several measurements for each participant, so, frequentist-style, I used the following lme mixed model:

summary(lme(freq_min ~ group, random = ~ 1 | study_id, data = mo))

Slightly edited results:

Fixed effects: freq_min ~ group 
                   Value Std.Error DF t-value p-value
(Intercept)        2.712    0.0804 70    33.7  0.0000
groupno_symptoms   0.353    0.1180 27     3.0  0.0058
groupwith_symptoms 0.195    0.1174 27     1.7  0.1086

For simplicity, I will use 2 × standard error as the 95% CI.

In a frequentist context, I would have summarized this as:

  • In the control group the estimated frequency was 2.7/min (I might add the CI here, but I sometimes avoid this because of the confusion created by mixing absolute and difference CIs).
  • In the no_symptoms group, the frequency was higher than control by 0.4/min (95% CI 0.11 to 0.59/min, p = 0.006).
  • In the with_symptoms group, the frequency was higher than control by 0.2/min (95% CI -0.04 to 0.4/min, p = 0.11).
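The intervals quoted above can be reproduced directly from the fixed-effects table; a quick sketch of the estimate ± 2 × SE arithmetic (values copied from the lme output above):

```
# Approximate 95% CI as estimate ± 2 * SE, using the coefficients
# and standard errors from the lme fixed-effects table above
est <- c(no_symptoms = 0.353, with_symptoms = 0.195)
se  <- c(no_symptoms = 0.1180, with_symptoms = 0.1174)
cbind(lower = est - 2 * se, upper = est + 2 * se)
# close to the intervals quoted above, up to rounding
```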

This is about the maximum acceptable complexity for a medical publication; the reviewer will probably ask me to add "not significant" for the with_symptoms contrast.

Here is the same model with stan_lmer and default priors.

freq_stan = stan_lmer(freq_min ~ group + (1 | study_id), data = mo)


           contrast lower_CredI frequency upper_CredI
        (Intercept)     2.58322     2.714       2.846
   groupno_symptoms     0.15579     0.346       0.535
 groupwith_symptoms    -0.00382     0.188       0.384

where CredI are 90% credible intervals (see the rstanarm vignette for why 90% is the default).
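For reference, a table like the one above can be produced from the fitted stanreg object; a minimal sketch, assuming the fit `freq_stan` from above:

```
library(rstanarm)

# 90% credible intervals for the fixed effects (rstanarm's default width)
posterior_interval(freq_stan, prob = 0.90,
                   pars = c("(Intercept)", "groupno_symptoms",
                            "groupwith_symptoms"))

# posterior medians to pair with the intervals
fixef(freq_stan)
```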

Questions:

  • How to translate the above summary to the Bayesian world?
  • To what extent is a discussion of priors required? I am quite sure the paper will come back with the usual "subjective assumption" objection when I mention priors, or at least with "no technical discussion, please". But all Bayesian authorities insist that the interpretation is only valid in the context of the priors.
  • How can I deliver some "significance" surrogate in the wording, without betraying Bayesian concepts? Something like "credibly different" (uuuh…) or "almost credibly different" (buoha…, sounds like "at the brim of significance").

Jonah Gabry and Ben Goodrich (2016). rstanarm: Bayesian Applied Regression Modeling via Stan. R package version 2.9.0-3. https://CRAN.R-project.org/package=rstanarm

Stan Development Team (2015). Stan: A C++ Library for Probability and Sampling, Version 2.8.0. http://mc-stan.org/

Paul-Christian Buerkner (2016). brms: Bayesian Regression Models using Stan. R package version 0.8.0. https://CRAN.R-project.org/package=brms

Pinheiro J, Bates D, DebRoy S, Sarkar D and R Core Team (2016). nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1-124. https://CRAN.R-project.org/package=nlme

Best Answer

Quick thoughts:

1) The key issue is what applied question you are trying to answer for your audience, because that determines what information you want from your statistical analysis. In this case, it seems to me that you want to estimate the magnitude of the differences between groups (or perhaps of their ratios, if ratios are the measure more familiar to your audience). The magnitude of the differences is not directly provided by the analyses you presented in the question, but it is straightforward to get from the Bayesian analysis: you want the posterior distribution of the differences (or ratios). From that posterior distribution, you can then make a direct probability statement such as this:

"The 95% most credible differences fall between [low 95% HDI limit] and [high 95% HDI limit]." (Here I'm using the 95% highest density interval [HDI] as the credible interval; because those are by definition the highest-density parameter values, they are glossed as 'most credible'.)

A medical-journal audience would intuitively and correctly understand that statement, because it's what the audience typically thinks a frequentist confidence interval means (even though that is not the meaning of a frequentist confidence interval).
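An HDI can be computed directly from the posterior draws; here is a minimal sketch of the idea (the shortest interval containing the requested mass, assuming a roughly unimodal posterior; packaged versions exist, e.g. HDInterval::hdi or coda::HPDinterval):

```
# Shortest interval containing `mass` of the draws: a simple
# HDI estimator for a unimodal posterior
hdi <- function(draws, mass = 0.95) {
  draws <- sort(draws)
  n <- length(draws)
  k <- ceiling(mass * n)                      # draws per candidate interval
  widths <- draws[k:n] - draws[1:(n - k + 1)] # width of each candidate
  i <- which.min(widths)                      # narrowest candidate wins
  c(lower = draws[i], upper = draws[i + k - 1])
}
```

For a symmetric, Gaussian-looking posterior the HDI coincides with the central (equal-tailed) interval; the two differ for skewed posteriors.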

How do you get the differences (or ratios) from Stan or JAGS? Simply by post-processing the completed MCMC chain: at each step in the chain, compute the relevant differences (or ratios), then examine the posterior distribution of those differences (or ratios). Examples are given in DBDA2E (https://sites.google.com/site/doingbayesiandataanalysis/): for MCMC generally in Figure 7.9 (p. 177), for JAGS in Figure 8.6 (p. 211), and for Stan in Section 16.3 (p. 468).
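In the model from the question the group coefficients are already differences from control, but the same post-processing gives, for example, the posterior of the difference between the two patient groups; a sketch, assuming the fit `freq_stan` from the question:

```
draws <- as.matrix(freq_stan)   # one row per MCMC draw

# difference between the two patient groups, computed draw by draw
diff_groups <- draws[, "groupno_symptoms"] - draws[, "groupwith_symptoms"]

median(diff_groups)                     # point estimate
quantile(diff_groups, c(0.025, 0.975))  # central 95% credible interval
mean(diff_groups > 0)                   # posterior probability of a positive difference
```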

2) If you are compelled by tradition to make a statement about whether or not a difference of zero is rejected, you have two Bayesian options.

2A) One option is to make probability statements regarding intervals near zero, and their relation to the HDI. For this, you set up a region of practical equivalence (ROPE) around zero, which is merely a decision threshold appropriate for your applied domain: how big a difference is trivially small? Setting such boundaries is routinely done in clinical non-inferiority testing, for example. If your field has an 'effect size' measure, there might be conventions for a 'small' effect size, and the ROPE limits could be, say, half of a small effect. You can then make direct probability statements such as these:

"Only 1.2% of the posterior distribution of differences is practically equivalent to zero"

and

"The 95% most credible differences are all not practically equivalent to zero (i.e., the 95% HDI and ROPE do not overlap) and therefore we reject zero." (Notice the distinction between the probability statement derived from the posterior distribution and the subsequent decision based on it.)

You can also accept a difference of zero, for practical purposes, if the 95% most credible values are all practically equivalent to zero (i.e., the 95% HDI falls entirely inside the ROPE).
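Those ROPE percentages are one line of post-processing on the draws. A sketch with a hypothetical ROPE of ±0.1/min (this limit is an assumption for illustration only; in practice it must come from domain knowledge), again assuming the fit `freq_stan` from the question:

```
draws <- as.matrix(freq_stan)
d <- draws[, "groupno_symptoms"]  # posterior draws of the difference vs. control

rope <- c(-0.1, 0.1)              # hypothetical region of practical equivalence

# share of the posterior that is practically equivalent to zero
mean(d > rope[1] & d < rope[2]) * 100
```

The decision rule then compares the 95% HDI to the ROPE: reject zero if the two do not overlap, accept zero if the HDI falls entirely inside the ROPE, and withhold a decision otherwise.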

2B) A second Bayesian option is Bayesian null hypothesis testing. (Notice that the method above was not called "hypothesis testing"!) Bayesian null hypothesis testing performs a Bayesian model comparison between a prior distribution that assumes the difference can only be zero and an alternative prior distribution that assumes the difference could fall in some diffuse range of possibilities. The result of such a model comparison (usually) depends very strongly on the particular choice of alternative distribution, so the choice of alternative prior must be carefully justified. It is best to use at-least-mildly-informed priors for both the null and the alternative so that the model comparison is genuinely meaningful. Note that the model comparison provides different information than estimation of the differences between groups, because it addresses a different question. Thus, even with a model comparison, you will still want to report the posterior distribution of the magnitude of the differences between groups, because your audience will want to know the magnitude of the difference and its uncertainty (credible interval) regardless of whether you decided to reject or accept a difference of zero.

There might be ways to do a Bayesian null hypothesis test from the Stan/JAGS/MCMC output, but I do not know of one for this case. For example, one could try a Savage-Dickey approximation to a Bayes factor, but that would rely on knowing the prior density on the differences, which would require some mathematical analysis or some additional MCMC approximation from the prior.

The two methods for deciding about null values are discussed in Ch. 12 of DBDA2E https://sites.google.com/site/doingbayesiandataanalysis/. But I really don't want this discussion to get side-tracked by a debate about the "proper" way to assess null values; they're just different and they provide different information. The main point of my reply is point 1, above: Look at the posterior distribution of the differences between groups.
