Solved – How to summarize credible intervals for a medical audience

bayesian, credible-interval, medicine, stan, statistical-significance

With Stan and the frontend packages rstanarm or brms I can easily analyze data the Bayesian way, as I did before with mixed models such as lme. While I have most of the books and articles by Kruschke, Gelman, Wagenmakers, etc. on my desk, these don't tell me how to summarize results for a medical audience. I am torn between the Scylla of the Bayesians' wrath and the Charybdis of medical reviewers ("we want significances, not that diffuse stuff").

An example: Gastric frequency (1/min) is measured in three groups; healthy controls are the reference. There are several measurements for each participant, so, frequentist-style, I used the following lme mixed model:

summary(lme(freq_min ~ group, random = ~ 1 | study_id, data = mo))

Slightly edited results:

Fixed effects: freq_min ~ group 
                   Value Std.Error DF t-value p-value
(Intercept)        2.712    0.0804 70    33.7  0.0000
groupno_symptoms   0.353    0.1180 27     3.0  0.0058
groupwith_symptoms 0.195    0.1174 27     1.7  0.1086

For simplicity, I will use 2 × standard error as the 95% CI.

In a frequentist context, I would have summarized this as:

  • In the control group the estimated frequency was 2.7/min (I might add the CI here, but I sometimes avoid this because of the confusion created by mixing absolute and difference CIs).
  • In the no_symptoms group, the frequency was higher than control by 0.4/min (95% CI 0.11 to 0.59/min, p = 0.006).
  • In the with_symptoms group, the frequency was higher than control by 0.2/min (95% CI -0.04 to 0.4/min, p = 0.11).
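The intervals quoted above can be reproduced directly from the fixed-effects table; a quick sketch of the estimate ± 2 × SE arithmetic (values copied from the lme output above):

```
# Approximate 95% CI as estimate ± 2 * SE, using the coefficients
# and standard errors from the lme fixed-effects table above
est <- c(no_symptoms = 0.353, with_symptoms = 0.195)
se  <- c(no_symptoms = 0.1180, with_symptoms = 0.1174)
cbind(lower = est - 2 * se, upper = est + 2 * se)
# close to the intervals quoted above, up to rounding
```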

This is about the maximum acceptable complexity for a medical publication; the reviewer will probably ask me to add "not significant" for the with_symptoms contrast.

Here is the same model with stan_lmer and default priors.

freq_stan = stan_lmer(freq_min ~ group + (1 | study_id), data = mo)


           contrast lower_CredI frequency upper_CredI
        (Intercept)     2.58322     2.714       2.846
   groupno_symptoms     0.15579     0.346       0.535
 groupwith_symptoms    -0.00382     0.188       0.384

where CredI are 90% credible intervals (see the rstanarm vignette for why 90% is the default).
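For reference, a table like the one above can be produced from the fitted stanreg object; a minimal sketch, assuming the fit `freq_stan` from above:

```
library(rstanarm)

# 90% credible intervals for the fixed effects (rstanarm's default width)
posterior_interval(freq_stan, prob = 0.90,
                   pars = c("(Intercept)", "groupno_symptoms",
                            "groupwith_symptoms"))

# posterior medians to pair with the intervals
fixef(freq_stan)
```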

Questions:

  • How to translate the above summary to the Bayesian world?
  • To what extent is a discussion of priors required? I am quite sure the paper will come back with the usual "subjective assumption" objection when I mention priors, or at least with "no technical discussion, please". But all Bayesian authorities insist that the interpretation is only valid in the context of the priors.
  • How can I deliver some "significance" surrogate in the wording, without betraying Bayesian concepts? Something like "credibly different" (uuuh…) or "almost credibly different" (buoha…, sounds like "at the brim of significance").

Jonah Gabry and Ben Goodrich (2016). rstanarm: Bayesian Applied Regression Modeling via Stan. R package version 2.9.0-3. https://CRAN.R-project.org/package=rstanarm

Stan Development Team (2015). Stan: A C++ Library for Probability and Sampling, Version 2.8.0. http://mc-stan.org/

Paul-Christian Buerkner (2016). brms: Bayesian Regression Models using Stan. R package version 0.8.0. https://CRAN.R-project.org/package=brms

Pinheiro J, Bates D, DebRoy S, Sarkar D and R Core Team (2016). nlme: Linear and Nonlinear Mixed Effects Models. R package version 3.1-124. https://CRAN.R-project.org/package=nlme

Best Answer

Quick thoughts:

1) The key issue is what applied question you are trying to answer for your audience, because that determines what information you want from your statistical analysis. In this case, it seems to me that you want to estimate the magnitude of the differences between groups (or perhaps of their ratios, if ratios are the measure more familiar to your audience). The magnitude of the differences is not directly provided by the analyses you presented in the question, but it is straightforward to get from the Bayesian analysis: you want the posterior distribution of the differences (or ratios). From that posterior distribution, you can then make a direct probability statement such as this:

"The 95% most credible differences fall between [low 95% HDI limit] and [high 95% HDI limit]." (Here I'm using the 95% highest density interval [HDI] as the credible interval; because those are by definition the highest-density parameter values, they are glossed as 'most credible'.)

A medical-journal audience would intuitively and correctly understand that statement, because it's what the audience typically thinks a frequentist confidence interval means (even though that is not the meaning of a frequentist confidence interval).
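An HDI can be computed directly from the posterior draws; here is a minimal sketch of the idea (the shortest interval containing the requested mass, assuming a roughly unimodal posterior; packaged versions exist, e.g. HDInterval::hdi or coda::HPDinterval):

```
# Shortest interval containing `mass` of the draws: a simple
# HDI estimator for a unimodal posterior
hdi <- function(draws, mass = 0.95) {
  draws <- sort(draws)
  n <- length(draws)
  k <- ceiling(mass * n)                      # draws per candidate interval
  widths <- draws[k:n] - draws[1:(n - k + 1)] # width of each candidate
  i <- which.min(widths)                      # narrowest candidate wins
  c(lower = draws[i], upper = draws[i + k - 1])
}
```

For a symmetric, Gaussian-looking posterior the HDI coincides with the central (equal-tailed) interval; the two differ for skewed posteriors.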

How do you get the differences (or ratios) from Stan or JAGS? Simply by post-processing the completed MCMC chain: at each step in the chain, compute the relevant differences (or ratios), then examine the posterior distribution of those differences (or ratios). Examples are given in DBDA2E (https://sites.google.com/site/doingbayesiandataanalysis/): for MCMC generally in Figure 7.9 (p. 177), for JAGS in Figure 8.6 (p. 211), and for Stan in Section 16.3 (p. 468).
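In the model from the question the group coefficients are already differences from control, but the same post-processing gives, for example, the posterior of the difference between the two patient groups; a sketch, assuming the fit `freq_stan` from the question:

```
draws <- as.matrix(freq_stan)   # one row per MCMC draw

# difference between the two patient groups, computed draw by draw
diff_groups <- draws[, "groupno_symptoms"] - draws[, "groupwith_symptoms"]

median(diff_groups)                     # point estimate
quantile(diff_groups, c(0.025, 0.975))  # central 95% credible interval
mean(diff_groups > 0)                   # posterior probability of a positive difference
```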

2) If you are compelled by tradition to make a statement about whether or not a difference of zero is rejected, you have two Bayesian options.

2A) One option is to make probability statements regarding intervals near zero, and their relation to the HDI. For this, you set up a region of practical equivalence (ROPE) around zero, which is merely a decision threshold appropriate for your applied domain: how big a difference is trivially small? Setting such boundaries is routinely done in clinical non-inferiority testing, for example. If your field has an 'effect size' measure, there might be conventions for a 'small' effect size, and the ROPE limits could be, say, half of a small effect. You can then make direct probability statements such as these:

"Only 1.2% of the posterior distribution of differences is practically equivalent to zero"

and

"The 95% most credible differences are all not practically equivalent to zero (i.e., the 95% HDI and ROPE do not overlap) and therefore we reject zero." (Notice the distinction between the probability statement derived from the posterior distribution and the subsequent decision based on it.)

You can also accept a difference of zero, for practical purposes, if the 95% most credible values are all practically equivalent to zero (i.e., the 95% HDI falls entirely inside the ROPE).
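Those ROPE percentages are one line of post-processing on the draws. A sketch with a hypothetical ROPE of ±0.1/min (this limit is an assumption for illustration only; in practice it must come from domain knowledge), again assuming the fit `freq_stan` from the question:

```
draws <- as.matrix(freq_stan)
d <- draws[, "groupno_symptoms"]  # posterior draws of the difference vs. control

rope <- c(-0.1, 0.1)              # hypothetical region of practical equivalence

# share of the posterior that is practically equivalent to zero
mean(d > rope[1] & d < rope[2]) * 100
```

The decision rule then compares the 95% HDI to the ROPE: reject zero if the two do not overlap, accept zero if the HDI falls entirely inside the ROPE, and withhold a decision otherwise.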

2B) A second Bayesian option is Bayesian null hypothesis testing. (Notice that the method above was not called "hypothesis testing"!) Bayesian null hypothesis testing performs a Bayesian model comparison between a prior distribution that assumes the difference can only be zero and an alternative prior distribution that assumes the difference could fall in some diffuse range of possibilities. The result of such a model comparison (usually) depends very strongly on the particular choice of alternative distribution, so the choice of alternative prior must be carefully justified. It is best to use at-least-mildly-informed priors for both the null and the alternative so that the model comparison is genuinely meaningful. Note that the model comparison provides different information than estimation of the differences between groups, because it addresses a different question. Thus, even with a model comparison, you will still want to report the posterior distribution of the magnitude of the differences between groups, because your audience will want to know the magnitude of the difference and its uncertainty (credible interval) regardless of whether you decided to reject or accept a difference of zero.

There might be ways to do a Bayesian null hypothesis test from the Stan/JAGS/MCMC output, but I do not know of one for this case. For example, one could try a Savage-Dickey approximation to a Bayes factor, but that would rely on knowing the prior density on the differences, which would require some mathematical analysis or some additional MCMC approximation from the prior.

The two methods for deciding about null values are discussed in Ch. 12 of DBDA2E https://sites.google.com/site/doingbayesiandataanalysis/. But I really don't want this discussion to get side-tracked by a debate about the "proper" way to assess null values; they're just different and they provide different information. The main point of my reply is point 1, above: Look at the posterior distribution of the differences between groups.
