Bayes for hypothesis testing in clinical trials

How can I reject hypotheses in a Bayesian clinical trial? I am trying to introduce a fully Bayesian mindset in clinical trials and start using a Bayesian philosophy. However, this is very complicated, because neither the companies nor the regulatory agencies are ready for the change.

Some of my colleagues proposed using an equal-tailed credible interval as a substitute for the confidence interval. I made a subtle change: I thought the most appropriate choice would be the HPD (highest posterior density) interval as the substitute, used to reject the hypothesis that a treatment has zero effect.

However, I have been reading this post by @andrewgelman (Shortest posterior intervals), and I am even more confused than before. It seems that intervals are not the best way to summarize a posterior density, but it also seems that there is no established alternative. The idea of using a utility function is smart, but not realistic in the current context. My question is: what is the best way to reject a hypothesis in the Bayesian context and present an interval? Bayes factors are an idea that does not seem to have been developed very well. Also, what would be the correct answer in a fully Bayesian approach? Additionally, in the Bayesian context there is no multiple testing, because all the evidence comes from the data obtained. However, this would not be accepted by any agency, because when testing multiple hypotheses there is a good chance that one of them shows evidence purely by chance. What alternative would there be?

Thank you very much

The solution to multiple testing is to estimate all comparisons together using a multilevel model, rather than following the classical, noisy, approach of looking at the largest comparison in the data. See here:


As detailed in Introduction to Bayes for Evaluating Treatments, Bayesian methods are not intended for hypothesis testing but rather for taking actions / making decisions. And even though HPD intervals are preferred, one should separate such intervals from the directional assertions that are made in treatment comparisons. For example, key assertions in a blood pressure reduction study may be “there is a reduction in systolic blood pressure (SBP)” and “there is a reduction in SBP of more than 5 mmHg”. Posterior probabilities would be computed for those.

With uncertainty intervals the analyst controls the coverage probability. With specific assertions the analyst controls the threshold or interval limits and gets back a probability.
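
As a minimal sketch of that distinction, assume posterior draws of the treatment effect on SBP are already available. Here they are simulated from a normal distribution with a made-up mean and SD purely for illustration; they do not come from any trial, and the variable names are hypothetical. Each assertion then reduces to a proportion of draws, while the interval fixes the probability and returns the limits:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical posterior draws of the SBP change (treatment minus control, mmHg).
# In practice these would come from the fitted model; negative values mean a reduction.
effect = rng.normal(loc=-6.0, scale=3.0, size=20_000)

# Assertion 1: there is a reduction in SBP.
p_any_reduction = np.mean(effect < 0)

# Assertion 2: the reduction is more than 5 mmHg.
p_reduction_gt_5 = np.mean(effect < -5)

# Interval: the analyst fixes the probability (0.95) and gets back the limits.
lower, upper = np.quantile(effect, [0.025, 0.975])

print(f"Pr(reduction > 0 mmHg) = {p_any_reduction:.3f}")
print(f"Pr(reduction > 5 mmHg) = {p_reduction_gt_5:.3f}")
print(f"95% equal-tailed interval: ({lower:.2f}, {upper:.2f}) mmHg")
```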


I agree with Frank that Bayes rocks for decision making. But we also use Bayes for inference, even if no actions or decisions are immediately forthcoming. Just because we want to understand the world better. That said, I also agree with Frank that hypothesis testing is typically a bad idea. Some of my reasons for that attitude are discussed here:


There is so much in this review of 3 books on causal inference. Thanks Andrew for alerting me to this review.


The original post asks (paraphrased): how can I reject hypotheses in a Bayesian clinical trial … and present an interval … and test multiple hypotheses … and present the results to agencies?

Here are two considerations. First, what is best practice? Second, how should you report whatever analysis you landed on? Re best practice, Andrew and Frank gave pointers in their replies. Re reporting, see this recent post in the forum:

In particular, note that if you use a Bayesian hypothesis test, the BARG recommend not to report only the Bayes factor, but to report also the decision criterion for the posterior model probability and the minimum prior model probability for which the criterion is exceeded.
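
To make the last quantity concrete, here is a small sketch that assumes exactly two competing models whose prior probabilities sum to 1, so that posterior odds equal the Bayes factor times the prior odds. The function names are hypothetical and the formula is just that identity solved for the prior probability; it is not code from the BARG.

```python
def posterior_model_prob(bayes_factor: float, prior_prob: float) -> float:
    """Posterior probability of model 1 given BF (model 1 vs model 2) and its prior probability."""
    prior_odds = prior_prob / (1.0 - prior_prob)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


def min_prior_prob(bayes_factor: float, criterion: float) -> float:
    """Smallest prior probability of model 1 at which its posterior probability reaches `criterion`."""
    # Solve BF * p / (1 - p) = c / (1 - c) for p.
    return criterion / (criterion + bayes_factor * (1.0 - criterion))


# Illustrative values only: a Bayes factor of 19 and a decision criterion of 0.95.
p_min = min_prior_prob(bayes_factor=19.0, criterion=0.95)
print(f"minimum prior model probability: {p_min:.3f}")
print(f"posterior probability at that prior: {posterior_model_prob(19.0, p_min):.3f}")
```

A reader whose prior probability for the model is below the reported minimum would not reach the decision criterion with the same data.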


Thank you very much for your responses. I have also read the paper attached by @andrewgelman regarding multiple hypothesis comparisons, which I find to be very elegant; however, it is quite difficult to apply to hypotheses that measure very different things; for instance, in clinical trials, where a drug needs to present its efficacy and safety across a set of outcomes that differ from one another.
To put this in context, the FDA and EMA have always used methods to control the Family-Wise Error Rate (FWER), such as the Holm procedure or Bonferroni correction, in addition to requiring the presentation of a confidence interval for each hypothesis. However, I want to change these arbitrary criteria, and I am trying to introduce Bayesian knowledge into the approval of drugs (despite complicating my life). What would be the correct way to present the outcomes for approval? As I mentioned earlier, it feels [1] as if the Stan community is against presenting any 95% intervals. Reading the BARG [2] from @JohnKruschke, it seems that the alternative would be to present a Bayes factor, or the posterior probability that an outcome is greater than 0, right? And regarding multiple comparisons, these agencies do not understand that there is no need to correct for multiple hypotheses. Should a multilevel model be built for, say, 8 hypotheses, even if they measure very different things? For example, how would this be done to jointly test two hypotheses that measure the impact on a biomarker and on a satisfaction test? Is there any solid argument to convince them that there is no need for multiplicity correction? Thank you very much for all the interaction in this post.

Multiplicities come from the chances you give data to be extreme. Bayes doesn’t compute probabilities about data, so everything is different. Bayes deals with the issues of multiple endpoints in a variety of ways detailed here, including specifying a prior for each estimand and computing “more than null” probabilities, e.g. Pr(blood pressure reduction > 3 mmHg), which in a frequentist sense hugely reduces \alpha.

One of the examples in the linked e-book uses Pr(at least 3 out of 5 endpoints are improved by the treatment). This to me is a very appealing way to deal with multiple outcomes, if you can specify the joint models.
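
A rough sketch of how such a probability could be computed, assuming joint posterior draws of the five treatment effects are available. Here they are simulated from a multivariate normal with made-up means, SDs, and a common correlation; every number is hypothetical, and this is not the e-book's code. "Improved" is taken to mean an effect greater than zero:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical joint posterior for 5 endpoint effects (positive = improvement).
mean = np.array([0.4, 0.3, 0.1, 0.5, -0.1])
sd = np.array([0.2, 0.25, 0.2, 0.3, 0.2])
rho = 0.3  # assumed common posterior correlation between endpoints
corr = np.full((5, 5), rho)
np.fill_diagonal(corr, 1.0)
cov = corr * np.outer(sd, sd)

# Each row is one joint draw of all five effects.
draws = rng.multivariate_normal(mean, cov, size=20_000)

improved = draws > 0                   # per draw, which endpoints are improved
n_improved = improved.sum(axis=1)      # number of improved endpoints per draw

p_each = improved.mean(axis=0)         # marginal Pr(endpoint j improved)
p_at_least_3 = np.mean(n_improved >= 3)
p_all_5 = np.mean(n_improved == 5)

print("Pr(improved), per endpoint:", np.round(p_each, 3))
print(f"Pr(at least 3 of 5 improved) = {p_at_least_3:.3f}")
print(f"Pr(all 5 improved)           = {p_all_5:.3f}")
```

Because the count is taken within each joint draw, the correlation between endpoints is carried through automatically, which is why the joint model matters.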

Bayesian thinking does not discourage the provision of uncertainty intervals (highest posterior density intervals are best) but does not use them for the primary evidence presentation or decision making.
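
For readers comparing the two kinds of interval raised in the original question, here is a minimal sketch that computes both from posterior draws. It assumes a unimodal posterior and uses the common "shortest window over sorted draws" approximation to the HPD interval; the function names are hypothetical and the gamma draws are only a stand-in for a skewed posterior.

```python
import numpy as np


def equal_tailed_interval(draws, mass=0.95):
    """Interval that leaves (1 - mass) / 2 posterior probability in each tail."""
    tail = (1.0 - mass) / 2.0
    lo, hi = np.quantile(draws, [tail, 1.0 - tail])
    return lo, hi


def hpd_interval(draws, mass=0.95):
    """Shortest interval containing `mass` of the draws (assumes a unimodal posterior)."""
    x = np.sort(np.asarray(draws))
    n = x.size
    k = int(np.ceil(mass * n))           # number of draws inside each candidate window
    widths = x[k - 1:] - x[: n - k + 1]  # width of every window of k consecutive draws
    i = int(np.argmin(widths))
    return x[i], x[i + k - 1]


rng = np.random.default_rng(3)
skewed = rng.gamma(shape=2.0, scale=1.0, size=20_000)  # stand-in for a skewed posterior

eti = equal_tailed_interval(skewed)
hpd = hpd_interval(skewed)
print(f"equal-tailed: ({eti[0]:.2f}, {eti[1]:.2f}), width {eti[1] - eti[0]:.2f}")
print(f"HPD:          ({hpd[0]:.2f}, {hpd[1]:.2f}), width {hpd[1] - hpd[0]:.2f}")
```

For a symmetric posterior the two intervals nearly coincide; they differ mainly when the posterior is skewed.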


Just a quick reply (I’m in a rush)…
First, if an agency requires controlling error rates, well, then you’re stuck jumping through that hoop and going down the frequentist rabbit hole.
Second, the Bayesian approach advocated by Gelman et al. uses hierarchical shrinkage to reduce the magnitudes of spurious effects. But this approach still involves choices (of model structure, of priors) and has no explicit measurement of error rates. And a hierarchical structure doesn’t have to be estimated with Bayesian methods, but doing it the Bayesian way yields coherent interval estimates (instead of approximated confidence intervals) and robust convergence. But you could do hierarchical models with shrinkage and p values (gasp!).
Third, if you do Bayesian model comparison / hypothesis testing, the BARG recommend making decisions based on the posterior model probability exceeding some decision threshold such as 95% or whatever (not just greater than 0, even the prior makes it greater than 0), and reporting the minimal prior probability that would exceed that decision threshold.
Finally, one aspect not mentioned is best practice when doing multiple Bayesian hypothesis tests, that is, doing lots of pairwise model comparisons for a variety of different models. I’ll leave it to others to discuss that.
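
To illustrate the hierarchical shrinkage in the second point, here is a small sketch under stated assumptions: a normal-normal model y_j ~ N(theta_j, se_j^2), theta_j ~ N(mu, tau^2), a flat prior on mu, and a uniform prior on tau over a grid. The estimates and standard errors are made up, and the grid approximation stands in for a full Stan fit; it returns posterior means only.

```python
import numpy as np

# Hypothetical raw estimates and standard errors for 8 comparisons.
y = np.array([2.8, 0.8, -0.3, 0.7, -0.1, 0.1, 1.8, 1.2])
se = np.array([1.5, 1.0, 1.6, 1.1, 0.9, 1.1, 1.0, 1.8])

# Grid over the between-comparison SD tau (uniform prior on this grid: an assumption).
tau_grid = np.linspace(0.01, 5.0, 500)

log_post = np.empty(tau_grid.size)
cond_mean = np.empty((tau_grid.size, y.size))
for i, tau in enumerate(tau_grid):
    total_var = se**2 + tau**2
    prec = 1.0 / total_var
    v_mu = 1.0 / prec.sum()
    mu_hat = v_mu * (prec * y).sum()      # precision-weighted overall mean
    # Log marginal posterior of tau with mu integrated out (up to a constant).
    log_post[i] = (0.5 * np.log(v_mu)
                   - 0.5 * np.log(total_var).sum()
                   - 0.5 * (prec * (y - mu_hat) ** 2).sum())
    # E[theta_j | tau, y]: each estimate is pulled toward mu_hat.
    w = tau**2 / total_var                # weight kept on the comparison's own estimate
    cond_mean[i] = w * y + (1.0 - w) * mu_hat

# Normalize the grid weights and average the conditional means over tau.
weights = np.exp(log_post - log_post.max())
weights /= weights.sum()
theta_mean = weights @ cond_mean

for raw, shrunk in zip(y, theta_mean):
    print(f"raw {raw:5.2f} -> partially pooled {shrunk:5.2f}")
```

The largest raw comparison is pulled toward the overall mean the most in absolute terms, which is the sense in which shrinkage tempers the "largest comparison in the data".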


Excellent points, John. The first reaction to a request for “error rates” (they are neither errors nor rates, but that’s another story …) is to fight back by saying that the reason a Bayesian approach was selected is that we wish to compute different probabilities than frequentist probabilities about data; we’re interested in evidence about effects. If we wanted to deal with the probability of making an assertion when by some miracle the treatment effect was exactly zero, we would have chosen a frequentist approach. We have to at least go down kicking and screaming.