Hypothesis testing, model selection, model comparison - some thoughts

EDIT: This was an attempt to write guidance. It turns out I stepped quite far from my depth and the text sounded much more conclusive than it should. I think it is correct to currently just classify it as “some thoughts” rather than a guidance. I still think it is useful to have a place to list possible approaches, but the text definitely needs more work. Sorry for the confusion.

Coming from a classical statistics background, Stan users often want to be able to test some sort of null hypothesis. Similarly, one may have multiple statistical models corresponding to different scientific hypotheses and wonder which one is better supported by the data, or simply be uncertain which of many possible models to use for the data at hand.

Some general principles I think it is useful to bear in mind:

  • The connection between a scientific problem/hypothesis/… and a statistical model/hypothesis/… is not straightforward and the mapping is often many to many.
  • Point hypotheses (e.g. presence/absence of an effect) are hard to reconcile with the Bayesian paradigm and unlikely to be completely defensible in most real problems (e.g. many interventions have small effects, but few have exactly zero effect). Point hypotheses are also problematic in the presence of systematic biases; for more see Section 4 of https://arxiv.org/pdf/1803.08393.pdf We should rather think on a continuous scale: “How big is the effect?”, “How likely is it that the effect is larger than X?”.
  • We should embrace uncertainty - often we cannot definitively decide between competing models or we are unable to estimate something we care about with high precision. We should not seek absolute confidence where there is none to be found - but we may still be able to learn something at least.
  • Model misspecification is real. Almost all the approaches we use rely implicitly on the assumption that at least some of the models we have chosen are a reasonable match to the actual reality. This is by no means given and efforts should be made to verify modelling assumptions.

With that said, there are approaches that apply to situations where one would use hypothesis testing/model selection/… in the classical context. Currently there is no strong consensus in the community on how to approach such problems, so below is just a list of options to consider.

  • Compare a simpler model to a larger model: Separately fit a model with fewer parameters (e.g. setting an effect to zero).

    • You can use the loo package to approximately compare predictive performance via leave-one-out cross-validation (see also Cross-validation FAQ). However, LOO doesn’t tell you much about the “presence” of an effect. It tells you (roughly) whether you can predict your data better if you take the effect into account, and it is a function of both your signal-to-noise ratio and the size of the actual effect.

    • Alternatively you could use Bayes factors (BF) to do that, but those can be problematic, as they are very sensitive to the priors you use in your model and can be tricky to compute. Some more interesting criticism is by Data Colada and in the personal essay by D.J. Navarro. Bayes factors relate to the relative KL-divergence of the models and may behave strangely if the actual data generating process is different from all the models you consider. BF is also a function of both your signal-to-noise ratio and the size of the actual effect. See also When are Bayesian model probabilities overconfident?. The bridgesampling package can compute Bayes factors based on fitted models. For brms models, it can be invoked via the bayes_factor function; the hypothesis function instead reports an evidence ratio, which for point hypotheses is a Bayes factor computed via the Savage-Dickey density ratio.

  • Determine range of practical equivalence: In this approach, you include the effect of interest in the model. Strictly speaking P(\beta = 0) = 0 for all parameters \beta with continuous priors. But you can use domain expertise to say that e.g. a difference of 0.5 is practically irrelevant. P(|\beta| < 0.5) can then be computed directly from posterior samples. The advantage here is that (with enough data) you can not only claim that “If our assumptions are correct, the effect is large” but also that the effect is actually small. The hypothesis function in the brms package also supports this use case directly.

  • Think qualitatively: Danielle Navarro has a great essay about model selection and how purely mathematical approaches can fail us: Between the devil and the deep blue sea. Checking whether the models satisfy some qualitative properties can also be of interest. This is related to posterior predictive checks (also called posterior retrodictive checks) - a model that fails some checks is (other things being equal) less preferable than a model that passes such checks.

  • Use multiple models at the same time: We don’t necessarily need to choose just one model - in fact, in many cases we don’t have enough information to reliably select just one model, and using a single model would thus hide the uncertainty we have about model selection.

    • Multiverse analysis: If there are multiple models to choose from and we don’t have a good reason to prefer some, we may as well fit all of the models and see if our conclusions are actually sensitive to model choice. See Steegen et al. for further discussion.

    • Bayesian model stacking: Stacking assigns a weight to each candidate model and we can then create a joint prediction as a weighted combination of the predictions of individual models. In some situations, stacking can give good results even if all the models considered are individually far from the true data generating process. See the vignette on stacking using the loo package and the original paper Yao et al.
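For the range-of-practical-equivalence option listed above, the probability P(|\beta| < 0.5) is just the fraction of posterior draws that fall inside the interval. Below is a minimal sketch of that computation, not a recommended workflow. The draws are a synthetic stand-in (in practice they would be extracted from a fitted Stan model), the helper names `prob_in_rope` and `prob_greater_than` are hypothetical, and the half-width 0.5 is simply the example value used in the text.

```python
import random


def prob_in_rope(draws, half_width):
    """Fraction of posterior draws with |beta| < half_width."""
    inside = sum(1 for b in draws if abs(b) < half_width)
    return inside / len(draws)


def prob_greater_than(draws, threshold):
    """Fraction of posterior draws with beta > threshold."""
    return sum(1 for b in draws if b > threshold) / len(draws)


# ASSUMPTION: synthetic stand-in for posterior draws of an effect beta.
# Real draws would come from a fitted model, not from a normal distribution.
rng = random.Random(1)
beta_draws = [rng.gauss(0.2, 0.15) for _ in range(4000)]

# "How likely is it that the effect is practically irrelevant?"
p_rope = prob_in_rope(beta_draws, 0.5)

# "How likely is it that the effect is larger than X?" with X = 0
p_positive = prob_greater_than(beta_draws, 0.0)

print(f"P(|beta| < 0.5) ~ {p_rope:.3f}")
print(f"P(beta > 0)     ~ {p_positive:.3f}")
```

Both numbers are Monte Carlo estimates, so they carry sampling error, and they are only as meaningful as the model and the chosen threshold. The threshold itself has to come from domain expertise, not from the data.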

Further reading:


BF is also a function of both your signal-to-noise ratio and the size of the actual effect. A paper on sensitivity of BF:

  • Oscar Oelrich, Shutong Ding, Måns Magnusson, Aki Vehtari, and Mattias Villani (2020). When are Bayesian model probabilities overconfident? arXiv preprint arXiv:2003.04026.

Great, thanks for bringing this up. I can put it in the text or (preferably for me) you should be able to edit the original post directly - I think those FAQs would work best if a larger group of people would write and edit them, so please do :-)

I wanted to comment on the experimental nature of the post in particular – I don’t think that there is enough consensus in the community on issues like these that a “primer” is appropriate. In particular I want to push back on the use of phrases like “the general approach we advocate” where the singular voice implies not only consensus but also a certain authority within the project. Much more appropriate for any “official” topic is a review that attempts only to collect various perspectives on the topic, including criticisms in as many ways as possible, that places no judgement on any particular perspective.

As to my opinions on the specific content:

The point hypothesis discussion requires far more nuance. Phrases like “nature does not like zeroes” aren’t particularly meaningful. More relevant is that small effects are indistinguishable from zero effects for any finite amount of data, so the difference between those hypotheses is not a practical inferential question. Perhaps most importantly, point hypotheses are problematic in the presence of systematic effects; for more see Section 4 of https://arxiv.org/pdf/1803.08393.pdf.

“Embracing uncertainty” is also vague. What is needed here is a careful discussion of inference (quantifying all hypotheses consistent with the data) versus decision making (choosing one that is best under some criteria) and when each is useful in statistical analyses. Encompassing the two as well are model-based calibrations.

While model misspecification is important, it’s important in general. Why is it particularly relevant to the discussion of hypothesis testing?

The posterior predictive density implicit in the loo default is also based on Kullback-Leibler divergences in the same way as Bayes factors, and so they are vulnerable to some of the same pathologies. More importantly, without any discussion of why these methods would even be used for model comparison there’s no context for discussing why one would be better than another.


@avehtari and @betanalpha are right to point out that I was a bit overeager in writing this (and quite out of my depth). I tried to edit the text to convey more uncertainty. Thanks for pointing this out. I would still very much welcome any edits/further discussion, as this is a topic that comes up repeatedly and AFAIK we don’t have a good place to direct users with those questions that would give at least a rough overview of what one can do (especially as I find that many forum users come with the preconception that if you want to “test a hypothesis” with a Bayesian model, you need to use Bayes factors).

Upon a bit more reflection: I think that in writing the original post I overstepped some boundaries that I should have noticed (especially in terms of speaking for the community and being overconfident in an area I don’t understand well). I apologize.

Also I should note that Aki suggested having some pre-screening and discussion of those proposed FAQs before just posting them online. I think this is a sensible way forward, although I have yet to figure out if it is best carried out here on the forums (which I would prefer for the low barrier to entry, but which has worse support for collaborating on a document) or using an external service (Google Docs/Overleaf/HackMD).

I personally don’t think that such a goal is necessarily achievable, nor particularly useful to most users with those preconceptions. The problem is that before understanding the most useful answers they first have to understand the question that they are actually trying to ask. That requires backing up from Bayes factors to question the relevance of point hypotheses and then backing up even further to the relevance of testing in general. Telling someone asking about Bayes factors to not test at all is just going to be confusing unless they are led backwards through that re-evaluation.

For this particular topic I think that it’s much more effective to provide an explicit narrative. You’re looking for Bayes factors? Cool, let’s talk about point hypotheses first. Actually let’s talk about testing, inference, and decisions first. Then we can go back to point hypotheses and then we can go back to Bayes factors.

Regarding the general experiment: collaborative reviews can be quite useful so long as they don’t mislead the reader. I think that a few sentences noting that these topics are heavily debated and that the goal of the review is simply to introduce as many of the important concepts as possible would go a long way to setting a more responsible expectation. Documents like this should provide information, not instructions.