EDIT: This was an attempt to write guidance. It turns out I was quite far out of my depth and the text sounded much more conclusive than it should. I think it is correct to currently classify it as “some thoughts” rather than guidance. I still think it is useful to have a place to list possible approaches, but the text definitely needs more work. Sorry for the confusion.
Coming from a classical statistics background, Stan users often want to be able to test some sort of null hypothesis. Similarly, one may have multiple statistical models corresponding to different scientific hypotheses and wonder which one is better supported by the data, or simply be uncertain which of many possible models to use for the data at hand.
As a general introduction, Workflow Techniques for the Robust Use of Bayes Factors by Schad et al. gives a good high-level view of what a “hypothesis” is in the context of Bayesian statistics (see section “Inference and discovery”).
Some general principles I think are useful to bear in mind:
- The connection between a scientific problem/hypothesis/… and a statistical model/hypothesis/… is not straightforward and the mapping is often many to many.
- Point hypotheses (e.g. presence/absence of an effect) are hard to reconcile with the Bayesian paradigm and unlikely to be completely defensible in most real problems (e.g. many interventions have small effects, but few have exactly zero effect). Point hypotheses are also problematic in the presence of systematic biases (see Section 4 of https://arxiv.org/pdf/1803.08393.pdf for more). We should rather think on a continuous scale: “How big is the effect?”, “How likely is it that the effect is larger than X?”.
- We should embrace uncertainty - often we cannot definitively decide between competing models or we are unable to estimate something we care about with high precision. We should not seek absolute confidence where there is none to be found - but we may still be able to learn something at least.
- Model misspecification is real. Almost all the approaches we use rely implicitly on the assumption that at least some of the models we have chosen are a reasonable match to the actual reality. This is by no means given and efforts should be made to verify modelling assumptions.
With that said, there are approaches that apply to situations where one would use hypothesis testing/model selection/… in the classical context. Currently there is no strong consensus in the community on how to approach such problems, so below is just a list of options to consider.
- Compare a simpler model to a larger model: Separately fit a model with fewer parameters (e.g. one with the effect set to zero) and compare it to the full model.
  - You can use the `loo` package to approximate a comparison of predictive performance via leave-one-out cross-validation (see also the Cross-validation FAQ). However, LOO doesn’t tell you much about the “presence” of an effect. It tells you (roughly) whether you can predict your data better when you take the effect into account, and it is a function of both your signal-to-noise ratio and the size of the actual effect.
  - Alternatively, you could use Bayes factors, but those can be problematic, as they are very sensitive to the priors you use in your model and often hard to compute accurately. Technical aspects are neatly discussed in Workflow Techniques for the Robust Use of Bayes Factors by Schad et al. An important insight shown there is that a Bayes factor compares how likely the data are under the prior predictive distribution of each model, which is why BFs are so sensitive to prior choice. Another important point is that using Bayes factors for inference (getting posterior probabilities of models) can have good properties, while using them to choose a single model can behave very badly. The personal essay by D.J. Navarro has a critique of Bayes factors from a different angle. Bayes factors may give quite unintuitive results if the actual data generating process is different from all the models you consider. Like LOO, a BF is also a function of both your signal-to-noise ratio and the size of the actual effect. See also When are Bayesian model probabilities overconfident?. The `bridgesampling` package can compute Bayes factors based on fitted models. For `brms` models, it can be invoked via the `hypothesis` function.
  - [1905.08737] On the marginal likelihood and cross-validation discusses the connection between cross-validation and marginal likelihood (and by extension Bayes factors).
- Determine a range of practical equivalence: In this approach, you include the effect of interest in the model. Strictly speaking, $P(\beta = 0) = 0$ for all parameters $\beta$ with continuous priors. But you can use domain expertise to say that e.g. a difference of 0.5 is practically irrelevant. $P(|\beta| < 0.5)$ can then be computed directly from posterior samples. The advantage here is that (with enough data) you can not only claim that “If our assumptions are correct, the effect is large” but also that the effect is actually small. The `hypothesis` function in the `brms` package also supports this use case directly.
- Think qualitatively: Danielle Navarro has a great essay about model selection and how purely mathematical approaches can fail us: Between the devil and the deep blue sea. Checking whether the models satisfy some qualitative properties can also be of interest. This is related to posterior predictive checks (also called posterior retrodictive checks) - a model that fails some checks is (other things being equal) less preferable than a model that passes such checks.
- Use multiple models at the same time: We don’t necessarily need to choose just one model - in fact, in many cases we don’t have enough information to reliably select just one model, and using a single model would thus hide the uncertainty we have about model selection.
  - Multiverse analysis: If there are multiple models to choose from and we don’t have a good reason to prefer some, we may as well fit all of the models and see if our conclusions are actually sensitive to model choice. See Steegen et al. for further discussion.
  - Bayesian model stacking: Stacking assigns a weight to each candidate model and we can then create a joint prediction as a weighted combination of the predictions of individual models. In some situations, stacking can give good results even if all the models considered are individually far from the true data generating process. See the vignette on stacking using the `loo` package and the original paper by Yao et al.
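To make the LOO comparison from the list above a bit more concrete, here is a minimal Python sketch of how a difference in expected log predictive density (elpd) and its standard error can be summarised from pointwise values. This is only an illustration: the pointwise values are simulated stand-ins (in practice they would come from something like the `loo` package), and the function name `compare_elpd` is made up for this example.

```python
import numpy as np

def compare_elpd(elpd_a, elpd_b):
    """Return (elpd_diff, se_diff) for model A minus model B.

    elpd_a, elpd_b: pointwise elpd values, one per observation,
    computed for the same observations in the same order.
    """
    diff = np.asarray(elpd_a, dtype=float) - np.asarray(elpd_b, dtype=float)
    n = diff.size
    elpd_diff = diff.sum()
    # Standard error of a sum of n paired pointwise differences.
    se_diff = np.sqrt(n) * diff.std(ddof=1)
    return elpd_diff, se_diff

# Simulated stand-ins for pointwise elpd values (assumption, not real output).
rng = np.random.default_rng(1)
n_obs = 200
elpd_small = rng.normal(-1.30, 0.5, size=n_obs)
elpd_large = elpd_small + rng.normal(0.02, 0.3, size=n_obs)

d, se = compare_elpd(elpd_large, elpd_small)
print(f"elpd_diff = {d:.1f}, se_diff = {se:.1f}")
```

A difference that is small relative to its standard error suggests the data cannot distinguish the predictive performance of the two models, which is not the same as showing that the effect is absent.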
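The prior sensitivity of Bayes factors mentioned above can be seen in a toy example where everything is available in closed form. The sketch below assumes observations $y_i \sim \text{Normal}(\mu, 1)$ with known unit variance, a point model M0 with $\mu = 0$, and an alternative M1 with prior $\mu \sim \text{Normal}(0, \tau^2)$. Under these assumptions the Bayes factor depends on the data only through the sample mean, whose prior predictive distribution is $\text{Normal}(0, 1/n)$ under M0 and $\text{Normal}(0, \tau^2 + 1/n)$ under M1. The model, the numbers and the function names are all chosen purely for illustration.

```python
import math

def normal_logpdf(x, var):
    """Log density of Normal(0, var) evaluated at x."""
    return -0.5 * (math.log(2 * math.pi * var) + x * x / var)

def log_bf10(ybar, n, tau):
    """Log Bayes factor of M1 (mu ~ Normal(0, tau^2)) against M0 (mu = 0).

    Assumes n observations with known unit variance and sample mean ybar.
    """
    log_m1 = normal_logpdf(ybar, tau ** 2 + 1.0 / n)  # prior predictive under M1
    log_m0 = normal_logpdf(ybar, 1.0 / n)             # prior predictive under M0
    return log_m1 - log_m0

# Same (made-up) data summary, different prior widths for the effect.
ybar, n = 0.2, 100
for tau in (0.1, 1.0, 10.0, 100.0):
    print(f"tau = {tau:6.1f}  BF10 = {math.exp(log_bf10(ybar, n, tau)):.4f}")
```

With the data held fixed, widening the prior on the effect moves the Bayes factor from mildly favouring M1 to strongly favouring M0, even though the posterior for $\mu$ under M1 barely changes once the prior is wide.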
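For the range of practical equivalence approach described above, the probability $P(|\beta| < 0.5)$ is just the fraction of posterior draws that fall inside the range. Here is a minimal Python sketch; the draws are simulated as a stand-in for real posterior samples of $\beta$ extracted from a fitted model, and the helper name `prob_in_rope` is hypothetical.

```python
import numpy as np

def prob_in_rope(draws, half_width):
    """Fraction of posterior draws with |draw| < half_width."""
    draws = np.asarray(draws, dtype=float)
    return float(np.mean(np.abs(draws) < half_width))

# Stand-in for posterior draws of beta (assumption: in practice these
# would be extracted from the fitted Stan model).
rng = np.random.default_rng(2024)
beta_draws = rng.normal(loc=0.1, scale=0.15, size=4000)

p_rope = prob_in_rope(beta_draws, 0.5)
print(f"P(|beta| < 0.5) is approximately {p_rope:.3f}")
```

The half-width 0.5 is the domain-specific judgement here; the computation itself is trivial once that choice has been made and justified.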
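To illustrate the idea behind stacking from the list above, the sketch below chooses model weights that maximise the summed log score of the weighted predictive density. It is a rough illustration under several assumptions: the pointwise log predictive densities are computed from two fixed, hand-picked predictive distributions instead of real leave-one-out values, the weights are found with a simple fixed-point (EM-style) update that is not necessarily what the `loo` package does internally, and the name `stacking_weights` is made up for this example.

```python
import numpy as np

def stacking_weights(lpd, n_iter=500):
    """Weights on the simplex maximising sum_i log(sum_k w_k * exp(lpd[i, k])).

    lpd: array of shape (n_obs, n_models) with pointwise log predictive
    densities (ideally leave-one-out values).
    """
    lpd = np.asarray(lpd, dtype=float)
    # Rescale each row for numerical stability; this does not move the optimum.
    dens = np.exp(lpd - lpd.max(axis=1, keepdims=True))
    n_models = dens.shape[1]
    w = np.full(n_models, 1.0 / n_models)
    for _ in range(n_iter):
        mix = dens * w                               # weighted densities
        resp = mix / mix.sum(axis=1, keepdims=True)  # share of each model per point
        w = resp.mean(axis=0)                        # EM-style weight update
    return w

def normal_logpdf(y, mean):
    """Log density of Normal(mean, 1) evaluated at y."""
    return -0.5 * np.log(2 * np.pi) - 0.5 * (y - mean) ** 2

# Simulated data from a mixture, so neither candidate model is right alone.
rng = np.random.default_rng(7)
n_obs = 1000
from_second = rng.random(n_obs) < 0.3
y = np.where(from_second, rng.normal(3.0, 1.0, n_obs), rng.normal(0.0, 1.0, n_obs))

# Two candidate predictive distributions: Normal(0, 1) and Normal(3, 1).
lpd = np.column_stack([normal_logpdf(y, 0.0), normal_logpdf(y, 3.0)])
weights = stacking_weights(lpd)
print("stacking weights:", np.round(weights, 3))
```

In this toy setup both candidate models are individually wrong, yet the weighted combination can describe the data much better than either one, which is the situation where stacking is most attractive.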
Further reading:
- The preprint on Bayesian workflow by Gelman et al.
- @betanalpha’s case studies on workflow: Falling (In Love With Principled Modeling), Towards A Principled Bayesian Workflow
- @avehtari’s talk at StanCon Helsinki on Model assessment and selection and another talk on Frequency evaluation, hypothesis testing and variable selection - skip to the first slide with the title Hypothesis testing.
- The introductory chapter of R. McElreath’s Rethinking has a nice discussion of the connection between statistical models and scientific hypotheses. Recorded talks are also available online (you want the first lecture).
- Workflow Techniques for the Robust Use of Bayes Factors by Schad et al. discusses how to test the robustness of Bayes factors for practical problems.