Bayesian power analysis for sample size planning


This is sort of a “higher-level” question, but I think there are a lot of people in this forum who might have stumbled upon the very same issue and could share their own experiences or insights!

So, I prepared a registered report (RR) which includes a section about sample size planning.

As I had an idea of the typical sample size (from several similar previous studies), I first reported that. However, I have received the Editor’s feedback on the RR, and the first comment asks me to perform a power analysis to inform and support my decision (which I think is a sensible idea).

Since I represented my research hypotheses (as different population-level coefficients) with a GLMM that will be estimated with Bayesian methods (using brms), my first thought was "let’s simulate data and fit the model!" (akin to this excellent walkthrough).

However, as my model has several predictors and group-level effects over most of those predictors, each run is rather slow (~10 h). I know this because I performed a parameter recovery analysis using the previous studies’ sample size (to assess the feasibility of my planned analysis).

Thus, even if I test just a few sample sizes (e.g., 4) with a small simulation batch (n_sim = 500) per sample size, this results in 4 × 500 × 10 h = ~20,000 h, i.e. more than 2 years!
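The arithmetic behind that estimate can be sketched in a few lines of Python. The 10 h, 4 sample sizes, and 500 simulations come from the numbers above; the `n_workers` argument is a hypothetical addition that assumes the fits are independent and can be spread evenly over parallel workers.

```python
# Back-of-the-envelope cost of the simulation study described above.
hours_per_fit = 10   # approximate time for one model fit
n_sample_sizes = 4   # candidate sample sizes to evaluate
n_sim = 500          # simulated data sets per sample size

def total_years(hours_per_fit, n_sample_sizes, n_sim, n_workers=1):
    """Wall-clock years, assuming independent fits spread evenly over n_workers."""
    total_hours = hours_per_fit * n_sample_sizes * n_sim / n_workers
    return total_hours / (24 * 365)

serial_years = total_years(hours_per_fit, n_sample_sizes, n_sim)
parallel_days = total_years(hours_per_fit, n_sample_sizes, n_sim, n_workers=100) * 365

print(f"serial: {serial_years:.2f} years")        # about 2.28 years
print(f"100 workers: {parallel_days:.1f} days")   # about 8.3 days
```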

So, it seems like this path is a dead-end…

To my -probably limited- knowledge, there might be some non-exclusive alternatives (albeit non-ideal):

  • Specify a simplified version of the model (e.g., with group-level effects over the model Intercept, or even removing group-level effects altogether?) that still captures my hypotheses, and perform a power analysis with simulations.
  • Argue that a proper power analysis is infeasible (for the reasons described above).

I have found a related question in this very forum, and I have looked into the book chapter referred to in the accepted answer, but wasn’t able to extract any information that could guide my decision on what to do.

What do you think? Has anyone else been in this situation?

Please share any comments, experiences, or suggestions (or even other forums where it might be more likely to get answers)!


Hi there,

You are on the right track. I don’t use power analysis and it strikes me as weird from a Bayesian PoV. Running simulations with differing population numbers sounds good.

My first thought is the model might be misspecified. However without seeing the model and the simulated data this is a guess.

For next steps

  1. Start with the simplest but still interesting model you can. Run that with your simulated data. Can you recover all the parameters? Do the priors make sense given your data and previous knowledge?

  2. If you can post your simplified model code and simulated data code, that will help folks help you out.


I’m just going to echo @Ara_Winter here: a power analysis is a little odd from a Bayesian framework. Traditionally, the reason to do a power analysis is to show reasonable ability to correctly reject the null hypothesis. This being said, I do know that Kruschke has a paper that does talk about what power analysis looks like in Bayesian statistics. It may be helpful to look over his characterization of Bayesian power analysis and see whether his approach makes more sense for what you’ve proposed in your RR.

Also, just to echo the point on simulation being a good place to start. I wanted to pass along a resource recommended to me on this forum for simulating datasets.

One thing you may consider looking at as well is sensitivity checks in the priors. Commonly, power analyses are done to determine a target sample size to detect effects of various sizes, so editors/reviewers may be familiar with power analysis as a proxy for sample size adequacy. There are several papers on the use of Bayesian methods for small samples where perfectly good results are obtained with reasonably informative priors, and I think that’s where maybe “sample size adequacy” might come in with Bayesian methods. Essentially, you might want to just show one of two things: (a) that your sample size is large enough that your results are largely insensitive to your priors or (b) that your priors are informative enough to arrive at “correct” inferences given your sample size. There was a recent post on this forum on prior sensitivity checks. This should all be do-able with simulated data.
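A minimal sketch of such a prior sensitivity check on simulated data, written in plain Python rather than brms so that it runs in seconds. It assumes a conjugate normal model with known noise sd, so the posterior is available in closed form. The two priors, the true effect of 0.3, and the sample sizes are all hypothetical values chosen for illustration.

```python
import random
import statistics

def posterior(y, prior_mean, prior_sd, sigma=1.0):
    """Conjugate normal-normal update for a mean with known noise sd `sigma`."""
    n = len(y)
    prior_prec = 1.0 / prior_sd**2
    data_prec = n / sigma**2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * statistics.fmean(y))
    return post_mean, post_var**0.5

random.seed(1)
true_effect = 0.3  # assumed value, for illustration only
priors = {"skeptical": (0.0, 0.2), "diffuse": (0.0, 10.0)}  # hypothetical (mean, sd)

for n in (10, 100, 1000):
    y = [random.gauss(true_effect, 1.0) for _ in range(n)]
    means = {name: posterior(y, m, s)[0] for name, (m, s) in priors.items()}
    gap = abs(means["skeptical"] - means["diffuse"])
    print(n, {k: round(v, 3) for k, v in means.items()}, "gap:", round(gap, 3))
```

If the gap between the posterior means under the two priors is negligible at the planned sample size, that supports option (a); if it is not, the choice of prior needs to be justified as in option (b).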

Though you plan to use brms, which is really good about creating Stan models that are efficient, you might still consider sharing your planned call to brms to see if there might be any speed-up tricks that people could recognize. There have been a few posts on the forum in the last couple of months on using sufficient statistics to speed up modeling. Similarly, brms support of cmdstan also means that potentially multithreading or a variational Bayes estimator could be used to speed up these preliminary fits. If there’s any of these kinds of recommendations from the community, then it may help ensure you have reasonable wait times for your simulation fits.


I have written this preprint about power analyses using brms

The idea covered in the paper is a simple comparison of two groups, but it is in principle extendable to more complex model parameters as well. Most likely some of the exact code might not be applicable, but I think there are some useful things in it - for example, I show how you can use a package called furrr to run simulations in parallel and reduce the simulation time by a long way. Perhaps a combination of a simplified model and using furrr to run the things in parallel could be useful.

I do get that people think the idea of power analyses might be strange in a Bayesian framework, but I’m not entirely sure I agree. Firstly, one can also use sample size planning to determine how precise an estimate can become given different sample sizes. Secondly, we often use rules of thumb to determine whether we think an effect is of sufficient size to care about (such as a parameter being inside or outside a region of practical equivalence), and power analysis can help us do this with different effect sizes. Before we run an experiment, we do not know what the data will be, so it can make sense to consider hypothetical data sets we might observe and see what power we have given different observed data and different sample sizes, to help plan a study.
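To make both uses concrete, here is a rough Python sketch of a simulation-based analysis for a two-group comparison. It is not taken from the preprint: it assumes a known sd and a flat prior on the group difference, so each "fit" is a closed-form posterior rather than a brms model. The ROPE of (-0.1, 0.1), the true difference of 0.5, and the function names are hypothetical.

```python
import random
from statistics import NormalDist, fmean

def simulate_once(n_per_group, true_diff, rng, sigma=1.0, rope=(-0.1, 0.1)):
    """Simulate one two-group study and summarise the posterior of the difference.

    Assumes known sigma and a flat prior on the difference, so the posterior is
    Normal(observed difference, sigma * sqrt(2 / n_per_group)).
    """
    a = [rng.gauss(0.0, sigma) for _ in range(n_per_group)]
    b = [rng.gauss(true_diff, sigma) for _ in range(n_per_group)]
    post = NormalDist(fmean(b) - fmean(a), sigma * (2.0 / n_per_group) ** 0.5)
    lo, hi = post.inv_cdf(0.025), post.inv_cdf(0.975)
    outside_rope = lo > rope[1] or hi < rope[0]  # 95% interval excludes the ROPE
    return outside_rope, hi - lo

def bayesian_power(n_per_group, true_diff, n_sim=1000, seed=0):
    """Share of simulated studies whose interval excludes the ROPE, plus mean width."""
    rng = random.Random(seed)
    results = [simulate_once(n_per_group, true_diff, rng) for _ in range(n_sim)]
    power = sum(1 for outside, _ in results if outside) / n_sim
    mean_width = fmean(width for _, width in results)
    return power, mean_width

summary = {n: bayesian_power(n, true_diff=0.5) for n in (20, 50, 100, 200)}
for n, (power, width) in summary.items():
    print(f"n = {n:3d}  P(interval excludes ROPE) = {power:.2f}  mean 95% width = {width:.2f}")
```

The first column addresses the decision-style question (how often would we conclude the effect is outside the ROPE?), and the second addresses precision. With a real model, `simulate_once` would wrap a data simulation plus a model fit.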

Do let me know if you have any questions. All the code is available via a link in the text. I’d also welcome feedback on the paper from others on the forum.


There are some implicit assumptions being made here that I think are complicating some of the discussion.

Formally a power analysis is a calibration of a null-hypothesis significance test. For each configuration of a non-null, or alternative, hypothesis one can repeatedly simulate data and see how often the null-hypothesis test correctly rejects the null hypothesis. In most null-hypothesis significance tests these calculations are done analytically to avoid the cost of simulation, but the results should be similar.
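As an illustration of that calibration, the following Python sketch compares simulated and analytic power for a very simple case: a two-sided z-test of \theta = 0 with known sd. The test, the effect sizes, and n = 30 are assumptions chosen only because the analytic answer is easy to write down.

```python
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()

def analytic_power(theta, n, sigma=1.0, alpha=0.05):
    """Power of a two-sided z-test of theta = 0 when the true mean is theta."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = theta * n ** 0.5 / sigma
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)

def simulated_power(theta, n, sigma=1.0, alpha=0.05, n_sim=2000, seed=0):
    """Fraction of simulated data sets in which the same test rejects the null."""
    rng = random.Random(seed)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    rejections = 0
    for _ in range(n_sim):
        ybar = fmean(rng.gauss(theta, sigma) for _ in range(n))
        rejections += abs(ybar * n ** 0.5 / sigma) > z_crit
    return rejections / n_sim

for theta in (0.2, 0.5, 0.8):
    print(theta, round(analytic_power(theta, 30), 3), round(simulated_power(theta, 30), 3))
```

The two columns should agree up to Monte Carlo error, which shrinks as `n_sim` grows.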

Ideally there would be a lower bound to this continuum of true positive rates, but when the null hypothesis is nested within the alternative hypothesis, for example when the null is defined by \theta = 0 and the alternative is defined by \theta \ne 0, then the lower bound would always be zero because some configuration of the alternative model will be close enough to the null hypothesis to be indistinguishable from the null hypothesis.

The typical fix is a bit of a kludge: one introduces a minimum effect size and then defines the power as the true positive rate corresponding to that alternative model configuration. A minimal sample size can then be defined as the amount of data needed to keep the power above a desired level.

The whole procedure starts to fall apart a bit when the null and alternative hypotheses are no longer one-dimensional. Moreover, all of the calculations require very simple model assumptions, and executing a power analysis for more sophisticated null and alternative hypotheses becomes extremely expensive, if not outright infeasible. Unfortunately many take these simple assumptions, and a cheap power analysis, for granted.

For more discussion with pictures see for example Section 4.2 of [1803.08393] Calibrating Model-Based Inferences and Decisions.

So, what does all of this mean?

Firstly, a power analysis is applicable only when one wants to emulate a null-hypothesis significance test in a more Bayesian way, for example by defining a null model through a region of practical equivalence and then rejecting that null based on the posterior probability allocated to that region. If no such decision will be made then a power analysis simply isn’t needed. If a decision is of interest then one can replace the power analysis with a Bayesian calibration. See for example Sections 4.3-4.5 of [1803.08393] Calibrating Model-Based Inferences and Decisions as well as Section 3.3 of Probabilistic Modeling and Statistical Inference and Section 1.3 of Towards A Principled Bayesian Workflow.

That said, even if a binary decision isn’t planned, one still might be interested in the range of posterior behaviors that might arise given your modeling assumptions. For example, one might be curious how often posterior quantile intervals cover the true value or how often the posterior is narrow enough to resolve a certain difference from zero. Here one can consider Bayesian calibration with a different utility function than true/false positive rates, or even a more qualitative utility function implemented as a visualization such as the “eye chart” discussed in that last reference.
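A minimal Python sketch of such a calibration, assuming a conjugate normal model (zero-mean normal prior, known noise sd) so that each posterior is available in closed form. The "truth" is drawn from the prior, data are simulated from it, and two utilities are recorded: whether the 90% posterior interval covers the truth and whether it excludes zero. The prior sd, sample sizes, and function name are hypothetical.

```python
import random
from statistics import NormalDist, fmean

def calibrate(n, prior_sd=0.5, sigma=1.0, n_sim=2000, level=0.90, seed=0):
    """Return (coverage, share of intervals excluding zero) over prior predictive draws."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    n_covered = n_excludes_zero = 0
    for _ in range(n_sim):
        theta = rng.gauss(0.0, prior_sd)                         # "truth" drawn from the prior
        ybar = fmean(rng.gauss(theta, sigma) for _ in range(n))  # one simulated data set
        post_var = 1.0 / (1.0 / prior_sd**2 + n / sigma**2)
        post_mean = post_var * (n / sigma**2) * ybar             # prior mean is zero
        half_width = z * post_var**0.5
        lo, hi = post_mean - half_width, post_mean + half_width
        n_covered += lo <= theta <= hi
        n_excludes_zero += lo > 0 or hi < 0
    return n_covered / n_sim, n_excludes_zero / n_sim

for n in (10, 50, 200):
    coverage, resolved = calibrate(n)
    print(f"n = {n:3d}  coverage = {coverage:.3f}  excludes zero = {resolved:.3f}")
```

Because the simulated truths come from the same prior used in the fit, coverage should sit near the nominal level at every n; the share of intervals excluding zero is the quantity that changes with sample size.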

There’s no way to reduce these Bayesian calibration studies to a single number, so they probably won’t satisfy the naive request of the editor, but they can still be extremely useful in building expectations for how your inferences might perform presuming that your model captures the relevant features of the true data generating process.

Additionally when working with sophisticated models the computation needed for even a crude Bayesian calibration with just a few simulated observations can indeed be intense. Conditioned on the simulated data each posterior fit is independent of the others and can be run in parallel, however, which allows for the use of cluster/cloud resources when they’re available. In the end however, even a crude Bayesian calibration can be useful to identify some potential behaviors so long as we don’t assume that it covers all possible behaviors.
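The independence of the fits is what makes the parallelism straightforward. A rough sketch using only the Python standard library follows; `fit_one` is a hypothetical stand-in for one expensive posterior fit, and the seeding scheme is an assumption chosen so that results do not depend on which worker runs which task. A thread pool is used here for simplicity, which suits fits that run as external processes; for CPU-bound pure-Python work `concurrent.futures.ProcessPoolExecutor` is the usual substitute.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from statistics import fmean

def fit_one(task):
    """Stand-in for one expensive posterior fit (hypothetical placeholder).

    Each task carries its own seed, so results do not depend on scheduling order.
    """
    n, seed = task
    rng = random.Random(seed)
    y = [rng.gauss(0.3, 1.0) for _ in range(n)]  # simulate one data set
    return n, fmean(y)                           # placeholder "posterior summary"

def run_all(sample_sizes, n_sim, max_workers=4):
    """Dispatch every (sample size, replicate) pair as an independent task."""
    tasks = [(n, 1000 * n + i) for n in sample_sizes for i in range(n_sim)]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fit_one, tasks))    # map preserves task order

results = run_all(sample_sizes=(20, 50), n_sim=10)
print(len(results), results[0])
```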