Tagging @Quentin
This is one of the reasons why, despite all the smart people working on this and the nice packages (bridgesampling is really cool), I’m still very wary of most uses of Bayes factors in the wild. That said, I think something like what @vasishth is proposing would be a good option for the bridgesampling package to provide.
I think it’s unavoidable that people think BDA = Bayes factor. I just saw a blog post yesterday saying this. Better to make it part of the core Stan documentation, or else you will have people like me reporting unstable numbers. I hereby offer my services in that regard.
BTW, Paul’s suggestion to increase the number of iterations worked. I get some small differences between repeated computations of the Bayes factor, but it’s basically stable.
Also, the results now make sense.
I think what we need is a way to diagnose convergence problems of the bridgesampling algorithm that tells us if the value it converged to may be unstable.
The default method for obtaining the marginal likelihood using the bridgesampling package already provides an error measure. It uses the formula for the approximate relative mean-squared error developed by Frühwirth–Schnatter (2004). So one could use this to investigate whether or not the marginal likelihood on which the Bayes factor is based is precise enough. However, we do not propagate this uncertainty to the calculation of the Bayes factor. I am not sure if there is a straightforward way to do so, but this is also not super important.
The main issue is that bridge sampling, like other sampling approaches, is a numerical method for which diagnosing convergence problems is generally not trivial. The added difficulty in our case is that the data that is used for sampling is itself a sample generated by an MCMC chain. Thus, there are two levels of numerical uncertainty; uncertainty from the posterior and from the bridge sampler.
So the only real way to ensure stability of the sampler is to do the equivalent of running independent MCMC chains, as done in the initial code here. Run Stan multiple times to obtain multiple independent sets of samples from the posterior (of course, each of those already consists of multiple independent chains). For each of those sets of posterior samples, obtain at least one estimate of the marginal likelihood. If the estimates of the marginal likelihood are all near enough to each other (e.g., relative to the magnitude of the differences in marginal likelihoods between the different models), the Bayes factor will be kosher. If not, usually more samples from the posterior distribution are necessary.
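The check described above can be sketched in a few lines of base R. This is only an illustration: the log marginal likelihood values below are made up, and in practice each one would come from a separate Stan run followed by bridge sampling.

```r
# Hypothetical log marginal likelihood estimates, one per independent Stan run.
# The numbers are invented for illustration only.
logml_m1 <- c(-512.31, -512.42, -512.28)
logml_m2 <- c(-515.10, -514.95, -515.22)

# All pairwise log Bayes factors (model 1 over model 2):
# rows index the runs of model 1, columns the runs of model 2.
log_bf <- outer(logml_m1, logml_m2, FUN = "-")

# Spread of the log Bayes factor across all combinations of runs.
range(log_bf)
# The same spread on the Bayes factor scale.
exp(range(log_bf))

# Spread within each model, to set against the difference between models.
diff(range(logml_m1))
diff(range(logml_m2))
```

If the within-model spread is small relative to the between-model difference, every combination of runs tells the same story; if not, more posterior samples are probably needed.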
So, to us it is not immediately apparent what kind of check to add to our package. The bridge_sampler function already contains the argument repetitions, which allows one to obtain more than one marginal likelihood estimate from one fixed set of posterior samples. This allows an estimation of the uncertainty on the second level. However, to get a full overview of the uncertainty, a new set of posterior samples is necessary.
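A minimal sketch of the repetitions argument, using the matrix interface of bridge_sampler on a toy target rather than a real Stan fit. The "posterior samples" here are simply draws from a bivariate standard normal, so the true log normalizing constant, log(2 * pi), is known; the object names are illustrative.

```r
library(bridgesampling)

set.seed(1)
# Stand-in for posterior samples: draws from a bivariate standard normal.
samples <- matrix(rnorm(2e4), ncol = 2)
colnames(samples) <- c("x1", "x2")

# Unnormalized log density; its normalizing constant is 2 * pi.
log_density <- function(samples.row, data) {
  -0.5 * sum(samples.row^2)
}

lb <- c(x1 = -Inf, x2 = -Inf)
ub <- c(x1 = Inf, x2 = Inf)

# Ten bridge sampling estimates from the same fixed set of samples.
rep_result <- bridge_sampler(
  samples = samples, log_posterior = log_density, data = NULL,
  lb = lb, ub = ub, repetitions = 10, silent = TRUE
)

rep_result$logml         # one estimate per repetition
range(rep_result$logml)  # second-level variability only
log(2 * pi)              # true value for this toy target
```

The spread across repetitions only reflects the bridge sampler's own variability for this one set of samples; it says nothing about how much the estimate would change with a fresh set of posterior draws.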
Perhaps most importantly, estimating the marginal likelihood usually requires at least one order of magnitude more samples than parameter estimation. We warn about this both in our paper and on the help page. For example (from ?bridgesampling::bridge_sampler):
Also note that for testing, the number of posterior samples usually needs to be substantially larger than for estimation.
Maybe the easiest solution would be to add a similar warning to the help page of brms::bayes_factor, and also to encourage users to get at least two independent sets of posterior samples and estimates of the Bayes factor.
Hi Henrik,
Thanks for all the explanation. This is indeed very helpful. In the brms doc I point to the bridgesampling doc, but I will add some more information to the former to make sure people will read it.
And please excuse my naivety regarding convergence diagnostics for bridge sampling. I see that such warnings would be nontrivial to construct, still I think it would be nice to have it (even if this is wishful thinking; you understand that better).
Still, I think the situation is not ideal. Perhaps I could add an option to brms::bayes_factor that automatically computes the marginal likelihood multiple times and reports a vector of Bayes factors, so that variability in the latter is immediately visible.
As another perhaps naive question: Could we combine multiple estimates of the marginal likelihood (computed using repetitions) somehow to get a better estimate (maybe you covered this somewhere already)?
"And please excuse my naivety regarding convergence diagnostics for bridge sampling. I see that such warnings would be nontrivial to construct, still I think it would be nice to have it (even if this is wishful thinking; you understand that better).
Still, I think the situation is not ideal. Perhaps I could add an option to brms::bayes_factor that automatically computes the marginal likelihood multiple times and reports a vector of Bayes factors, so that variability in the latter is immediately visible."
As I said before, the problem is that, even if the bridge sampler has converged with perfect accuracy, this is conditional on one specific set of posterior samples. A fully adequate assessment of convergence requires at least two independent sets of posterior samples, and this is outside the scope of our package. So we will think about this to see if we can add something, but it will only be a poor solution. My advice: any paper that reports Bayesian model selection based on marginal likelihoods needs to have calculated the Bayes factor or posterior model probabilities based on at least two independent sets of posterior samples (the best solution is to look at all possible combinations across the different estimates and models).
"As another perhaps naive question: Could we combine multiple estimates of the marginal likelihood (computed using repetitions) somehow to get a better estimate (maybe you covered this somewhere already)?"
Hmm, if the estimate is unstable, this indicates too few posterior samples. I do not know of any theoretical results, but in my experience simply averaging or something like it is not guaranteed to converge on the true value. Unfortunately, more samples from the posterior distribution are necessary in this case.
I know that all these suggestions make the calculation of Bayes factors using bridge sampling quite expensive in terms of time and computational resources. Unfortunately, it is an inherently difficult problem due to the two levels of uncertainty. The solution appears to require quite a lot of samples. But at least there is a solution.
Thank you very much for your insights. I just added more stuff to the brms doc to reflect what you have said.
I think bridgesampling is an amazing package to treat a very difficult but relevant problem. I don’t know of any other comparably general approach to compute the marginal likelihood so I think the additional computational time is well worth it and I am more than happy to support it in brms :)
I have a couple of minor follow up questions if you don’t mind:
 What is it exactly that enables bridgesampling to perform so much more accurately with a higher number of samples? I assume it is the tails of the distribution, which tend to be estimated poorly with too few samples?
 If one wanted to have a helper function to split the posterior samples in half (say, by splitting the number of chains in case of equal number of chains) and to compute two bayes factors / marginal likelihoods each based on half of the samples, where would you put such a helper function?
 I understand where the bottleneck of the whole approach lies, but I think it could still be worth propagating the uncertainty from bridge_sampler to bayes_factor somehow to warn users about possibly unstable results. Something that brms can throw in a manner that is impossible for users to overlook (as we try with the posterior itself as well as with other indicators of model fit). Right now, the print method of bridge_sampler just shows the result without any indication of the error measure (or am I mistaken)?
I have the same intuition as you do, but no data or papers to back that up.
The natural way in our package would be to add this to the bridge_sampler function, which computes the marginal likelihood. It is maybe a nice possibility to have this. The only problem I can see is that in many real-life cases, the resulting stanfit object would get so huge that it might not actually be feasible. But we will discuss this.
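For illustration only, here is a rough base-R sketch of what such a split-half helper could look like. Neither function exists in bridgesampling or brms; the names split_chains_in_half and flatten_draws are hypothetical, and the draws are assumed to be stored as an iterations x chains x parameters array (the layout that as.array() returns for a stanfit).

```r
# Hypothetical helper: split draws stored as an
# iterations x chains x parameters array into two halves by chain.
split_chains_in_half <- function(draws) {
  stopifnot(length(dim(draws)) == 3)
  n_chains <- dim(draws)[2]
  stopifnot(n_chains >= 2, n_chains %% 2 == 0)
  first <- seq_len(n_chains / 2)
  list(
    half_1 = draws[, first, , drop = FALSE],
    half_2 = draws[, -first, , drop = FALSE]
  )
}

# Hypothetical helper: collapse one half into a draws x parameters matrix.
flatten_draws <- function(half) {
  out <- matrix(half, ncol = dim(half)[3])
  colnames(out) <- dimnames(half)[[3]]
  out
}

# Toy draws: 1000 iterations, 4 chains, 2 parameters.
set.seed(1)
draws <- array(rnorm(1000 * 4 * 2), dim = c(1000, 4, 2),
               dimnames = list(NULL, NULL, c("x1", "x2")))
halves <- split_chains_in_half(draws)
m1 <- flatten_draws(halves$half_1)
m2 <- flatten_draws(halves$half_2)
dim(m1)
dim(m2)
```

Each matrix could then be passed separately to the marginal likelihood computation. Note that each half contains only half of the draws, so each estimate is based on fewer samples than the full fit would provide.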
It is true that the print method does not have it, only the summary method. This is based on the error_measures function. Using the example from ?bridge_sampler:
bridge_result
# Bridge sampling estimate of the log marginal likelihood: 1.8378
# Estimate obtained in 4 iteration(s) via method "normal".
summary(bridge_result)
#
# Bridge sampling log marginal likelihood estimate
# (method = "normal", repetitions = 1):
#
# 1.837801
#
# Error Measures:
#
# Relative Mean-Squared Error: 8.477251e-08
# Coefficient of Variation: 0.0002911572
# Percentage Error: 0.0291%
#
# Note:
# All error measures are approximate.
error_measures(bridge_result)
# $`re2`
# [1] 8.477251e-08
#
# $cv
# [1] 0.0002911572
#
# $percentage
# [1] "0.0291%"
We should probably add something like this to the example.
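One rough way to carry such an error measure over to a Bayes factor is sketched below. It is not something the package does, and it rests on stated assumptions: the two marginal likelihood estimates are independent, and each coefficient of variation is small enough that it approximates the standard deviation of the log estimate (a first-order delta-method argument). All input values are hypothetical.

```r
# Hypothetical inputs: log marginal likelihoods and their approximate
# coefficients of variation (the cv element returned by error_measures()).
logml_1 <- -512.31; cv_1 <- 0.02
logml_2 <- -515.10; cv_2 <- 0.05

log_bf <- logml_1 - logml_2

# Delta method: sd(log(ml_hat)) is approximately cv(ml_hat) when cv is small.
# Independence of the two estimates is an assumption.
se_log_bf <- sqrt(cv_1^2 + cv_2^2)

# Rough interval for the Bayes factor, conditional on the posterior samples.
exp(log_bf + c(-2, 2) * se_log_bf)
```

This only reflects the bridge sampler's own uncertainty for one fixed set of posterior samples; the uncertainty from the posterior draws themselves is not covered.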
Hi all, I hope it’s ok to use this page for some follow-up questions about bridgesampling and brms, as it keeps coming up in search queries for Bayes factors and brms. For background, I have a brmsfit of a GLMM (family = bernoulli) with two grouping factors (participants and items, as is typical in my field). I have 20,000 posterior samples (instead of the 2000-6000 that would be sufficient for estimation), and all common model diagnostics look good. (The files are large, but if it helps I can upload the models or share a link.)
My goal is to compare a model with and without one of the fixed effects (while keeping the random slope for that effect in both models). I’ve used bridge_sampler with 4 repetitions (maxiter = 1000) and method = “warp3”. The estimated BFs that I obtain for the model comparison vary widely, including runs that seem to clearly favor one or the other model. Along with this comes the well-known warning that “logml could not be estimated within maxiter, rerunning with adjusted starting value. Estimate might be more variable than usual.” I’m wondering whether adding more samples will really solve the problem, or whether the issue lies in the fact that the sampler tries to explore the ‘typical’ region of the posterior distribution (rather than the tails)? Some specific questions I have:

Does bridge_sampler ‘care’ about the number of chains that provide the posterior samples, or are all posterior samples from the brms model pooled? I currently have used brms::combine_models() to combine various runs of the same model, so that 20,000 samples come from 8 rather than 4 chains.

The discussion above mentioned that the priors for estimation tend to be suboptimal as priors for model comparison. Currently, the brms models are fit with weakly regularizing priors, following recommendations at Prior Choice Recommendations · stan-dev/stan Wiki · GitHub. Is that a bad idea for the purpose of model comparison? (If so, perhaps the wiki page could be edited? @andrewgelman)

I’ve read the vignette for using bridge_sampler with rstanfits, and I’m about halfway through the “Bayes Factor Workflow” paper by @vasishth @bnicenboim @paul.buerkner @andrewgelman and @betanalpha (very helpful!), but I haven’t yet found recommendations that would deal with the type of problem I’m encountering. Any further reading suggestions would be welcome.
Alternative suggestions for testing the null would be appreciated, too. I’ve considered using brms’s hypothesis function to test the point hypothesis (via the Savage-Dickey method) but, based on what I’ve read, that is also likely to lead to unreliable (or even biased) estimates?
Thank you and apologies for ‘reheating’ this old thread.
I do not recommend using a Bayes factor. If you want to combine or compare the models, I recommend using leave-one-out cross-validation and stacking; see here:
http://www.stat.columbia.edu/~gelman/research/published/stacking_paper_discussion_rejoinder.pdf
Thank you. We’ve been using loo in other projects. For some reason, it completely escaped me to use it for the present purpose of testing the null. (to clarify, the “combination” of models I was referring to was only the combination of chains / samples from different runs of the same model).
For the present purpose, I don’t need model stacking or other ways to derive predictions from sets of models but rather am interested in a measure of the support for the null (the two models differ in one parameter). Would you then report the ELPD difference between the models along with its SE, and small differences (say less than 2 SEs) would indicate that the null model largely has the same predictive accuracy as the model with the additional parameter?
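For concreteness, the quantity being asked about can be computed from pointwise ELPD values as in the sketch below. The pointwise values here are simulated purely for illustration; in practice they would come from the loo results of the two fitted models. The standard error is the usual paired one, sqrt(n) times the standard deviation of the pointwise differences, which is what loo_compare reports. The sketch does not settle whether a small difference should count as support for the null.

```r
set.seed(1)
n <- 200

# Hypothetical pointwise ELPD contributions for the null and the full model.
elpd_null <- rnorm(n, mean = -0.60, sd = 0.3)
elpd_full <- elpd_null + rnorm(n, mean = 0.005, sd = 0.05)

# Paired difference: sum of pointwise differences and its standard error.
pointwise_diff <- elpd_full - elpd_null
elpd_diff <- sum(pointwise_diff)
se_diff <- sqrt(n) * sd(pointwise_diff)

c(elpd_diff = elpd_diff, se_diff = se_diff, ratio = elpd_diff / se_diff)
```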
Hi Florian,
You wrote:
"The discussion above mentioned that the priors for estimation tend to be suboptimal as priors for model comparison. Currently, the brms models are fit with weakly regularizing priors, following recommendations at Prior Choice Recommendations · stan-dev/stan Wiki · GitHub. Is that a bad idea for the purpose of model comparison? (If so, perhaps the wiki page could be edited?)"
From your message, it sounds like you are letting brms determine the priors. I would never do that for BF calculations (or even otherwise, unless I have literally no idea what the priors should be, which hardly ever happens). Just as a sanity check, could you try using more informative priors, determined using prior predictive checks?
What happens when you fit the model using anova() in R?
Another thing I would want to do is to use simulated data that is similar to your data, with the ground truth known.
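A minimal sketch of such a simulation for a Bernoulli GLMM with participants and items, in base R. All parameter values are assumptions chosen for illustration, and only varying intercepts are simulated to keep the example short; a realistic version would mirror the design and the random-effects structure of the actual data.

```r
set.seed(123)
n_subj <- 40
n_item <- 24

# Assumed ground truth (log-odds scale).
beta_0  <- 0.5   # intercept
beta_1  <- 0     # effect of interest; 0 means the null model is true
sd_subj <- 0.8   # sd of by-participant intercepts
sd_item <- 0.5   # sd of by-item intercepts

# Fully crossed design with a two-level predictor that varies
# within participants and within items.
sim <- expand.grid(subj = seq_len(n_subj), item = seq_len(n_item))
sim$x <- ifelse((sim$subj + sim$item) %% 2 == 0, -0.5, 0.5)

# Varying intercepts and the linear predictor.
u_subj <- rnorm(n_subj, 0, sd_subj)
u_item <- rnorm(n_item, 0, sd_item)
eta <- beta_0 + beta_1 * sim$x + u_subj[sim$subj] + u_item[sim$item]

# Binary responses.
sim$y <- rbinom(nrow(sim), size = 1, prob = plogis(eta))
head(sim)
```

Fitting both models to many such data sets, once with beta_1 set to 0 and once with a nonzero value, shows how often the model comparison points in the right direction.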
Hi Shravan,
Thanks for your suggestions.
As for the priors, I’m not using brms default priors (fwiw, brms has plenty of warnings that one should not do that). I’m using the Gelman-recommended weakly regularizing priors from the website I referenced. I could try more informative priors, but a) for the problem at hand it is not clear what those would be (though of course I can try to figure it out) and b) I would like to understand why that would help the sampler explore the tails (which I thought was the issue).
As for your anova() question, I’m not quite sure what you mean. anova() can’t be applied to brmsfits, last I checked. Do you mean I should fit the model with glmer and then apply anova() to those fits? That works, though glmer wouldn’t converge with random correlations, which are relevant to this particular project, and it would go counter to the reasons I switched to Bayesian analyses.
Good idea also about simulating the ground truth. I have done that (that’s one of the reasons I came across your very helpful document, because I wanted to check whether others had done the same), but it didn’t really help me understand why bridgesampling throws the errors it throws despite the high number of samples (I’ve used bridgesampling for other projects with similar models, similar amounts of data, and fewer posterior samples, and didn’t run into this problem).
Following @andrewgelman’s suggestion, I have now applied loo to the models instead. That resulted in a warning (based on the Pareto k threshold) that 253 observations (about 10%) in the model are problematic. I then reran loo with moment_match = T. This ran for about 10 hours on four cores and then returned an error message:
Error : $ operator is invalid for atomic vectors
In addition: Warning messages:
1: In doTryCatch(return(expr), name, parentenv, handler) :
restarting interrupted promise evaluation
2: In doTryCatch(return(expr), name, parentenv, handler) :
restarting interrupted promise evaluation
3: In doTryCatch(return(expr), name, parentenv, handler) :
restarting interrupted promise evaluation
4: In doTryCatch(return(expr), name, parentenv, handler) :
restarting interrupted promise evaluation
5: In doTryCatch(return(expr), name, parentenv, handler) :
restarting interrupted promise evaluation
6: In parallel::mclapply(X = I, mc.cores = cores, FUN = function(i) loo_moment_match_i_fun(i)) :
scheduled core 1, 4 encountered error in user code, all values of the job will be affected
Error: Moment matching failed. Perhaps you did not set 'save_all_pars' to TRUE when fitting your model?
But fwiw, save_all_pars is set to T; or rather, under the new syntax of brm: save_pars = save_pars(all = T). So, I am at a loss as to what’s going on. The model with samples is about 400 MB, but here’s a link in case anybody is willing to check what they get if they loo it: TFJ_sp_combined_brms.RDS (Dropbox)
I would first start with a simpler model (varying intercepts only, plus at most varying slopes for the target parameter), and check if that gets you somewhere. Fitting a “maximal” model under conditions of sparse data means you are just getting back the priors you specified for all those abstract variance components. Then one can try building up to more complex models.
By anova I meant the frequentist likelihood ratio test. I just wanted to know what the estimates of the effects are and how much uncertainty there is (SE) and what the likelihood ratio test says.
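A sketch of that frequentist check with lme4, run on simulated data so that it is self-contained. The data-generating values and the varying-intercepts-only structure are assumptions for illustration; with real data the formulas would follow the actual design.

```r
library(lme4)

set.seed(42)
# Simulated stand-in data: 30 participants crossed with 20 items.
d <- expand.grid(subj = factor(1:30), item = factor(1:20))
d$x <- ifelse((as.integer(d$subj) + as.integer(d$item)) %% 2 == 0, -0.5, 0.5)
eta <- 0.3 + 0.4 * d$x +
  rnorm(30, 0, 0.8)[as.integer(d$subj)] +
  rnorm(20, 0, 0.5)[as.integer(d$item)]
d$y <- rbinom(nrow(d), size = 1, prob = plogis(eta))

# Model with and without the fixed effect of interest.
m1 <- glmer(y ~ x + (1 | subj) + (1 | item), data = d, family = binomial)
m0 <- glmer(y ~ 1 + (1 | subj) + (1 | item), data = d, family = binomial)

# Estimates with standard errors, then the likelihood ratio test.
summary(m1)$coefficients
lrt <- anova(m0, m1)
lrt
```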
Hello Sir,
Perhaps increase the available memory, if the moment matching is running on a remote server with the option to assign more memory per node/core. I eventually started associating these errors ( … invalid atomic vector & moment match failed) and warning item 6 with the algorithm taking up more memory than I had assigned on the supercomputing server I was using. In my case, moment-matching LOO eventually needed 40 GB to run uninterrupted.
Thank you for the pdf link.