And please excuse my naivety regarding convergence diagnostics for bridge sampling. I see that such warnings would be non-trivial to construct; still, I think they would be nice to have (even if this is wishful thinking; you understand this better than I do).
Still, I think the situation is not ideal. Perhaps I could add an option to brms::bayes_factor that automatically computes the marginal likelihood multiple times and reports a vector of Bayes factors, so that the variability in the latter is immediately visible.
As I said before, the problem is that even if the bridge sampler has converged with perfect accuracy, this is conditional on one specific set of posterior samples. A fully adequate assessment of convergence requires at least two independent sets of posterior samples, and this is outside the scope of our package. So we will think about whether we can add something, but it will only ever be a poor solution. My advice: any paper that reports Bayesian model selection based on marginal likelihoods needs to have calculated the Bayes factor or posterior model probabilities from at least two independent sets of posterior samples (the best solution is to look at all possible combinations across the different estimates and models).
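To make the "all possible combinations" advice concrete, here is a minimal sketch in Python (deliberately not R, and not part of bridgesampling or brms). It assumes you have already obtained one log marginal likelihood estimate per independent set of posterior samples for each model. The function names `pairwise_log_bayes_factors` and `summarize` are hypothetical, and the numbers are made up purely for illustration.

```python
from itertools import product
import math


def pairwise_log_bayes_factors(logml_a, logml_b):
    """Log Bayes factors (model A over model B) for every pairing of estimates.

    Each list holds log marginal likelihood estimates for one model, one
    estimate per independent set of posterior samples.
    """
    return [a - b for a, b in product(logml_a, logml_b)]


def summarize(log_bfs):
    """Smallest and largest Bayes factor, plus their spread on the log scale."""
    lo, hi = min(log_bfs), max(log_bfs)
    return {"min_bf": math.exp(lo), "max_bf": math.exp(hi), "log_range": hi - lo}


# Made-up log marginal likelihoods: two independent posterior sample sets
# per model (illustrative values only, not results from any real fit).
logml_m1 = [-120.3, -120.9]
logml_m2 = [-122.1, -121.6]

log_bfs = pairwise_log_bayes_factors(logml_m1, logml_m2)
summary = summarize(log_bfs)

print("log BFs (M1 over M2):", [round(x, 3) for x in log_bfs])
print("BF range: %.2f to %.2f" % (summary["min_bf"], summary["max_bf"]))
```

If the smallest and largest Bayes factor in such a table would lead to different conclusions, that is a sign that more posterior samples are needed before reporting anything.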
As another perhaps naive question: Could we combine multiple estimates of the marginal_likelihood (computed using repetitions) somehow to get a better estimate (maybe you covered this somewhere already)?
Hmm, if the estimate is unstable, this indicates too few posterior samples. I do not know of any theoretical results, but in my experience simple averaging (or something similar) is not guaranteed to converge on the true value. Unfortunately, more samples from the posterior distribution are necessary in this case.
I know that all these suggestions make the calculation of Bayes factors using bridge sampling quite expensive in terms of time and computational resources. Unfortunately, it is an inherently difficult problem because of the two levels of uncertainty: the variability between sets of posterior samples, and the variability of the bridge sampling estimate given one set of samples. The solution appears to require quite a lot of samples. But at least there is a solution.