# Bayes factor using brms

Hi all,

Attached are some data that I am trying to do a Bayes factor analysis on using brms.
Reproducible code is below. I set pretty reasonable priors for the fixed effects factors. I am testing the null hypothesis that the parameter for the factor dat is 0, against the alternative prior that the parameter for dat is Normal(0,1). After fitting the full and null models, i.e., models with and without dat as a fixed factor, I computed bayes factor using the function bayes_factor in brms 100 times, and each time I get a new number:

```
summary(bf_values)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
0.05725 0.12663 0.19257 0.23821 0.30977 0.85361
```

So the BF in favor of the alternative ranges between roughly 1 and 17! Once the models are fitted, why does the Bayes factor vary so much?

dat.txt (16.2 KB)

```r
# load data (dat.txt is attached above):
dat <- read.table("dat.txt", header = TRUE)

library(brms)

## Priors:
## - Null: effect of dat = 0
## - Alternative: Normal(0,1) for the fixed-effects coefficients of interest
priors_logrt_normal <- c(set_prior("normal(0,10)", class = "b"),
                         set_prior("normal(0,1)", class = "b", coef = "dat"),
                         set_prior("normal(0,1)", class = "b", coef = "int"),
                         set_prior("normal(0,1)", class = "sd"),
                         set_prior("normal(0,1)", class = "sigma"),
                         set_prior("lkj(2)", class = "cor"))

## full model (dat included as a fixed effect):
full1 <- brm(log(region7) ~ dat + adj + int +
               (1 + dat + adj + int | subj) +
               (1 + dat + adj + int | item),
             data = dat,
             family = gaussian(),
             prior = priors_logrt_normal,
             warmup = 1000,
             iter = 2000,
             chains = 4,
             save_all_pars = TRUE)

## null model (dat removed from the fixed effects, kept in the random slopes):
NULLpriors_logrt_normal <- c(set_prior("normal(0,10)", class = "b"),
                             set_prior("normal(0,1)", class = "b", coef = "int"),
                             set_prior("normal(0,1)", class = "sd"),
                             set_prior("normal(0,1)", class = "sigma"),
                             set_prior("lkj(2)", class = "cor"))

null1 <- brm(log(region7) ~ adj + int +
               (1 + dat + adj + int | subj) +
               (1 + dat + adj + int | item),
             data = dat,
             family = gaussian(),
             prior = NULLpriors_logrt_normal,
             warmup = 1000,
             iter = 2000,
             chains = 4,
             save_all_pars = TRUE)

set.seed(4321)

## compute the Bayes factor (null over full) 100 times:
bf_values <- rep(NA, 100)
for (i in 1:100) {
  bf01 <- bayes_factor(null1, full1)
  bf_values[i] <- bf01$bf
}

summary(bf_values)
#    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
# 0.05725 0.12663 0.19257 0.23821 0.30977 0.85361
```

sessionInfo()

```
R version 3.4.4 (2018-03-15)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets
[6] methods   base

other attached packages:
[1] rlist_0.4.6.1    HDInterval_0.1.3 lattice_0.20-35
[4] magrittr_1.5     dplyr_0.7.4      tidyr_0.8.0
[7] gdata_2.18.0     brms_2.1.9       ggplot2_2.2.1
[10] Rcpp_0.12.16     devtools_1.13.5

loaded via a namespace (and not attached):
[1] Brobdingnag_1.2-5    httr_1.3.1
[5] threejs_0.3.1        shiny_1.0.5
[7] assertthat_0.2.0     stats4_3.4.4
[9] yaml_2.1.19          pillar_1.2.1
[11] backports_1.1.2      glue_1.2.0
[13] digest_0.6.15        colorspace_1.3-2
[15] htmltools_0.3.6      httpuv_1.3.6.2
[17] Matrix_1.2-12        plyr_1.8.4
[19] dygraphs_1.1.1.4     pkgconfig_2.0.1
[21] rstan_2.17.3         purrr_0.2.4
[23] xtable_1.8-2         mvtnorm_1.0-7
[25] scales_0.5.0         git2r_0.21.0
[27] tibble_1.4.2         bayesplot_1.4.0
[29] DT_0.4               withr_2.1.2
[31] shinyjs_1.0          lazyeval_0.2.1
[33] mime_0.5             memoise_1.1.0
[35] evaluate_0.10.1      papaja_0.1.0.9735
[37] nlme_3.1-131.1       xts_0.10-2
[39] colourpicker_1.0     data.table_1.10.4-3
[41] rsconnect_0.8.8      tools_3.4.4
[43] loo_1.1.0            matrixStats_0.53.1
[45] stringr_1.3.1        munsell_0.4.3
[47] bindrcpp_0.2         compiler_3.4.4
[49] rlang_0.2.0          grid_3.4.4
[51] htmlwidgets_1.0      crosstalk_1.0.0
[53] igraph_1.2.1         miniUI_0.1.1
[55] base64enc_0.1-3      rmarkdown_1.9.8
[57] gtable_0.2.0         codetools_0.2-15
[59] inline_0.3.14        abind_1.4-5
[61] curl_3.1             markdown_0.8
[63] reshape2_1.4.3       R6_2.2.2
[65] gridExtra_2.3        rstantools_1.4.0
[67] zoo_1.8-1            knitr_1.20
[69] bridgesampling_0.4-0 bindr_0.1.1
[71] shinystan_2.4.0      shinythemes_1.1.1
[73] rprojroot_1.3-2      stringi_1.2.2
[75] parallel_3.4.4       coda_0.19-1
```

This may happen with the bridgesampling method, and I agree this is problematic…

For bridgesampling to yield stable results, you may need many more posterior samples. Say, iter = 10000, warmup = 1000, chains = 4.
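Assuming the models from the code above, one way to apply this advice without rewriting the `brm()` calls is brms's `update()` method, which reuses the compiled model where possible so only the sampling is redone (a sketch; the iteration numbers are illustrative):

```r
# Refit both models with many more post-warmup draws before bridge sampling.
# 11000 - 1000 = 10000 post-warmup draws per chain.
full1_long <- update(full1, warmup = 1000, iter = 11000, chains = 4)
null1_long <- update(null1, warmup = 1000, iter = 11000, chains = 4)

# The Bayes factor should now be much more stable across repeated calls:
bayes_factor(null1_long, full1_long)
```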

Thanks. Would it be possible to set a default of, say, 10 runs in bayes_factor, and then issue a warning if the results differ beyond a certain tolerance that can be set as a parameter? The instability could serve as a diagnostic, like R-hat. I noticed the problem after running it only 2-3 times.

I think this is something you have to ask the developers of the bridgesampling package.

Tagging @Quentin

This is one of the reasons why, despite all the smart people working on this and the nice packages (bridgesampling is really cool), I'm still very wary of most uses of Bayes factors in the wild. That said, I think something like what @vasishth is proposing would be a good option for the bridgesampling package to provide.

I think itās unavoidable that people think BDA=Bayes factor. I just saw a post yesterday on a blog saying this. Better to make it part of the core Stan documentation else you will have people like me reporting unstable numbers. I hereby offer my services in that regard.

BTW, Paulās suggestion to increase the number of iterations worked. I get some small differences between repeated computations of the Bayes factor, but itās basically stable.

Also, the results now make sense.


I think what we need is a way to diagnose convergence problems of the bridgesampling algorithm that tells us if the value it converged to may be unstable.


The default method for obtaining the marginal likelihood using the `bridgesampling` package already provides an error measure. It uses the formula for the approximate relative mean-squared error developed by Frühwirth-Schnatter (2004). So one could use this to investigate whether or not the marginal likelihood on which the Bayes factor is based is precise enough. However, we do not propagate this uncertainty to the calculation of the Bayes factor. I am not sure there is a straightforward way to do so, but this is also not super important.

The main issue is that bridge sampling, like other sampling approaches, is a numerical method for which diagnosing convergence problems is generally not trivial. The added difficulty in our case is that the data used for sampling is itself a sample generated by an MCMC chain. Thus, there are two levels of numerical uncertainty: uncertainty from the posterior and uncertainty from the bridge sampler.

So the only real way to ensure stability of the sampler is to do the equivalent of running independent MCMC chains, as done in the initial code here. Run `Stan` multiple times to obtain multiple independent sets of posterior samples (of course, each of those already consists of multiple independent chains). For each of those sets of posterior samples, obtain at least one estimate of the marginal likelihood. If the estimates of the marginal likelihood are all near enough to each other (e.g., considering the magnitude of the differences in marginal likelihoods between the different models), the Bayes factor will be kosher. If not, more samples from the posterior distribution are usually necessary.
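This recommendation could be sketched in R roughly as follows (hypothetical code, assuming `full1` and `null1` are brmsfit objects like those earlier in the thread): refit each model several times with different seeds, and obtain one marginal likelihood estimate per refit.

```r
n_runs <- 4
logml_full <- numeric(n_runs)
logml_null <- numeric(n_runs)

for (i in seq_len(n_runs)) {
  # Independent sets of posterior samples via fresh runs of the sampler:
  full_i <- update(full1, seed = 1000 + i)
  null_i <- update(null1, seed = 2000 + i)

  # One marginal likelihood estimate per set of posterior samples:
  logml_full[i] <- bridge_sampler(full_i)$logml
  logml_null[i] <- bridge_sampler(null_i)$logml
}

# If the estimates within each model are all close (relative to the
# differences between models), the resulting Bayes factors can be trusted:
logml_full
logml_null
```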

So, to us it is not immediately apparent what kind of check to add to our package. The `bridge_sampler` function already has a `repetitions` argument, which makes it possible to obtain more than one marginal likelihood estimate from one fixed set of posterior samples. This allows estimating the uncertainty on the second level. However, to get a full overview of the uncertainty, a new set of posterior samples is necessary.

Perhaps most importantly, estimating the marginal likelihood usually requires at least an order of magnitude more samples than estimation does. We warn about this both in our paper and on the help page. For example (from `?bridgesampling::bridge_sampler`):

> Also note that for testing, the number of posterior samples usually needs to be substantially larger than for estimation.

Maybe the easiest solution would be to add a similar warning to the help page of `brms::bayes_factor`, and also to encourage users to obtain at least two independent sets of posterior samples and estimates of the Bayes factor.


Hi Henrik,

Thanks for all the explanation; this is indeed very helpful. In the brms doc I point to the bridgesampling doc, but I will add some more information to the former to make sure people read it.

And please excuse my naivety regarding convergence diagnostics for bridge sampling. I see that such warnings would be non-trivial to construct, still I think it would be nice to have it (even if this is wishful thinking; you understand that better).

Still, I think the situation is not ideal. Perhaps I could add an option to brms::bayes_factor that automatically computes the marginal likelihood multiple times and reports a vector of Bayes factors, so that variability in the latter is immediately visible.

As another perhaps naive question: Could we combine multiple estimates of the marginal_likelihood (computed using `repetitions`) somehow to get a better estimate (maybe you covered this somewhere already)?

> And please excuse my naivety regarding convergence diagnostics for bridge sampling. I see that such warnings would be non-trivial to construct; still, I think it would be nice to have them (even if this is wishful thinking; you understand that better).
>
> Still, I think the situation is not ideal. Perhaps I could add an option to brms::bayes_factor that automatically computes the marginal likelihood multiple times and reports a vector of Bayes factors, so that variability in the latter is immediately visible.

As I said before, the problem is that even if the bridge sampler has converged with perfect accuracy, this is conditional on one specific set of posterior samples. A fully adequate assessment of convergence requires at least two independent sets of posterior samples, and this is outside the scope of our package. So we will think about whether we can add something, but it would be an imperfect solution at best. My advice: any paper that reports Bayesian model selection based on marginal likelihoods needs to have calculated the Bayes factor or posterior model probabilities from at least two independent sets of posterior samples (the best solution is to look at all possible combinations across the different estimates and models).
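Looking at all combinations across estimates and models could be done with a few lines of base R. A sketch, where `logml_null` and `logml_full` stand for hypothetical vectors of log marginal likelihood estimates obtained from independent refits of each model (the values below are made up for illustration):

```r
# Hypothetical log marginal likelihood estimates from independent refits:
logml_null <- c(-151.2, -151.4, -151.1)
logml_full <- c(-149.9, -150.2, -150.0)

# BF01 for every combination of estimates across the two models:
bf01_grid <- exp(outer(logml_null, logml_full, "-"))

# The spread across combinations indicates how stable the Bayes factor is:
range(bf01_grid)
```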

> As another perhaps naive question: Could we combine multiple estimates of the marginal likelihood (computed using `repetitions`) somehow to get a better estimate (maybe you covered this somewhere already)?

Hmm, if the estimate is unstable, this indicates too few posterior samples. I do not know of any theoretical results, but simply averaging (or something like it) is, in my experience, not guaranteed to converge on the true value. Unfortunately, more samples from the posterior distribution are necessary in this case.

I know that all these suggestions make the calculation of Bayes factors using bridge sampling quite expensive in terms of time and computational resources. Unfortunately, it is an inherently difficult problem due to the two levels of uncertainty. The solution appears to require quite a lot of samples, but at least there is a solution.

Thank you very much for your insights. I just added more stuff to the brms doc to reflect what you have said.

I think bridgesampling is an amazing package that treats a very difficult but relevant problem. I don't know of any other comparably general approach to computing the marginal likelihood, so I think the additional computational time is well worth it, and I am more than happy to support it in brms :-)

I have a couple of minor follow-up questions if you don't mind:

1. Whatās is it exactly that enables bridgesampling to perform so much more accurately for higher number of samples? I assume it is the tails of the distribution, which tend to be estimated poorly with too few samples?
2. If one wanted to have a helper function to split the posterior samples in half (say, by splitting the number of chains in case of equal number of chains) and to compute two bayes factors / marginal likelihoods each based on half of the samples, where would you put such a helper function?
3. I understand where the bottleneck of the whole approach lies, but I think it could still be worth propagating the uncertainty from `bridge_sampler` to `bayes_factor` somehow to warn users about possibly unstable results. Something that brms can throw in a manner that is impossible to overlook for users (as we try with the posterior itself as well as with other indicators of model fit). Right now, the print method of `bridge_sampler` just shows the result without any indication of the error measure (or am I mistaken)?

I have the same intuition as you do, but no data or papers to back that up.

The natural way in our package would be to add this to the `bridge_sampler` function, which computes the marginal likelihood. It might be a nice possibility to have. The only problem I can see is that in many real-life cases, the resulting `stanfit` object would get so huge that it might not actually be feasible. But we will discuss this.

It is true that the `print` method does not show it; only the `summary` method does. This is based on the `error_measures` function. Using the example from `?bridge_sampler`:

```r
bridge_result
# Bridge sampling estimate of the log marginal likelihood: 1.8378
# Estimate obtained in 4 iteration(s) via method "normal".

summary(bridge_result)
#
# Bridge sampling log marginal likelihood estimate
# (method = "normal", repetitions = 1):
#
#  1.837801
#
# Error Measures:
#
#  Relative Mean-Squared Error: 8.477251e-08
#  Coefficient of Variation: 0.0002911572
#  Percentage Error: 0.0291%
#
# Note:
# All error measures are approximate.

error_measures(bridge_result)
# $`re2`
# [1] 8.477251e-08
#
# $cv
# [1] 0.0002911572
#
# $percentage
# [1] "0.0291%"
```

We should probably add something like this to the example.

Hi all, I hope itās ok to use this page for some follow-up questions about bridgesampling and brms, as it keep coming up in search queries for Bayes Factors and brms. For background, I have a brmsfit of a GLMM (family = bernoulli) with two grouping factors (participants and items, as is typical in my field). I have 20,000 posterior samples (instead of 2000-6000, which would be sufficient for estimation) and all common model diagnostics look good. (The files are large but if it helps I can upload the models or share a link.)

My goal is to compare a model with and without one of the fixed effects (while keeping the random slope for that effect in both models). I've used bridge_sampler with 4 repetitions (maxiter = 1000) and method = "warp3". The estimated BFs that I obtain for the model comparison vary widely, including runs that seem to clearly favor one or the other model. Along with this comes the well-known warning that "logml could not be estimated within maxiter, rerunning with adjusted starting value. Estimate might be more variable than usual." I'm wondering whether adding more samples will really solve the problem, or whether the issue is that the sampler tries to explore the "typical" region of the posterior distribution (rather than the tails). Some specific questions I have:

1. Does bridge_sampler "care" about the number of chains that provide the posterior samples, or are all posterior samples from the brms model pooled? I have currently used brms::combine_models() to combine various runs of the same model, so that the 20,000 samples come from 8 rather than 4 chains.

2. The discussion above mentioned that priors suitable for estimation tend to be suboptimal as priors for model comparison. Currently, the brms models are fit with weakly regularizing priors, following the recommendations at Prior Choice Recommendations · stan-dev/stan Wiki · GitHub. Is that a bad idea for the purpose of model comparison? (If so, perhaps the wiki page could be edited? @andrewgelman)

3. Iāve read the vignette for using bridge_sampler with rstanfits, and Iām about half-way through the āBayes Factor Workflowā paper by @vasishth @bnicenboim @paul.buerkner @andrewgelman and @betanalpha ā very helpful!) but I havenāt yet found recommendations that would deal with the type of problem Iām encountering. Any further reading suggestions would be welcome.

4. Alternative suggestions for testing the null would be appreciated, too. I've considered using brms's hypothesis function to test the point hypothesis (via the Savage-Dickey method), but based on what I've read, that is also likely to lead to unreliable (or even biased) estimates?

Thank you, and apologies for "re-heating" this old thread.

I do not recommend using a Bayes factor. If you want to combine or compare the models, I recommend using leave-one-out cross validation and stacking; see here:

http://www.stat.columbia.edu/~gelman/research/published/stacking_paper_discussion_rejoinder.pdf

Thank you. Weāve been using loo in other projects. For some reason, it completely escaped me to use it for the present purpose of testing the null. (to clarify, the ācombinationā of models I was referring to was only the combination of chains / samples from different runs of the same model).

For the present purpose, I don't need model stacking or other ways to derive predictions from sets of models; rather, I am interested in a measure of the support for the null (the two models differ in one parameter). Would you then report the ELPD difference between the models along with its SE, with small differences (say, less than 2 SEs) indicating that the null model has largely the same predictive accuracy as the model with the additional parameter?
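The comparison described here could be sketched roughly as follows (hypothetical code; `fit_full` and `fit_null` stand for the two fitted brmsfit objects, and `loo_compare` assumes a recent version of the loo package):

```r
library(loo)

# Pointwise out-of-sample predictive fit for each model (PSIS-LOO):
loo_full <- loo(fit_full)
loo_null <- loo(fit_null)

# Reports elpd_diff and its SE; an elpd_diff within roughly 2 SEs of zero
# suggests the null model predicts about as well as the fuller model:
loo_compare(loo_full, loo_null)
```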

Hi Florian,

You wrote:

"The discussion above mentioned that priors suitable for estimation tend to be suboptimal as priors for model comparison. Currently, the brms models are fit with weakly regularizing priors, following the recommendations at Prior Choice Recommendations · stan-dev/stan Wiki · GitHub. Is that a bad idea for the purpose of model comparison? (If so, perhaps the wiki page could be edited?)"

From your message, it sounds like you are letting brms determine the priors. I would never do that for BF calculations (or even otherwise, unless I have literally no idea what the priors should be; this hardly ever happens). Just as a sanity check, could you try using more informative priors, determined using prior predictive checks?
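One way to run such a prior predictive check in brms is `sample_prior = "only"`, which samples from the prior instead of the posterior. A sketch, reusing the formula and priors from earlier in the thread (argument names such as `ndraws` may differ across brms versions):

```r
# Fit the model structure using draws from the prior only:
prior_fit <- brm(log(region7) ~ dat + adj + int +
                   (1 + dat + adj + int | subj) +
                   (1 + dat + adj + int | item),
                 data = dat,
                 family = gaussian(),
                 prior = priors_logrt_normal,
                 sample_prior = "only")

# Compare prior-implied data against the observed distribution; priors that
# generate wildly implausible reading times should be tightened:
pp_check(prior_fit, ndraws = 50)
```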

What happens when you fit the model using anova() in R?


Another thing I would do is use simulated data that is similar to your data, with the ground truth known.


Hi Shravan,