Bayes factor using brms


#1

Hi all,

Attached are some data that I am trying to do a Bayes factor analysis on using brms.
Reproducible code is below. I set pretty reasonable priors for the fixed-effects factors. I am testing the null hypothesis that the parameter for the factor dat is 0 against the alternative that the parameter for dat has a Normal(0,1) prior. After fitting the full and null models, i.e., models with and without dat as a fixed factor, I computed the Bayes factor with the brms function bayes_factor 100 times, and each time I got a new number:

summary(bf_values)
 Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.05725 0.12663 0.19257 0.23821 0.30977 0.85361

So the BF in favor of the alternative ranges from roughly 1 to 17! Once the models are fitted, why does the Bayes factor vary so much?

dat.txt (16.2 KB)

# load data:
dat <- read.table("dat.txt", header = TRUE)

library(brms)
#- Null: effect of dat = 0
#- Alternative: Normal(0,1) priors on the fixed-effects coefficients



priors_logrt_normal <- c(set_prior("normal(0,10)", class = "b"),
                         set_prior("normal(0,1)", class = "b", coef = "dat"),
                         set_prior("normal(0,1)", class = "b", coef = "adj"),
                         set_prior("normal(0,1)", class = "b", coef = "int"),
                         set_prior("normal(0,1)", class = "sd"),
                         set_prior("normal(0,1)", class = "sigma"),
                         set_prior("lkj(2)", class = "cor"))


full1 <- brm(formula = log(region7) ~ dat + adj + int +
               (1 + dat + adj + int | subj) +
               (1 + dat + adj + int | item),
             data = dat,
             family = gaussian(),
             prior = priors_logrt_normal,
             warmup = 1000,
             iter = 2000,
             chains = 4,
             save_all_pars = TRUE,
             control = list(adapt_delta = 0.99, max_treedepth = 15))

## null model:
NULLpriors_logrt_normal <- c(set_prior("normal(0,10)", class = "b"),
                             set_prior("normal(0,1)", class = "b", coef = "adj"),
                             set_prior("normal(0,1)", class = "b", coef = "int"),
                             set_prior("normal(0,1)", class = "sd"),
                             set_prior("normal(0,1)", class = "sigma"),
                             set_prior("lkj(2)", class = "cor"))

null1 <- brm(formula = log(region7) ~ adj + int +
               (1 + dat + adj + int | subj) +
               (1 + dat + adj + int | item),
             data = dat,
             family = gaussian(),
             prior = NULLpriors_logrt_normal,
             warmup = 1000,
             iter = 2000,
             chains = 4,
             save_all_pars = TRUE,
             control = list(adapt_delta = 0.99, max_treedepth = 15))

set.seed(4321)

bf_values <- rep(NA, 100)
for (i in 1:100) {
  bf01 <- bayes_factor(null1, full1)
  bf_values[i] <- bf01$bf
}

summary(bf_values)
 Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.05725 0.12663 0.19257 0.23821 0.30977 0.85361

sessionInfo()

R version 3.4.4 (2018-03-15)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6

Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets 
[6] methods   base     

other attached packages:
 [1] rlist_0.4.6.1    HDInterval_0.1.3 lattice_0.20-35 
 [4] magrittr_1.5     dplyr_0.7.4      tidyr_0.8.0     
 [7] gdata_2.18.0     brms_2.1.9       ggplot2_2.2.1   
[10] Rcpp_0.12.16     devtools_1.13.5 

loaded via a namespace (and not attached):
 [1] Brobdingnag_1.2-5    httr_1.3.1          
 [3] gtools_3.5.0         StanHeaders_2.17.2  
 [5] threejs_0.3.1        shiny_1.0.5         
 [7] assertthat_0.2.0     stats4_3.4.4        
 [9] yaml_2.1.19          pillar_1.2.1        
[11] backports_1.1.2      glue_1.2.0          
[13] digest_0.6.15        colorspace_1.3-2    
[15] htmltools_0.3.6      httpuv_1.3.6.2      
[17] Matrix_1.2-12        plyr_1.8.4          
[19] dygraphs_1.1.1.4     pkgconfig_2.0.1     
[21] rstan_2.17.3         purrr_0.2.4         
[23] xtable_1.8-2         mvtnorm_1.0-7       
[25] scales_0.5.0         git2r_0.21.0        
[27] tibble_1.4.2         bayesplot_1.4.0     
[29] DT_0.4               withr_2.1.2         
[31] shinyjs_1.0          lazyeval_0.2.1      
[33] mime_0.5             memoise_1.1.0       
[35] evaluate_0.10.1      papaja_0.1.0.9735   
[37] nlme_3.1-131.1       xts_0.10-2          
[39] colourpicker_1.0     data.table_1.10.4-3 
[41] rsconnect_0.8.8      tools_3.4.4         
[43] loo_1.1.0            matrixStats_0.53.1  
[45] stringr_1.3.1        munsell_0.4.3       
[47] bindrcpp_0.2         compiler_3.4.4      
[49] rlang_0.2.0          grid_3.4.4          
[51] htmlwidgets_1.0      crosstalk_1.0.0     
[53] igraph_1.2.1         miniUI_0.1.1        
[55] base64enc_0.1-3      rmarkdown_1.9.8     
[57] gtable_0.2.0         codetools_0.2-15    
[59] inline_0.3.14        abind_1.4-5         
[61] curl_3.1             markdown_0.8        
[63] reshape2_1.4.3       R6_2.2.2            
[65] gridExtra_2.3        rstantools_1.4.0    
[67] zoo_1.8-1            knitr_1.20          
[69] bridgesampling_0.4-0 bindr_0.1.1         
[71] shinystan_2.4.0      shinythemes_1.1.1   
[73] rprojroot_1.3-2      stringi_1.2.2       
[75] parallel_3.4.4       coda_0.19-1

#2

This may happen with the bridge sampling method, and I agree this is problematic…

For bridge sampling to yield stable results, you may need many more posterior samples, say iter = 10000, warmup = 1000, chains = 4.
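
For concreteness, a sketch of that suggestion applied to the models from post #1 (update() re-uses the compiled model, so only the sampling is repeated; the exact numbers are illustrative):

## Refit with many more posterior draws, then recompute the Bayes factor:
full1_long <- update(full1, warmup = 1000, iter = 10000, chains = 4)
null1_long <- update(null1, warmup = 1000, iter = 10000, chains = 4)
bayes_factor(null1_long, full1_long)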


#3

Thanks. Would it be possible to set a default of, say, 10 runs in bayes_factor, and then issue a warning if the results differ beyond a certain tolerance that can be set via a parameter? The instability could serve as a diagnostic, like R-hat. I noticed the problem after running it only 2-3 times.


#4

I think this is something you have to ask the developers of the bridgesampling package.


#5

Tagging @Quentin


#6

This is one of the reasons why, despite all the smart people working on this and the nice packages (bridgesampling is really cool), I’m still very wary of most uses of Bayes factors in the wild. That said, I think something like what @vasishth is proposing would be a good option for the bridgesampling package to provide.


#7

I think it’s unavoidable that people think BDA = Bayes factor; I just saw a post yesterday on a blog saying this. Better to make this part of the core Stan documentation, or else you will have people like me reporting unstable numbers. I hereby offer my services in that regard.


#8

BTW, Paul’s suggestion to increase the number of iterations worked. I get some small differences between repeated computations of the Bayes factor, but it’s basically stable.

Also, the results now make sense.


#9

I think what we need is a convergence diagnostic for the bridge sampling algorithm that tells us whether the value it converged to may be unstable.


#10

The default method for obtaining the marginal likelihood in the bridgesampling package already provides an error measure, based on the formula for the approximate relative mean-squared error developed by Frühwirth-Schnatter (2004). One could use this to investigate whether the marginal likelihood on which the Bayes factor is based is precise enough. However, we do not propagate this uncertainty to the calculation of the Bayes factor. I am not sure there is a straightforward way to do so, but this is also not super important.

The main issue is that bridge sampling, like other sampling approaches, is a numerical method for which diagnosing convergence problems is generally not trivial. The added difficulty in our case is that the data used by the bridge sampler are themselves samples generated by MCMC. Thus, there are two levels of numerical uncertainty: uncertainty from the posterior sampling and uncertainty from the bridge sampler.

So the only real way to ensure stability of the sampler is to do the equivalent of running independent MCMC chains, as done in the initial code here: run Stan multiple times to obtain multiple independent sets of posterior samples (of course, each of those already consists of multiple independent chains). For each set of posterior samples, obtain at least one estimate of the marginal likelihood. If the marginal likelihood estimates are all near enough to each other (e.g., relative to the magnitude of the differences in marginal likelihood between the models), the Bayes factor will be kosher. If not, more samples from the posterior distribution are usually necessary.
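
With the brms models from post #1, that advice might look like the following sketch (assuming update() re-runs the sampling and passes seed through to Stan; the object names are illustrative):

## Two independent sets of posterior samples per model:
full_a <- update(full1, seed = 1)
full_b <- update(full1, seed = 2)
null_a <- update(null1, seed = 1)
null_b <- update(null1, seed = 2)
## The two Bayes factor estimates should be close to each other:
bayes_factor(null_a, full_a)$bf
bayes_factor(null_b, full_b)$bf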

So, to us it is not immediately apparent what kind of check to add to our package. The bridge_sampler function already has a repetitions argument, which yields more than one marginal likelihood estimate from a single fixed set of posterior samples. This allows estimating the uncertainty at the second level. However, to get a full picture of the uncertainty, a new set of posterior samples is necessary.
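
For example (a sketch, assuming brms::bridge_sampler passes extra arguments through to bridgesampling):

## Ten marginal likelihood estimates from one fixed set of posterior samples:
ml_full <- bridge_sampler(full1, repetitions = 10)
ml_full$logml  # a vector of 10 log marginal likelihood estimates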

Perhaps most importantly, estimating the marginal likelihood usually requires at least an order of magnitude more posterior samples than parameter estimation does. We warn about this both in our paper and on the help page. For example (from ?bridgesampling::bridge_sampler):

Also note that for testing, the number of posterior samples usually needs to be substantially larger than for estimation.

Maybe the easiest solution would be to add a similar warning to the help page of brms::bayes_factor, and to encourage users to obtain at least two independent sets of posterior samples and Bayes factor estimates.


#11

Hi Henrik,

Thanks for all the explanation; this is indeed very helpful. In the brms doc I point to the bridgesampling doc, but I will add some more information to the former to make sure people read it.

And please excuse my naivety regarding convergence diagnostics for bridge sampling. I see that such warnings would be non-trivial to construct; still, I think it would be nice to have them (even if this is wishful thinking; you understand that better than I do).

Still, I think the situation is not ideal. Perhaps I could add an option to brms::bayes_factor that automatically computes the marginal likelihood multiple times and reports a vector of Bayes factors, so that variability in the latter is immediately visible.

As another perhaps naive question: could we somehow combine multiple estimates of the marginal likelihood (computed using repetitions) to get a better estimate (maybe you covered this somewhere already)?


#12

And please excuse my naivety regarding convergence diagnostics for bridge sampling. I see that such warnings would be non-trivial to construct; still, I think it would be nice to have them (even if this is wishful thinking; you understand that better than I do).

Still, I think the situation is not ideal. Perhaps I could add an option to brms::bayes_factor that automatically computes the marginal likelihood multiple times and reports a vector of Bayes factors, so that variability in the latter is immediately visible.

As I said before, the problem is that even if the bridge sampler has converged with perfect accuracy, this is conditional on one specific set of posterior samples. A fully adequate assessment of convergence requires at least two independent sets of posterior samples, and that is outside the scope of our package. We will think about whether we can add something, but it can only be a partial solution. My advice: any paper that reports Bayesian model selection based on marginal likelihoods needs to have calculated the Bayes factor or posterior model probabilities from at least two independent sets of posterior samples (the best approach is to look at all possible combinations across the different estimates and models; see the sketch below).
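
For instance, reusing the hypothetical refits full_a, full_b, null_a, null_b from the sketch under #10, the combinations can be checked like this:

## Log marginal likelihoods from each independent set of posterior samples:
logml_null <- sapply(list(null_a, null_b), function(m) bridge_sampler(m)$logml)
logml_full <- sapply(list(full_a, full_b), function(m) bridge_sampler(m)$logml)
## BF01 for all four combinations; these should all be similar:
outer(logml_null, logml_full, function(n, f) exp(n - f))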

As another perhaps naive question: could we somehow combine multiple estimates of the marginal likelihood (computed using repetitions) to get a better estimate (maybe you covered this somewhere already)?

Hmm, if the estimate is unstable, this indicates too few posterior samples. I do not know of any theoretical results, but in my experience simply averaging (or something similar) is not guaranteed to converge on the true value. Unfortunately, more samples from the posterior distribution are necessary in this case.

I know that all these suggestions make the calculation of Bayes factors via bridge sampling quite expensive in terms of time and computational resources. Unfortunately, it is an inherently difficult problem due to the two levels of uncertainty, and the solution appears to require quite a lot of samples. But at least there is a solution.


#13

Thank you very much for your insights. I just added more stuff to the brms doc to reflect what you have said.

I think bridgesampling is an amazing package tackling a very difficult but relevant problem. I don’t know of any other comparably general approach to computing the marginal likelihood, so I think the additional computational time is well worth it, and I am more than happy to support it in brms :-)

I have a couple of minor follow-up questions, if you don’t mind:

  1. What is it exactly that enables bridge sampling to perform so much more accurately with higher numbers of samples? I assume it is the tails of the distribution, which tend to be estimated poorly with too few samples?
  2. If one wanted a helper function that splits the posterior samples in half (say, by splitting the chains when there is an even number of them) and computes two Bayes factors / marginal likelihoods, each based on half of the samples, where would you put such a helper function?
  3. I understand where the bottleneck of the whole approach lies, but I think it could still be worth propagating the uncertainty from bridge_sampler to bayes_factor somehow, to warn users about possibly unstable results. Something that brms can throw in a manner that is impossible for users to overlook (as we try to do with the posterior itself, as well as with other indicators of model fit). Right now, the print method of bridge_sampler just shows the result without any indication of the error measure (or am I mistaken)?

#14

I have the same intuition as you do, but no data or papers to back it up.

The natural place in our package would be the bridge_sampler function, which computes the marginal likelihood. That could be a nice possibility to have. The only problem I can see is that in many real-life cases the resulting stanfit object would get so huge that it might not actually be feasible. But we will discuss this.

It is true that the print method does not show it; only the summary method does. This is based on the error_measures function. Using the example from ?bridge_sampler:

bridge_result
# Bridge sampling estimate of the log marginal likelihood: 1.8378
# Estimate obtained in 4 iteration(s) via method "normal".
summary(bridge_result)
# 
# Bridge sampling log marginal likelihood estimate 
# (method = "normal", repetitions = 1):
# 
#  1.837801
# 
# Error Measures:
# 
#  Relative Mean-Squared Error: 8.477251e-08
#  Coefficient of Variation: 0.0002911572
#  Percentage Error: 0.0291%
# 
# Note:
# All error measures are approximate.
error_measures(bridge_result)
# $`re2`
# [1] 8.477251e-08
# 
# $cv
# [1] 0.0002911572
# 
# $percentage
# [1] "0.0291%"

We should probably add something like this to the example.