Bug in brms bayes_factor for model with me() term?

The function bayes_factor in brms behaves oddly when the model includes a measurement-error term, me(F, sdF).
I wish to estimate the effect of an environmental factor, F, on the number of people with low body weight. The design is a multilevel Poisson regression with repeated measures within counties, nested within states of the USA.
I constructed (1) a null model, m0, with no F term; (2) a simple model, m1, with a term for F but no adjustment for its measurement error; and (3) a model, m1e, that adjusts for measurement error in F (coded as me(F, sdF)). Here is the (abbreviated) summary for m1e (all Rhats = 1.00, all effective sample sizes > 4000):

     Family: poisson 
      Links: mu = log 
    Formula: LBW ~ 0 + intercept + year + (1 | SCode/County) + year .... + me(F, sdF) + offset(log(population)) 
       Data: dat (Number of observations: 1211) 
    Samples: 4 chains, each with iter = 10000; warmup = 5000; thin = 1;
             total post-warmup samples = 20000

    Group-Level Effects: 
    ~SCode (Number of levels: 27) 
                  Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
    sd(Intercept)     0.05      0.01     0.03     0.07 1.00     5348     9513

    ~SCode:County (Number of levels: 173) 
                  Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
    sd(Intercept)     0.04      0.00     0.03     0.05 1.00     7493    11669

    Population-Level Effects: 
               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
    intercept     -2.60      0.03    -2.66    -2.54 1.00     4686     6828
    year           0.17      0.02     0.12     0.21 1.00    12543    14845
    .
    .
    .
    meFsdF         0.12      0.04     0.04     0.20 1.00     4029     5524

    Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
    is a crude measure of effective sample size, and Rhat is the potential 
    scale reduction factor on split chains (at convergence, Rhat = 1).
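
(To make the setup concrete, here is a minimal sketch of how a model with this formula might be fit. It assumes dat contains the columns named above, omits the covariates elided as "....", and the priors listed at the end of this post would be supplied via the prior argument:)

    # Sketch only: assumes dat holds LBW, year, SCode, County, F, sdF and
    # population; the extra covariates shown as "...." are left out.
    library(brms)

    m1e <- brm(
      LBW ~ 0 + intercept + year + (1 | SCode/County) +
        me(F, sdF) + offset(log(population)),
      data = dat, family = poisson(),
      chains = 4, iter = 10000, warmup = 5000,
      save_all_pars = TRUE  # keep all parameter draws; bridge sampling needs them
    )
    # m0 and m1 are the same call with me(F, sdF) removed or replaced by plain F.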

The loo information criteria for the three models are similar:

> loo(m0,m1,m1e)
Output of model 'm0':
Computed from 10000 by 1211 log-likelihood matrix
         Estimate   SE
elpd_loo  -4679.5 32.2
p_loo       115.2  5.4
looic      9359.0 64.4
------
Monte Carlo SE of elpd_loo is 0.1.
Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     1207  99.7%   1246      
 (0.5, 0.7]   (ok)          4   0.3%   290       
   (0.7, 1]   (bad)         0   0.0%   <NA>      
   (1, Inf)   (very bad)    0   0.0%   <NA>      

Output of model 'm1':
Computed from 10000 by 1211 log-likelihood matrix
         Estimate   SE
elpd_loo  -4680.1 32.4
p_loo       114.0  5.4
looic      9360.3 64.7
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     1209  99.8%   1195      
 (0.5, 0.7]   (ok)          1   0.1%   1440      
   (0.7, 1]   (bad)         1   0.1%   286       
   (1, Inf)   (very bad)    0   0.0%   <NA>      

Output of model 'm1e':
Computed from 20000 by 1211 log-likelihood matrix
         Estimate   SE
elpd_loo  -4677.9 31.7
p_loo       130.8  6.5
looic      9355.7 63.5
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     1202  99.3%   2533      
 (0.5, 0.7]   (ok)          8   0.7%   768       
   (0.7, 1]   (bad)         1   0.1%   222       
   (1, Inf)   (very bad)    0   0.0%   <NA>      

Model comparisons:
    elpd_diff se_diff
m1e  0.0       0.0   
m0  -1.6       2.6   
m1  -2.3       1.5   
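
(As an aside, the lone observation flagged as "bad" by the Pareto k diagnostic in m1 and m1e can be handled with exact refits via the reloo argument of loo(); a sketch, which refits the model once per flagged observation and is therefore slow:)

    # Exact refits for observations with Pareto k above 0.7 (one refit each).
    loo(m0, m1, m1e, reloo = TRUE)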

Now, if I compute the Bayes factor for m1 over m0, I obtain a reasonable value:

set.seed(12345); bayes_factor(m1,m0,log=T,silent=T, reloo=T)
Estimated log Bayes factor in favor of bridge1 over bridge2: -0.46107

But if I compute the Bayes factor for m1e over m0, the result is astronomically large:

set.seed(12345); bayes_factor(m1e,m0,log=T,silent=T, reloo=T)
Estimated log Bayes factor in favor of bridge1 over bridge2: 334.56230

exp(334.6) is roughly 2 × 10^145 - too large to write out in full here.

Why would two models whose looic values are so similar show such a large Bayes factor? Is this a bug in bayes_factor?

Thanks in anticipation of your help.

The priors for m0 and m1 are:

            prior class      coef group resp dpar nlpar bound
1     normal(0,1)     b                                      
2 normal(-2.5,.5)     b intercept                            
3     gamma(1,10)    sd                                      

The priors for m1e are:
            prior  class      coef group resp dpar nlpar bound
1     normal(0,1)      b                                      
2 normal(-2.5,.5)      b intercept                            
3     gamma(1,10)     sd                                      
4     normal(0,1) meanme                                      
5      gamma(1,5)   sdme                                      
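
(For reference, a sketch of how the same priors could be specified programmatically; the meanme and sdme classes exist only for the measurement-error model m1e:)

    library(brms)
    # Priors transcribed from the tables above; m0 and m1 use only the first three.
    prior_m1e <- c(
      set_prior("normal(0, 1)",      class = "b"),
      set_prior("normal(-2.5, 0.5)", class = "b", coef = "intercept"),
      set_prior("gamma(1, 10)",      class = "sd"),
      set_prior("normal(0, 1)",      class = "meanme"),
      set_prior("gamma(1, 5)",       class = "sdme")
    )
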
sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
 [1] LC_CTYPE=en_NZ.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_NZ.UTF-8        LC_COLLATE=en_NZ.UTF-8    
 [5] LC_MONETARY=en_NZ.UTF-8    LC_MESSAGES=en_NZ.UTF-8   
 [7] LC_PAPER=en_NZ.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_NZ.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] brms_2.10.0 Rcpp_1.0.3 

loaded via a namespace (and not attached):
 [1] Brobdingnag_1.2-6    gtools_3.8.1         StanHeaders_2.19.0  
 [4] threejs_0.3.1        shiny_1.4.0          assertthat_0.2.1    
 [7] stats4_3.6.1         pillar_1.4.2         backports_1.1.5     
[10] lattice_0.20-38      glue_1.3.1           digest_0.6.22       
[13] promises_1.1.0       colorspace_1.4-1     htmltools_0.4.0     
[16] httpuv_1.5.2         Matrix_1.2-17        plyr_1.8.4          
[19] dygraphs_1.1.1.6     pkgconfig_2.0.3      rstan_2.19.2        
[22] purrr_0.3.3          xtable_1.8-4         scales_1.1.0        
[25] processx_3.4.1       later_1.0.0          tibble_2.1.3        
[28] bayesplot_1.7.0      ggplot2_3.2.1        DT_0.10             
[31] shinyjs_1.0          lazyeval_0.2.2       cli_1.1.0           
[34] magrittr_1.5         crayon_1.3.4         mvnfast_0.2.5       
[37] mime_0.7             ps_1.3.0             nlme_3.1-142        
[40] xts_0.11-2           pkgbuild_1.0.6       colourpicker_1.0    
[43] rsconnect_0.8.15     tools_3.6.1          loo_2.1.0           
[46] prettyunits_1.0.2    lifecycle_0.1.0      matrixStats_0.55.0  
[49] stringr_1.4.0        munsell_0.5.0        callr_3.3.2         
[52] compiler_3.6.1       rlang_0.4.1          grid_3.6.1          
[55] ggridges_0.5.1       rstudioapi_0.10      htmlwidgets_1.5.1   
[58] crosstalk_1.0.0      igraph_1.2.4.1       miniUI_0.1.1.1      
[61] base64enc_0.1-3      gtable_0.3.0         codetools_0.2-16    
[64] inline_0.3.15        abind_1.4-5          markdown_1.1        
[67] reshape2_1.4.3       R6_2.4.1             gridExtra_2.3       
[70] rstantools_2.0.0     zoo_1.8-6            bridgesampling_0.7-2
[73] dplyr_0.8.3          fastmap_1.0.1        shinystan_2.5.0     
[76] shinythemes_1.1.2    stringi_1.4.3        parallel_3.6.1      
[79] tidyselect_0.2.5     coda_0.19-3         

PS Sorry that the formatting in the post is so bizarre - I don’t think I did that!

Regarding formatting: you can use triple backticks (```) to mark code/output blocks; I edited the post to show that.

I have little direct experience with Bayes factors, but many people I trust consider them fragile, and you may well be running into one of the fragile cases. Measurement-error models can be quite tricky in and of themselves, and I am not sure a Bayes factor comparison can even be made meaningful between models with and without a measurement-error term - your model gained many extra dimensions that need to be integrated over.

Checking the stability of the estimate by fitting the model multiple times and computing a Bayes factor for each fit, as mentioned in the Bayes factor using brms thread, might be worthwhile.
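
(A rough sketch of such a check, assuming the models were fit with save_all_pars = TRUE: refit each model a few times with different seeds and compare the resulting log Bayes factors.)

    # Hypothetical stability check: refit with new seeds and recompute the
    # log Bayes factor; a wide spread means the estimate cannot be trusted.
    log_bf_reps <- sapply(1:5, function(i) {
      m1e_i <- update(m1e, recompile = FALSE, seed = 100 + i)
      m0_i  <- update(m0,  recompile = FALSE, seed = 200 + i)
      bayes_factor(m1e_i, m0_i, log = TRUE, silent = TRUE)$bf
    })
    log_bf_reps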

Tagging @Henrik_Singmann for a second opinion (if he’s got time to spare).

I know too little about the parameterization of these models to say anything with certainty, but the priors seem to be a non-trivial issue here. We generally only recommend Bayes factors when the priors have been chosen specifically with Bayes factor based model comparison in mind, and that does not seem to be the case here. So from my perspective you would have to make a good case for why these priors are reasonable.

I also agree that Bayes factors are numerically unstable, especially in high-dimensional cases. We would therefore recommend drawing several independent sets of posterior samples and calculating a Bayes factor for each set; only if the precision across these independent estimates is high enough would we put trust in them. Usually one needs a lot of samples to get stable estimates in high-dimensional settings.
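
(To isolate the bridge sampling error from the posterior sampling error, one could also rerun just the bridge sampler several times on the same draws; this assumes bridgesampling's repetitions argument is passed through by brms:)

    # Sketch: repeated bridge sampling runs on the same posterior draws; the
    # resulting log Bayes factor estimates should agree closely if the bridge
    # sampler itself is stable.
    bs_m1e <- bridge_sampler(m1e, repetitions = 10, silent = TRUE)
    bs_m0  <- bridge_sampler(m0,  repetitions = 10, silent = TRUE)
    bridgesampling::bayes_factor(bs_m1e, bs_m0, log = TRUE)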

So my advice: tread carefully.

Thank you, both. I was unaware that the stability of Bayes factors could depend on the prior distributions, or that they could display such instability in high-dimensional models.

I used a gamma prior for the ‘sd’ terms because the model failed to converge with the default prior. Paul Buerkner advised that this may be because the random terms explained so much variance that the sigma term could tend to zero under the default prior, so he suggested using a prior distribution that bounds the error away from zero.

Which prior distributions might improve the stability and prevent the problem of sigma tending to zero?

Thanks for your help

I honestly don’t know, and I fear you are in quite advanced/complex territory.
The question to ask is whether you actually need Bayes factors for the scientific question you are interested in. Using loo or k-fold cross-validation, or posterior predictive checks, might be a more sensible approach to model selection (search around the forums/docs; there are quite a lot of examples). If you need assistance with those, it might be sensible to start a new thread.
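
(For example, hypothetical calls using the fitted objects from above:)

    # A posterior predictive check suited to count data, and 10-fold
    # cross-validation as an alternative to Bayes factors (kfold is slow:
    # it refits each model K times).
    pp_check(m1e, type = "rootogram")
    kfold(m0, m1, m1e, K = 10)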

Best of luck!
