loo_model_weights() returning inconsistent results

I am working on creating a reproducible example, but in the meantime I thought I would ask now in case there is a known solution. Also, this is my first forum post; I have tried to format it correctly using code chunks etc., but I apologize if it's not quite right!

My issue is that when I compare candidate models using the loo_model_weights() function I receive inconsistent model weights.

I noticed this issue while responding to comments on a manuscript and running some alternative models with brm(). I ran the original analysis on an old computer, so before fitting the new models I re-fit the original models on my new machine as a starting point. There were six models in total: a “global” model and a suite of simplified models (removing predictor variables from the global model). I fit the global model and then used the update() function to sequentially remove predictor variables.

mod1 <- brm(data = d,
            y ~ x1 + x2 + x3 + x4 + x5 + (1 | siteID) + (1 | year),
            family = gaussian(),
            prior = c(
              prior(normal(0, 0.25), class = "b"),
              prior(normal(-1.5, 1), class = "Intercept"),
              prior(exponential(2), class = "sd"),
              prior(exponential(2), class = "sigma")),
            chains = 4,
            sample_prior = FALSE,
            iter = 6000,
            cores = 4,
            control = list(adapt_delta = 0.99))

mod2 <- update(mod1,
               formula. = . ~ . -x2 -x3,
               cores = 4)

etc.

For each model, I run a loo analysis using

loo_1 <- loo(mod1, reloo = TRUE, seed = TRUE, cores = 6)
loo_2 <- loo(mod2, reloo = TRUE, seed = TRUE, cores = 6)

The seed = TRUE was a solution I found previously, after the future package was updated. That was approximately 6 months ago, and I have been unable to re-find the forum post where I got it. As a sidebar, it seems like this may no longer be needed? I ran loo() with and without it and my main problem remains the same, but I thought I would mention it here.
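To be clear about what I mean by controlling the seed without that argument, here is a minimal sketch. My assumption is that calling set.seed() before each loo() call would take its place, though I am not sure it fully controls the parallel reloo refits:

# Minimal sketch (assumption): set the RNG state explicitly before each call
# instead of passing seed = TRUE; with parallel workers this may still not
# make the reloo refits exactly reproducible.
set.seed(2021)
loo_1 <- loo(mod1, reloo = TRUE, cores = 6)

set.seed(2021)
loo_2 <- loo(mod2, reloo = TRUE, cores = 6)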

Then, to compare the models, I use loo_model_weights() like this:

loo_model_weights(
  list(model1 = loo_1,
       model2 = loo_2,
       model3 = loo_3,
       model4 = loo_4,
       model5 = loo_5,
       model6 = loo_6))

When I reran this code on my new computer, the function returned a different set of model weights compared to what I reported in the manuscript. My previous “best” model still received a majority of the weight, but the other models now received different relative weights.

I have re-fit the models and re-run the loo analysis, and every time I get a different set of relative model weights. At first I thought the “best” model at least stayed the same in every run, but after 5 or 6 runs the relative model weights seem to vary widely.

I’m not sure if this is a new issue or if this was the case when I ran them originally ~6 months ago. When I first performed the analysis, I reported the weights as returned, and did not try to re-run them, assuming they would be consistent.

I understand that re-fitting the models could produce some variance in the parameter estimates through random sampling, so a little variation in the estimated model weights could be expected.
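One check I have not run yet (just a sketch) would be to recompute the weights several times from a single fixed set of loo objects, and also with the pseudo-BMA+ method, so that any variation there could not be coming from refitting the models:

# Sketch (not something I have run yet): recompute weights from one fixed set
# of loo objects, so any variation here cannot come from refitting.
loo_list <- list(model1 = loo_1, model2 = loo_2, model3 = loo_3,
                 model4 = loo_4, model5 = loo_5, model6 = loo_6)
loo_model_weights(loo_list)                        # stacking (the default method)
loo_model_weights(loo_list, method = "pseudobma")  # pseudo-BMA+ (Bayesian bootstrap by default)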

I’m just curious whether this is a bug, whether my packages are incompatible, or whether something else in my setup is wrong. Perhaps my models or analysis are not specified correctly? Or perhaps I’m way off the mark and loo_model_weights() is not expected to return consistent results?

Thank you!

Here is my session info:

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] viridis_0.6.0     viridisLite_0.4.0 tidybayes_2.3.1   brms_2.15.0      
 [5] Rcpp_1.0.6        forcats_0.5.1     stringr_1.4.0     dplyr_1.0.5      
 [9] purrr_0.3.4       readr_1.4.0       tidyr_1.1.3       tibble_3.1.1     
[13] ggplot2_3.3.3     tidyverse_1.3.1  

loaded via a namespace (and not attached):
  [1] minqa_1.2.4          colorspace_2.0-0     ellipsis_0.3.2      
  [4] ggridges_0.5.3       rsconnect_0.8.17     markdown_1.1        
  [7] base64enc_0.1-3      fs_1.5.0             rstudioapi_0.13     
 [10] listenv_0.8.0        farver_2.1.0         rstan_2.26.1        
 [13] svUnit_1.0.6         DT_0.18              fansi_0.4.2         
 [16] mvtnorm_1.1-1        lubridate_1.7.10     xml2_1.3.2          
 [19] codetools_0.2-18     bridgesampling_1.1-2 splines_4.0.3       
 [22] shinythemes_1.2.0    bayesplot_1.8.0      projpred_2.0.2      
 [25] jsonlite_1.7.2       nloptr_1.2.2.2       broom_0.7.6         
 [28] dbplyr_2.1.1         ggdist_2.4.0         shiny_1.6.0         
 [31] compiler_4.0.3       httr_1.4.2           backports_1.2.1     
 [34] assertthat_0.2.1     Matrix_1.2-18        fastmap_1.1.0       
 [37] cli_2.5.0            later_1.2.0          prettyunits_1.1.1   
 [40] htmltools_0.5.1.1    tools_4.0.3          igraph_1.2.6        
 [43] coda_0.19-4          gtable_0.3.0         glue_1.4.2          
 [46] reshape2_1.4.4       V8_3.4.2             cellranger_1.1.0    
 [49] vctrs_0.3.8          nlme_3.1-152         crosstalk_1.1.1     
 [52] globals_0.14.0       ps_1.6.0             lme4_1.1-26         
 [55] rvest_1.0.0          mime_0.10            miniUI_0.1.1.1      
 [58] lifecycle_1.0.0      gtools_3.8.2         statmod_1.4.35      
 [61] future_1.21.0        MASS_7.3-53.1        zoo_1.8-9           
 [64] scales_1.1.1         colourpicker_1.1.0   hms_1.0.0           
 [67] promises_1.2.0.1     Brobdingnag_1.2-6    parallel_4.0.3      
 [70] inline_0.3.17        shinystan_2.5.0      curl_4.3.1          
 [73] gamm4_0.2-6          gridExtra_2.3        StanHeaders_2.26.1  
 [76] loo_2.4.1            stringi_1.5.3        dygraphs_1.1.1.6    
 [79] checkmate_2.0.0      boot_1.3-28          pkgbuild_1.2.0      
 [82] rlang_0.4.11         pkgconfig_2.0.3      matrixStats_0.58.0  
 [85] distributional_0.2.2 lattice_0.20-44      rstantools_2.1.1    
 [88] htmlwidgets_1.5.3    processx_3.5.2       tidyselect_1.1.1    
 [91] parallelly_1.25.0    plyr_1.8.6           magrittr_2.0.1      
 [94] R6_2.5.0             generics_0.1.0       DBI_1.1.1           
 [97] pillar_1.6.0         haven_2.4.1          withr_2.4.2         
[100] mgcv_1.8-35          xts_0.12.1           abind_1.4-5         
[103] modelr_0.1.8         crayon_1.4.1         arrayhelpers_1.1-0  
[106] utf8_1.2.1           grid_4.0.3           readxl_1.3.1        
[109] callr_3.7.0          threejs_0.3.3        reprex_2.0.0        
[112] digest_0.6.27        xtable_1.8-4         httpuv_1.6.0        
[115] RcppParallel_5.1.2   stats4_4.0.3         munsell_0.5.0       
[118] shinyjs_2.0.0       

Can you show how much variation there is?

I’m not sure exactly which variation you are asking about.

Here is an example of the model coefficient estimates based on 5 runs of the simplest model:

#m5, 5 separate runs
#    MOD   Estimate     Est.Error   Q2.5         Q97.5          coef_name
# 1  run1  -1.31788164  0.02745182  -1.37410674  -1.2647062707  Intercept
# 2  run2  -1.31730289  0.02768436  -1.37192762  -1.2614873743  Intercept
# 3  run3  -1.31794519  0.02638331  -1.37264826  -1.2673663424  Intercept
# 4  run4  -1.31752765  0.02689753  -1.37232430  -1.2665930655  Intercept
# 5  run5  -1.31829473  0.03207893  -1.38110757  -1.2607820794  Intercept

# 6  run1  -0.03193247  0.01477063  -0.06080798  -0.0026958692  mat.c
# 7  run2  -0.03156844  0.01521796  -0.06168650  -0.0005007655  mat.c
# 8  run3  -0.03145176  0.01478489  -0.06063284  -0.0022672292  mat.c
# 9  run4  -0.03169886  0.01470793  -0.06015165  -0.0024072946  mat.c
# 10 run5  -0.03137884  0.01493217  -0.06108216  -0.0019326990  mat.c

And here is 5 runs of a model with two additional predictor variables:

#m2, 5 separate runs
# MOD   Estimate      Est.Error   Q2.5          Q97.5         coef_name
# run1  -1.315260040  0.02885907  -1.374045196  -1.259870994  Intercept
# run2  -1.314204578  0.02797678  -1.369151397  -1.257834652  Intercept
# run3  -1.314528295  0.02656328  -1.368891603  -1.263384278  Intercept
# run4  -1.314436063  0.02730919  -1.370944743  -1.258044571  Intercept
# run5  -1.314506126  0.02581979  -1.365308218  -1.262443022  Intercept

# run1  -0.038917465  0.01516069  -0.069208559  -0.008609550  mat.c
# run2  -0.039139009  0.01540670  -0.069999258  -0.008331578  mat.c
# run3  -0.039253258  0.01514418  -0.069678895  -0.009033195  mat.c
# run4  -0.039038648  0.01513483  -0.068389930  -0.008349647  mat.c
# run5  -0.039034168  0.01492491  -0.068164770  -0.009065406  mat.c

# run1   0.007648484  0.01413876  -0.020563842   0.035440357  tdn
# run2   0.008024670  0.01421170  -0.021177260   0.035885525  tdn
# run3   0.008043082  0.01423622  -0.020508256   0.035909131  tdn
# run4   0.007897705  0.01431778  -0.020464370   0.036545798  tdn
# run5   0.008107296  0.01411709  -0.020020868   0.036036824  tdn

# run1   0.035905212  0.02235963  -0.009236896   0.078884676  tdp
# run2   0.035791968  0.02255516  -0.009670718   0.078734137  tdp
# run3   0.036192866  0.02233158  -0.008058267   0.079846615  tdp
# run4   0.036233605  0.02210314  -0.008219619   0.079292852  tdp
# run5   0.035870075  0.02249603  -0.008918133   0.079919496  tdp

Below are the estimated model weights for all 6 models, based on 5 separate model fits. Each model in each run was run through loo() and the weights came from the loo_model_weights() function (a rough sketch of one run follows the table).

#       run1   run2   run3   run4   run5
# m1:   0.115  0.000  0.000  0.000  0.000
# m2:   0.000  0.170  0.000  0.000  0.171
# m3:   0.000  0.000  0.000  0.000  0.097
# m4:   0.000  0.000  0.000  0.000  0.000
# m5:   0.883  0.830  0.80   0.44   0.69
# m6:   0.000  0.000  0.20   0.56   0.042
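
For clarity, each run amounted to something like the sketch below; refit_all_models() is just a hypothetical stand-in for refitting mod1 through mod6 with brm()/update() as shown above.

# Rough sketch of one run; refit_all_models() is a hypothetical stand-in for
# refitting mod1..mod6 with brm()/update() as shown earlier.
fits <- refit_all_models()
loos <- lapply(fits, loo, reloo = TRUE, cores = 6)
loo_model_weights(loos)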

Thanks for the reply, I appreciate the help.

Sorry for the delay, I have been busy with other things.

Regarding the variation you mentioned in your post:

Can you also show the loo output for run1 and run2 for models m1, m5, and m6, and the corresponding loo_comparison() outputs showing elpd_diff and se_diff? Those would help clarify the source of the variation.

Below is the output you asked for. FYI, these were re-fit again this morning in response to your comment (I have not been saving all of the model fits; there is a sketch below of what I could do about that going forward).
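
Just a sketch of what I mean by saving the fits, so later reruns could be compared against exactly the same posterior draws:

# Sketch: keep the fitted models and loo objects on disk so later reruns can
# be compared against exactly the same posterior draws.
saveRDS(mod1,  "mod1_fit.rds")   # brm() also has a file argument that caches the fit
saveRDS(loo_1, "loo_1.rds")
# reload later with mod1 <- readRDS("mod1_fit.rds"), etc.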

> loo_1
Computed from 12000 by 160 log-likelihood matrix

         Estimate   SE
elpd_loo    129.3 20.4
p_loo        20.8  6.5
looic      -258.6 40.9
------
Monte Carlo SE of elpd_loo is 0.3.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
> loo_1b

Computed from 12000 by 160 log-likelihood matrix

         Estimate   SE
elpd_loo    129.2 20.4
p_loo        21.0  6.6
looic      -258.4 40.8
------
Monte Carlo SE of elpd_loo is 0.3.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
> loo_5

Computed from 12000 by 160 log-likelihood matrix

         Estimate   SE
elpd_loo    131.2 20.8
p_loo        19.4  6.6
looic      -262.4 41.7
------
Monte Carlo SE of elpd_loo is 0.4.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     159   99.4%   8         
 (0.5, 0.7]   (ok)         1    0.6%   6151      
   (0.7, 1]   (bad)        0    0.0%   <NA>      
   (1, Inf)   (very bad)   0    0.0%   <NA>      

All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.
> loo_5b

Computed from 12000 by 160 log-likelihood matrix

         Estimate   SE
elpd_loo    131.1 20.8
p_loo        19.4  6.6
looic      -262.3 41.7
------
Monte Carlo SE of elpd_loo is 0.2.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
> loo_6

Computed from 12000 by 160 log-likelihood matrix

         Estimate   SE
elpd_loo    129.5 20.6
p_loo        20.5  6.6
looic      -259.0 41.3
------
Monte Carlo SE of elpd_loo is 0.3.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.
> loo_6b

Computed from 12000 by 160 log-likelihood matrix

         Estimate   SE
elpd_loo    129.5 20.6
p_loo        20.5  6.6
looic      -259.1 41.3
------
Monte Carlo SE of elpd_loo is 0.2.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

I don’t see a loo_comparison() function, but I assume loo_compare() gives the output you are after?

> loo_compare(loo_1, loo_5, loo_6)
     elpd_diff se_diff
mod5  0.0       0.0   
mod6 -1.7       1.9   
mod1 -1.9       2.0 
> loo_compare(loo_1b, loo_5b, loo_6b)
      elpd_diff se_diff
mod5b  0.0       0.0   
mod6b -1.6       1.9   
mod1b -1.9       2.0   
> loo_compare(loo_1, loo_1b, loo_5, loo_5b, loo_6, loo_6b)
      elpd_diff se_diff
mod5   0.0       0.0   
mod5b -0.1       0.1   
mod6b -1.7       1.9   
mod6  -1.7       1.9   
mod1  -1.9       2.0   
mod1b -2.0       2.0