Model selection with loo and bridge sampling

I have run a few mixed-effects models in brms with three categorical predictors and one continuous predictor. I used loo to compare main-effects and interaction models with and without some of the predictors, but I'm not sure how to interpret the results.

Model comparisons:

         elpd_diff se_diff
model1         0.0     0.0
model2        -1.1     2.2
model3        -1.1     2.2
model4        -1.2     3.1
model5        -1.2     2.6
model6        -2.1     2.8

For me, it is hard to get an intuitive sense of what constitutes a large elpd_diff relative to its se_diff. It seems to me that there are no huge differences in predictive validity between the models. To check this, I used bridge sampling to compare model1 against each of the other models. The BF for model1 over model3 was ~14 in favor of model1. The other comparisons were either weakly in favor of model1 or inconclusive, except model5, where a BF of ~1000 favored model1.
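In case it helps others reading along, here is a minimal sketch of how such comparisons can be run in brms. The model objects `fit1` ... `fit3` are hypothetical stand-ins for the fitted models, not objects from the question:

```r
library(brms)

# LOO comparison of hypothetical fitted brms models fit1 ... fit3
loo_compare(loo(fit1), loo(fit2), loo(fit3))

# Bayes factor of model1 over model3 via bridge sampling;
# the models must have been fit with save_pars = save_pars(all = TRUE)
bayes_factor(fit1, fit3)
```

Note that `bayes_factor()` requires all parameters to have been saved at fitting time, and its estimate can be unstable unless the chains are run considerably longer than needed for posterior inference.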

My next idea was to average over the models. I used the loo_model_weights function with the pseudobma method to obtain model weights.

Method: pseudo-BMA+ with Bayesian bootstrap
       weight
model1 0.341
model2 0.123
model3 0.131
model4 0.161
model5 0.119
model6 0.124
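The weighting step itself can be reproduced in a self-contained way with the loo package, since `loo_model_weights()` also accepts plain pointwise log-likelihood matrices. The matrices below are simulated stand-ins for the real models, purely for illustration:

```r
library(loo)
set.seed(42)

# Simulated pointwise log-likelihood matrices (posterior draws x observations)
# standing in for three fitted models; values are made up for illustration.
S <- 2000; N <- 60
ll1 <- matrix(rnorm(S * N, mean = -1.00, sd = 0.2), S, N)
ll2 <- matrix(rnorm(S * N, mean = -1.02, sd = 0.2), S, N)
ll3 <- matrix(rnorm(S * N, mean = -1.02, sd = 0.2), S, N)

# pseudo-BMA+ weights (Bayesian bootstrap is on by default)
w <- loo_model_weights(list(ll1, ll2, ll3), method = "pseudobma")
print(w)  # non-negative weights that sum to 1
```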

Considering these results, it seems to me that none of the models clearly fit the data “best.” Intuitively I think the best approach would be to average over all of the models. Does this sound sensible?

Is it possible to use the hypothesis function to perform testing on the averaged parameters?

Hi, sorry for not getting to you earlier; this is a relevant question.

A very rough rule of thumb is that an elpd_diff larger than 2 * se_diff is bigger than most of the noise we have in evaluating elpd (which could mean either that the difference is large or that there is little noise). I agree that these do not look like very big differences, but I find it more useful to use the model weights (as you did).
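Applied to the table in the question (numbers copied from the loo_compare output), the rule of thumb can be checked directly:

```r
# elpd_diff and se_diff values from the loo_compare table in the question
elpd_diff <- c(model1 = 0.0, model2 = -1.1, model3 = -1.1,
               model4 = -1.2, model5 = -1.2, model6 = -2.1)
se_diff   <- c(0.0, 2.2, 2.2, 3.1, 2.6, 2.8)

# rough rule of thumb: |elpd_diff| > 2 * se_diff flags a clear difference
clearly_different <- abs(elpd_diff) > 2 * se_diff
print(clearly_different)  # FALSE for every model: no clear differences
```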

The fact that you get different results with loo and with bridgesampling is not surprising; the two answer quite different questions (my current best thoughts on this are at Hypothesis testing, model selection, model comparison - some thoughts). In particular, Bayes factors do weird stuff when none of your models is a good fit for the data, while loo is mostly robust to this.

The loo results you see are indeed indicative of none of the models working much better than the others in leave-one-out cross-validation (as also reflected by the model weights). Averaging over the models makes sense if your goal is out-of-sample prediction.

I don’t think so. Note that those weights do not average over parameters; they average over predictions. I would expect many parameters are not even shared by the individual models, so I don’t think you could meaningfully define what the expected behaviour would be. What you can do is make predictions from the “ensemble” model and interpret those (e.g. how big a change the model predicts when reassigning all subjects to one of the treatment groups).
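The ensemble-prediction route can be sketched with brms's `pp_average()`, which combines posterior predictions across models using model weights. The objects `fit1` ... `fit3`, the data frame `mydata`, and the predictor name `treatment` are all hypothetical stand-ins here:

```r
library(brms)

# Hypothetical counterfactual: reassign all subjects to one treatment group
newdat <- mydata
newdat$treatment <- "A"

# Weighted average of the models' predictions (not of their parameters);
# weights = "pseudobma" matches the pseudo-BMA+ weights discussed above
pp_average(fit1, fit2, fit3,
           weights = "pseudobma",
           newdata = newdat)
```

Comparing these ensemble predictions against the same call with `newdat$treatment <- "B"` would then give a model-averaged answer to the treatment question, without ever needing averaged parameters.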

Hope that clarifies more than confuses.

Best of luck with your modelling.