In approximate LOO, Vehtari et al. (2017) advise against truncated importance sampling because it introduces a bias. However, I am not sure I completely understand the paper’s evidence for that bias. Is there a computationally efficient way to figure out if TIS is still usable for a given model averaging problem? loo::loo() supports TIS, and I would very much like to use it because the alternatives all have tangible issues:

Pareto-smoothed importance sampling always reports high k values on 1-5% of my data, even using datasets simulated from one of the models. Both models in the ensemble are already so simple that I do not think I can reparameterize. The problem persists regardless of how diffuse or concentrated the priors are.

K-fold cross-validation is computationally expensive. And in simulation studies with K = 10, I am finding that the model averaging weights are highly variable because of how many different random K-fold partitions are possible.

The latest version of Pareto Smoothed Importance Sampling paper has more details on the bias and variance of PSIS, TIS, and IS. If Pareto-k diagnostic gives warnings, then TIS is still a worse choice than PSIS, TIS just doesn’t have that diagnostic. TIS was included in the package to make it easier for others to replicate the experiments, but it is better to use PSIS instead TIS.

Can you tell more about your models? There are some possible ways forward and you can check some of these in CV-FAQ, but I can also help if you tell more.

Thank you so much for offering to help, @avehtari. I am working on a historical borrowing problem, and I am trying to average the independent and pooled models from Methods • historicalborrowlong. Here, the weight on the pooled model acts as a commensurability metric among the current and the historical data sources.

As I revisit my computational experiments in more depth, I am realizing that I can only reproduce k > 0.7 on real (and confidential) clinical trial data. In the simpler non-longitudinal case, k drops below 0.7 if I stop adjusting for baseline covariates (one of which is a complicated factor with many levels). In the non-longitudinal case without covariate adjustment, I am still seeing around 40% of the Pareto k values (around 60 in all) above 1 on real data, with p_loo around 390-400, even with the moment-matching correction.

The longitudinal non-covariate-adjusted pooled model has 178 main parameters, and the equivalent independent model has 196 main parameters. However, 24% of the response values are missing, and the model imputes them as model parameters. If I run the analysis on just the completers (patients with no missing responses) and use simpler correlation structures instead of unstructured, then Pareto k values do drop below 0.7. (AR(1) worked for the pooled model. It almost worked for the independent model, but there was one Pareto k value of 1.43, and I had to drop down to a diagonal correlation matrix to completely eliminate diagnostic issues.) I am not sure my team would be comfortable with all this simplification for the final analysis models.

Is there a way to preserve model complexity and analyze missing data correctly while still achieving Pareto k values below 0.7? If not, would it be at all reasonable to use simple models to compute the weights and then use those weights to average the more complicated versions of those models? I think the weights would still serve their purpose as a commensurability metric, and I suspect I would still see strong operating characteristics in simulation studies, but I am concerned that this shortcut may not be statistically correct.

You could try mixture Importance sampling by Silva and Zanella (2022). There is a PR for a vignette, and you can see the markdown vignette in a git branch. It’s quite easy to implement, and if the mixture distribution is not strongly multimodal it can work better than the moment matching.

Wow, mixture IS looks elegant, clever, direct, and as you say, easy to implement. Thank you so much for the reference. Although I am still connecting the dots, I am eager to try out the method.

One naive question: in a model averaging context, is it permissible to perform the usual Bayesian inference directly on MCMC samples of theta from the marginal q_mix(theta), or should I use the weights from mixture IS to average a completely different set of samples from the original posterior p(theta | y)? I ask because the latter seems to require fitting each model twice.

MixtureIS requires two MCMC runs. One using the actual target posterior and one using the mixture posterior.

Thanks for confirming, that makes sense.

Can you tell more about the goal of your model averaging? It helps to know about use cases when developing methods and workflows

It is early enough in development that I hesitate to elaborate more in public forums, but I would be happy to follow up in a private email thread if that works for you. Thank you so much for pointing me in a promising direction for the LOO CV piece.

No hurry. I’ve not yet seen any good real example where model averaging would be better than model expansion and more carefully thought priors, so I keep asking, and happy if you remember to notify me when you can tell how did it work in y your problem.

Can you point to good applied examples where model expansion was not feasible and model averaging was an important part? The methodological papers show that averaging is better than single model, but rarely compare to proper model expansion.