{loo} truncated importance sampling?

In approximate LOO, Vehtari et al. (2017) advise against truncated importance sampling because it introduces a bias. However, I am not sure I completely understand the paper’s evidence for that bias. Is there a computationally efficient way to figure out if TIS is still usable for a given model averaging problem? loo::loo() supports TIS, and I would very much like to use it because the alternatives all have tangible issues:

  • Pareto-smoothed importance sampling always reports high k values on 1-5% of my data, even using datasets simulated from one of the models. Both models in the ensemble are already so simple that I do not think I can reparameterize. The problem persists regardless of how diffuse or concentrated the priors are.
  • K-fold cross-validation is computationally expensive. And in simulation studies with K = 10, I am finding that the model averaging weights are highly variable because of how many different random K-fold partitions are possible.

@avehtari

The latest version of the Pareto Smoothed Importance Sampling paper has more details on the bias and variance of PSIS, TIS, and IS. If the Pareto-k diagnostic gives warnings, TIS is still a worse choice than PSIS; TIS just doesn't have that diagnostic. TIS was included in the package to make it easier for others to replicate the experiments, but it is better to use PSIS instead of TIS.

Can you tell me more about your models? There are some possible ways forward, and you can check some of these in the CV-FAQ, but I can also help if you tell me more.

Thank you so much for offering to help, @avehtari. I am working on a historical borrowing problem, and I am trying to average the independent and pooled models from Methods • historicalborrowlong. Here, the weight on the pooled model acts as a commensurability metric between the current and historical data sources.

As I revisit my computational experiments in more depth, I am realizing that I can only reproduce k > 0.7 on real (and confidential) clinical trial data. In the simpler non-longitudinal case, k drops below 0.7 if I stop adjusting for baseline covariates (one of which is a complicated factor with many levels). In the longitudinal case without covariate adjustment, I am still seeing around 40% of the Pareto k values (around 60 in all) above 1 on real data, with p_loo around 390-400, even with the moment-matching correction.

The longitudinal non-covariate-adjusted pooled model has 178 main parameters, and the equivalent independent model has 196 main parameters. However, 24% of the response values are missing, and the model imputes them as model parameters. If I run the analysis on just the completers (patients with no missing responses) and use simpler correlation structures instead of unstructured, then Pareto k values do drop below 0.7. (AR(1) worked for the pooled model. It almost worked for the independent model, but there was one Pareto k value of 1.43, and I had to drop down to a diagonal correlation matrix to completely eliminate diagnostic issues.) I am not sure my team would be comfortable with all this simplification for the final analysis models.

Is there a way to preserve model complexity and analyze missing data correctly while still achieving Pareto k values below 0.7? If not, would it be at all reasonable to use simple models to compute the weights and then use those weights to average the more complicated versions of those models? I think the weights would still serve their purpose as a commensurability metric, and I suspect I would still see strong operating characteristics in simulation studies, but I am concerned that this shortcut may not be statistically correct.

You could try mixture importance sampling by Silva and Zanella (2022). There is a PR for a vignette, and you can see the markdown vignette in a git branch. It's quite easy to implement, and if the mixture distribution is not strongly multimodal it can work better than moment matching.
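
To make "easy to implement" concrete, here is a minimal sketch of the idea (my own illustrative toy for an iid normal model, not the paper's reference code). The mixture of leave-one-out posteriors q_\text{mix}(\theta) \propto p(\theta | y) \sum_{j = 1}^n p(y_j | \theta)^{-1} only needs one extra log_sum_exp term on top of the usual log posterior:

```stan
// Illustrative sketch only: mixture of leave-one-out posteriors for an
// iid normal model, q_mix(theta) proportional to
// p(theta | y) * sum_j 1 / p(y_j | theta).
data {
  int<lower=1> n;
  vector[n] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  vector[n] log_lik;
  for (j in 1:n) {
    log_lik[j] = normal_lpdf(y[j] | mu, sigma);
  }
  mu ~ normal(0, 10);               // illustrative priors
  sigma ~ normal(0, 5);
  target += sum(log_lik);           // full-data log likelihood
  target += log_sum_exp(-log_lik);  // extra term that turns the posterior into
                                    // the mixture of leave-one-out posteriors
}
generated quantities {
  // pointwise log likelihoods, saved so the per-fold importance weights
  // can be computed from the mixture draws afterwards
  vector[n] log_lik;
  for (j in 1:n) {
    log_lik[j] = normal_lpdf(y[j] | mu, sigma);
  }
}
```

Dropping the last `target +=` line in the model block gives back the ordinary posterior, which is the other MCMC run mentioned below.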

Wow, mixture IS looks elegant, clever, direct, and as you say, easy to implement. Thank you so much for the reference. Although I am still connecting the dots, I am eager to try out the method.

One naive question: in a model averaging context, is it permissible to perform the usual Bayesian inference directly on MCMC samples of \theta from the marginal q_\text{mix}(\theta), or should I use the weights from mixture IS to average a completely different set of samples from the original posterior p(\theta | y)? I ask because the latter seems to require fitting each model twice.

MixtureIS requires two MCMC runs. One using the actual target posterior and one using the mixture posterior.

Can you tell me more about the goal of your model averaging? It helps to know about use cases when developing methods and workflows.

Thanks for confirming, that makes sense.

It is early enough in development that I hesitate to elaborate more in public forums, but I would be happy to follow up in a private email thread if that works for you. Thank you so much for pointing me in a promising direction for the LOO CV piece.

No hurry. I've not yet seen any good real example where model averaging would be better than model expansion and more carefully thought-out priors, so I keep asking, and I'd be happy if you remember to notify me when you can tell how it worked out in your problem.

Not having used model averaging myself, but having been swayed by the literature on it, I find this comment surprising.

Can you point to good applied examples where model expansion was not feasible and model averaging was an important part? The methodological papers show that averaging is better than a single model, but they rarely compare it to proper model expansion.

Re model expansion: my team and I would have preferred a finite mixture model, but we had trouble implementing it successfully on our longitudinal data. I might revisit that at some point.

I did implement the method by Silva and Zanella (2022) for those simpler longitudinal models (closed source for now), and it shows promise on both simulated and real data. However, sampling efficiency plummets when there are subjects with partially missing responses. I cannot confidently say why, but it looks and feels like an identifiability problem: we're sampling from p(\theta | y) \left ( \sum_{j = 1}^n p(y_j | \theta)^{-1} \right ) (equation 8) instead of just p(\theta | y), and the missing elements of each y_j are sampled as model parameters. (For the models I am working with, the y_j's are independent, and each y_j is multivariate normal.) Do you have any suggestions? I was thinking I might use pen-and-paper math instead of extra parameters to integrate each multivariate normal likelihood p(y_j | \theta) over the missing elements, but I haven't tried yet.

The mixture distribution can be highly multimodal. When you remove a subject's observed (non-missing) information, that subject's missing-value distribution changes a lot, and each such subject may create another mode in the mixture. Representing the partially missing data as unknown parameters is also the likely reason why using the usual posterior as the proposal doesn't work well.

If you can integrate something out analytically or with quadrature, that is likely to help.

Yeah, integrating out is extremely helpful. Luca Silva recommended I sample from p(\theta | y) \left ( \sum_{j = 1}^n p(y_j^\text{obs} | y_j^\text{mis}, \theta)^{-1} \right ), where y_j^\text{obs} and y_j^\text{mis} are the observed and missing slices of y_j, respectively. Multivariate normal conditionals look messy and inefficient to code directly in Stan. (I would rather not deal with that much matrix slicing and multiplication.) But the joint and marginal look super easy and efficient. I am already calculating the joint density in an efficient, vectorized fashion with the missing observations modeled as parameters, and the marginal is easy to get from a single linear transformation. So I can get the desired log conditional \log p(y^\text{obs}_j | y^\text{mis}_j, \theta) with just \log p(y^\text{obs}_j, y^\text{mis}_j | \theta) - \log p(y^\text{mis}_j | \theta). This workaround seems to perform really well in the bivariate case, and I am eager to try it out on the more complicated longitudinal models I have in mind. Thank you again for all your help (and an appreciative shoutout to Luca Silva for helping me in a separate conversation)!
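
In Stan terms, a minimal sketch of that identity could look like the following (a hypothetical helper, not the code I am actually running), assuming each y_j is multivariate normal with mean \mu and covariance \Sigma and the missing entries are filled in by parameters:

```stan
functions {
  // Sketch: log p(y_obs | y_mis, theta) for a multivariate normal,
  // computed as log joint minus log marginal of the missing slice.
  // y_full holds the observed entries plus the missing entries (as parameters);
  // idx_mis indexes the missing entries.
  real log_mvn_obs_given_mis(vector y_full, vector mu, matrix Sigma,
                             array[] int idx_mis) {
    real log_joint = multi_normal_lpdf(y_full | mu, Sigma);
    // the marginal of a multivariate normal over a subset of coordinates is
    // multivariate normal with the matching sub-vector of mu and sub-matrix of Sigma
    real log_marginal_mis = multi_normal_lpdf(y_full[idx_mis] | mu[idx_mis],
                                              Sigma[idx_mis, idx_mis]);
    return log_joint - log_marginal_mis;
  }
}
```

In the mixture run, this per-subject term is what would go inside the log_sum_exp (with a minus sign), while the full-data likelihood still uses the joint density with the missing entries as parameters.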

Update for completeness: I am finding it much more computationally efficient, reliable, robust, and simple to validate when I implicitly integrate out all the y^\text{mis}_j's and just work with the marginal p(y^\text{obs}_j | \theta) for both the target posterior and q_\text{mix}(\theta). The technique at 3.5 Missing multivariate data | Stan User’s Guide is easy to generalize to the n-dimensional multivariate normal case, and it is > 3.5x faster than the technique at 3.1 Missing Data | Stan User’s Guide despite the fact that the latter reduces the number of Cholesky factorizations and lets me vectorize the likelihood calculation, while the former does not.
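
For reference, here is a stripped-down sketch of that marginal-only approach (illustrative data layout, names, and priors, not the actual analysis model), assuming a common mean vector and covariance matrix across subjects and a ragged set of observed indices per subject:

```stan
// Illustrative sketch: likelihood on the observed entries only, with the
// missing entries integrated out implicitly (the observed slice of a
// multivariate normal is itself multivariate normal).
data {
  int<lower=1> n_subjects;
  int<lower=1> n_times;
  array[n_subjects] vector[n_times] y;                             // responses, padded where missing
  array[n_subjects] int<lower=1, upper=n_times> n_obs;             // number of observed entries per subject
  array[n_subjects, n_times] int<lower=1, upper=n_times> idx_obs;  // observed indices, padded
}
parameters {
  vector[n_times] mu;
  cholesky_factor_corr[n_times] L;
  vector<lower=0>[n_times] tau;
}
model {
  matrix[n_times, n_times] Sigma
    = multiply_lower_tri_self_transpose(diag_pre_multiply(tau, L));
  mu ~ normal(0, 10);          // illustrative priors
  tau ~ normal(0, 5);
  L ~ lkj_corr_cholesky(2);
  for (j in 1:n_subjects) {
    array[n_obs[j]] int idx = idx_obs[j, 1:n_obs[j]];
    // marginal of the observed slice: matching sub-vector of mu, sub-matrix of Sigma
    target += multi_normal_lpdf(y[j][idx] | mu[idx], Sigma[idx, idx]);
  }
}
```

The per-subject terms above are also exactly the \log p(y^\text{obs}_j | \theta) values needed for the extra log_sum_exp term in the mixture run, so the same calculation serves both the target posterior and q_\text{mix}(\theta).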

I am pretty sure the models work now; they seem to perform great on real-world data. I just need to run a calibration study to confirm (update: the models indeed do a good job recapturing parameters simulated from the prior). Thanks again for all your help!
