Diagnostics for Mixture IS leave-one-out cross-validation?

Hi all,

I’m currently trying to implement some model comparison metrics for a high-dimensional model.

I don’t have a MWE to share, but there’s a writeup of a previous implementation of the model here: Timescales of influenza A/H3N2 antibody dynamics. In short, there are a small number of continuous-valued parameters (~10) alongside a large number of discrete binary parameters (~3000), with these sampled using a Gibbs sampler. These are being fit to 12,000 or so observations across 70 individuals (antibody measurements).

Given that the model takes a fair amount of time to fit, brute-force LOO-CV is not feasible, so an approximation would be very helpful. However, PSIS does not seem suitable either, given the poor distribution of the Pareto-k diagnostic values.

Reading through the CV-FAQ, the Mixture IS approach seems to be a good solution, and I have been able to implement it rather easily.
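In case it helps, my implementation is roughly along these lines (a minimal sketch, not my exact code): I assume `log_lik` is an (S, n) array of pointwise log-likelihoods log p(y_i | theta_s) evaluated at S draws from the mixture posterior q(theta) ∝ p(theta | y) · Σ_i 1/p(y_i | theta), and compute the leave-one-out predictive densities in log space.

```python
# Sketch of the Mixture-IS elpd_loo estimator. Assumes `log_lik` is an
# (S, n) array of log p(y_i | theta_s) for draws theta_s from the
# *mixture* posterior q(theta) ∝ p(theta | y) * sum_i 1/p(y_i | theta).
# Names and array layout are illustrative, not from a specific package.
import numpy as np
from scipy.special import logsumexp

def mixture_is_elpd(log_lik):
    # log sum_j 1/p(y_j | theta_s), per draw s: shape (S,)
    log_denom = logsumexp(-log_lik, axis=1)
    # shared numerator: log sum_s [ 1 / sum_j 1/p(y_j | theta_s) ]
    log_num = logsumexp(-log_denom)
    # per-observation term: log sum_s [ (1/p(y_i|theta_s)) / sum_j 1/p(y_j|theta_s) ]
    log_den_i = logsumexp(-log_lik - log_denom[:, None], axis=0)
    # log p(y_i | y_-i) for each observation i
    return log_num - log_den_i
```

With n = 1 this reduces to the log of the plain Monte Carlo average of p(y_1 | theta) over draws from p(theta | y_{-1}), which is a useful sanity check.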

However, this approach doesn’t seem to come with any comparable diagnostics, so I’m not sure how to assess the validity of the outputs. It behaves well qualitatively (e.g. it reports worse scores for chains that have not converged, or for models from which I have removed an important component) but otherwise seems a bit opaque.

Would it be valid, e.g., to compare the ELPD estimates produced by PSIS against those produced by Mixture IS?


Great to hear the vignette has been useful. Mixture IS diagnostics are on my TODO list, but with limited time and the need for some research it has not yet happened. Hopefully we’ll make some progress in the fall.

If the approaches agree, it’s likely they are both working well, but as you said, if they disagree the comparison is not that helpful.

You can still use the Pareto-k diagnostic, although it is likely to be pessimistic (i.e. estimating too-high k values), and in the case of high Pareto-k values the error behaves differently than in PSIS-LOO.
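As a rough version of that check, one could fit the generalized Pareto shape to the upper tail of the observed importance ratios. This is a sketch only: it uses a plain maximum-likelihood fit from scipy rather than the regularized Zhang & Stephens (2009) fit that PSIS uses, and a simple fixed tail fraction rather than the PSIS tail-size rule; the function name is illustrative.

```python
# Rough Pareto-k check on mixture-IS importance ratios. `log_ratios` is
# assumed to be a 1-D array of log importance ratios. Plain ML fit of the
# generalized Pareto shape to tail excesses (not the regularized PSIS fit).
import numpy as np
from scipy.stats import genpareto

def pareto_k(log_ratios, tail_frac=0.2):
    # Rescale for numerical stability; the GPD shape is scale-invariant.
    r = np.sort(np.exp(log_ratios - log_ratios.max()))
    m = max(int(np.ceil(tail_frac * r.size)), 5)  # tail sample size
    mu = r[-m - 1]                                # threshold below the tail
    # Fit GPD shape k to the threshold excesses, location fixed at 0.
    k, _, _ = genpareto.fit(r[-m:] - mu, floc=0.0)
    return k
```

Values of k above roughly 0.7 would then suggest the ratios are too heavy-tailed for the sample size, with the caveat above that for Mixture-IS this is likely pessimistic.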

You can also use the importance-weighted variance estimate (eq. 5 in the PSIS paper) to estimate the MCSE. If the observed importance ratios have a nasty distribution this may underestimate the MCSE, but it is likely to be useful anyway (by construction the Mixture-IS ratios have finite variance, but the distribution can still be nasty in the sense of requiring a very large sample size).
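For a self-normalized importance sampling estimate, that variance-based MCSE can be sketched as follows (argument names are illustrative; `h` would be, e.g., p(y_i | theta_s) per draw):

```python
# Self-normalized IS estimate and its MCSE from the importance-weighted
# variance (in the spirit of eq. 5 of the PSIS paper). `log_weights` and
# `h` are 1-D arrays over posterior draws; names are illustrative.
import numpy as np

def snis_mcse(log_weights, h):
    w = np.exp(log_weights - log_weights.max())
    w /= w.sum()                        # normalized weights, sum to 1
    est = np.sum(w * h)                 # self-normalized IS estimate
    var = np.sum(w**2 * (h - est)**2)   # weighted variance of the estimate
    return est, np.sqrt(var)
```

With uniform weights this reduces to the usual sd/sqrt(S) Monte Carlo standard error, which is a quick way to check an implementation.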

The best way to diagnose is to compare estimates from independent MCMC runs (for the mixture distribution). Even 10 repetitions can provide useful information about the variability. In the end, you can then combine all the draws for an even more accurate estimate.
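The bookkeeping for that is simple; as a sketch, with `elpd_runs` assumed to hold one elpd_loo total per independent run:

```python
# Run-to-run variability of elpd_loo across independent MCMC runs of the
# mixture posterior. `elpd_runs` is one elpd total per run (illustrative).
import numpy as np

def run_to_run_summary(elpd_runs):
    e = np.asarray(elpd_runs, dtype=float)
    sd = e.std(ddof=1)                     # between-run standard deviation
    return e.mean(), sd, sd / np.sqrt(e.size)  # mean, sd, se of the mean
```

If the between-run standard deviation is small relative to the elpd differences you care about, the estimator is behaving well enough for your comparison.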
