Model Assessment for Variational Inference in `cmdstanr`

I have a model written in cmdstanr with C clusters. My aim is to fit the model for various values of C and select the number of clusters that maximises the model performance. For MCMC fitted models, I would lean towards using elpd (from the loo package), but since I am fitting a mixture model I decided to fit it via VI instead.

In my model, I specify:

  • the base parameters of my model
  • a transformed set of parameters
  • generated quantities for post-posterior checks. This block also includes generated quantities of the log likelihood, labelled as log_lik.

Following some searches online and reading this previous post I tried to approximate the loo posterior using the following snippet (with my VI fitted model being called fit, and the input data labelled data_list):

log_p <- fit$lp()
log_g <- fit$lp_approx()
loo_approximate_posterior(fit$draws('log_lik'),
                          draws = as_draws_matrix(fit$draws()),
                          data=data_list,
                          log_p = log_p,
                          log_g = log_g,
                          cores = 4)

The output for the function had attrocious Pareto k diagnostics, which surprised me as a glance at the PPC suggested that I got some decent results:

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)         0   0.0%  <NA>      
 (0.5, 0.7]   (ok)           0   0.0%  <NA>      
   (0.7, 1]   (bad)          0   0.0%  <NA>      
   (1, Inf)   (very bad) 20570 100.0%  1         
See help('pareto-k-diagnostic') for details.

Am I doing the right thing here? Does the llhood need to be calculated outside the draws object like the author of the linked post does? Is there an alternative approach that works?

Any help would be much appreciated!

All you need to run loo is the log likelihood calculation for the data items stored in a one-dimensional container log_lik (I’m not 100% sure if this has to be an array or if it can also be a vector). So you need to write the model that way.

High Pareto-K diagnostic values means that you are probably not getting well estimated integrals.

I would try to diagnose visually in a case where you know the answer. It can be very hard to estimate a mixture well if there are symmetries. For example, if there is a reflective symmetry that doesn’t change the density, you can approximate and find only a single mode and then all of our density estimates are half of what they should be. You see this in something like

y ~ normal(abs(mu), sigma);
mu ~ normal(0, 5);
sigma ~ lognormal(0, 1)

where if the true mean is 2.7, you’ll get modes in the density at mu = 2.7 and mu = -2.7, but sampling and VI is only likely to find one and you’ll get atrocious fits but perfectly reasonable predictions for simulated data because one mode covers all you need.