I have a model written in Stan and fitted with cmdstanr, with C clusters. My aim is to fit the model for various values of C and select the number of clusters that maximises model performance. For MCMC-fitted models I would lean towards using elpd (from the loo package), but since I am fitting a mixture model I decided to fit it via VI instead.
In my model, I specify:
- the base parameters of my model
- a transformed set of parameters
- generated quantities for posterior predictive checks. This block also includes the pointwise log likelihood, labelled log_lik.
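For context, the fitting workflow for a single value of C looks roughly like this (the model file name and the contents of the data list are placeholders):

library(cmdstanr)

# Compile the Stan program (file name is a placeholder)
mod <- cmdstan_model("mixture_model.stan")

# Data passed to Stan; contents are placeholders
data_list <- list(
  N = N,   # number of observations (placeholder)
  y = y,   # observed data (placeholder)
  C = 3    # candidate number of clusters
)

# Fit via variational inference; in practice I refit for several values of C
# and want to compare the resulting elpd estimates
fit <- mod$variational(data = data_list, seed = 123)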
Following some searches online and reading this previous post, I tried to compute approximate LOO for the VI posterior using the snippet below (my VI-fitted model is called fit and the input data list is data_list):
library(posterior)
library(loo)

# Log densities of the model (log_p) and the variational approximation (log_g)
log_p <- fit$lp()
log_g <- fit$lp_approx()

loo_approximate_posterior(
  fit$draws("log_lik"),                  # pointwise log-likelihood draws
  draws = as_draws_matrix(fit$draws()),  # all draws from the fit
  data = data_list,
  log_p = log_p,
  log_g = log_g,
  cores = 4
)
The output of the function had atrocious Pareto k diagnostics, which surprised me, as a glance at the posterior predictive checks (PPC) suggested that I got some decent results:
Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)         0   0.0%  <NA>
 (0.5, 0.7]   (ok)           0   0.0%  <NA>
   (0.7, 1]   (bad)          0   0.0%  <NA>
   (1, Inf)   (very bad) 20570 100.0%  1
See help('pareto-k-diagnostic') for details.
Am I doing the right thing here? Does the log likelihood need to be calculated outside the draws object, as the author of the linked post does? Is there an alternative approach that works?
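To make the second question concrete, this is roughly what I mean by computing the log likelihood outside the draws object; compute_log_lik() here is a hypothetical, user-written function that would evaluate the mixture log likelihood for each draw and observation:

# Hypothetical alternative: build the pointwise log-likelihood matrix in R
# from the parameter draws rather than reading log_lik from generated quantities
param_draws <- as_draws_matrix(fit$draws())

# compute_log_lik() is a placeholder; it should return a draws x observations
# matrix of log-likelihood values for the mixture model
log_lik_mat <- compute_log_lik(param_draws, data_list)

loo_approximate_posterior(
  log_lik_mat,
  log_p = log_p,
  log_g = log_g,
  cores = 4
)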
Any help would be much appreciated!