I fitted 4 different hierarchical models on a large dataset (~70 000 observations) and would like to compare their performance using PSIS-LOO. I started with a baseline model and added complexity to the other three (polynomial terms + interactions).
Here are some specifications of the data:
- env_ID = factor with 27 levels (environments)
- ID = factor with 2171 levels (individuals)
Here is the structure of the first model. I want to investigate the relationship between different predator hunting behaviours and hunting success.
# Set priors
priors <- set_prior("normal(0, 5)", class = "b")

# Model formula
model_formula <- brmsformula(
  hunting_success | trials(4) ~ speed + space + ambush + latency +
    (1 | env_ID) + (1 | ID) + (1 | obs)
)

# Baseline model
system.time(
  base_model <- brm(
    formula = model_formula,
    family = binomial(link = "logit"),
    warmup = 3000, iter = 53000, thin = 50,
    chains = 4, inits = "0",
    threads = threading(10),
    backend = "cmdstanr",
    seed = 1234,
    prior = priors,
    save_pars = save_pars(all = TRUE),
    control = list(adapt_delta = 0.95),
    data = data
  )
)
The model runs without problems (and so do the others). I used within-chain parallelization because the models take a very long time to run (on a remote computer cluster with 40 cores). The posterior predictive checks did not seem to indicate a problem with the model fit. However, when I run the loo function on the model, I get this:
> loo1 <- brms::loo(base_model1, "loo")
Warning message:
Found 9925 observations with a pareto_k > 0.7 in model 'base_model1'. It is
recommended to set 'moment_match = TRUE' in order to perform moment matching
for problematic observations.

> loo1

Computed from 4000 by 70831 log-likelihood matrix

          Estimate    SE
elpd_loo -100031.1 132.2
p_loo      27778.4  86.1
looic     200062.2 264.4
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count   Pct.   Min. n_eff
(-Inf, 0.5]   (good)     28137   39.7%  186
 (0.5, 0.7]   (ok)       32769   46.3%  72
   (0.7, 1]   (bad)       9818   13.9%  16
   (1, Inf)   (very bad)    107   0.2%  6

See help('pareto-k-diagnostic') for details.
I understand that the output indicates that the model might be misspecified. The problem is when I try to use moment_match = TRUE in the loo function, it returns :
Error: Backend ‘rstan’ is required for this method.
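For reference, the call that triggers this error is just the loo call above with moment matching enabled (sketch):

```r
# Fails with "Backend 'rstan' is required for this method"
# because the model was fitted with backend = "cmdstanr"
loo1_mm <- brms::loo(base_model1, moment_match = TRUE)
```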
Does that mean I would have had to set backend = "rstan" in the brm call? But then I could not have used within-chain parallelization. I could not work out how the problem was fixed from the answer provided by Paul Buerkner in this post: https://discourse.mc-stan.org/t/difficulty-comparing-models-with-loo/17271/22
Unfortunately, I cannot provide the data to reproduce this example as it is private. The models were fitted with brms 2.15.0 under R 4.0.2 on CentOS Linux 7.
Lastly, from what I understood from that same post (https://discourse.mc-stan.org/t/difficulty-comparing-models-with-loo/17271/22) and from these case studies (https://avehtari.github.io/modelselection/rats_kcv.html, https://avehtari.github.io/modelselection/roaches.html), should I instead go with k-fold cross-validation? Or is having 14% of observations flagged as problematic perhaps not that serious?
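If k-fold CV is the way to go, I assume the call would look something like this (a sketch only; I have not run it, and grouping the folds by env_ID is my guess based on the rats example):

```r
# 10-fold CV; kfold() refits the model K times, so this is expensive.
# folds = "grouped" with group = "env_ID" keeps each environment's
# observations together in one fold (assumption on my part).
kfold10 <- kfold(base_model, K = 10, folds = "grouped", group = "env_ID")
```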
Thank you very much for your help.