Hi,
I fitted four different hierarchical models to a large dataset (~70 000 observations) and would like to compare their performance using PSIS-LOO. I started with a baseline model and added complexity in the other three (polynomial terms + interactions).
Here are some specifications of the data:
- env_ID = factor with 27 levels (environments)
- ID = factor with 2171 levels (individuals)
Here is the structure of the first model. I want to investigate the relationship between different predator hunting behaviours and hunting success.
# Set priors
priors <- set_prior("normal(0, 5)", class = "b")
# Model formula
model_formula <- brmsformula(hunting_success | trials(4) ~
                             speed +
                             space +
                             ambush +
                             latency +
                             (1 | env_ID) +
                             (1 | ID) +
                             (1 | obs))
# Baseline model
system.time(base_model <- brm(formula = model_formula,
family = binomial(link = "logit"),
warmup = 3000,
iter = 53000,
thin = 50,
chains = 4,
inits = "0",
threads = threading(10),
backend = "cmdstanr",
seed = 1234,
prior = priors,
save_pars = save_pars(all = TRUE),
control = list(adapt_delta = 0.95),
data = data))
The model runs without problems (as do the other three). I used within-chain parallelization because the models take a very long time to run (on a remote computer cluster with 40 cores). The posterior predictive checks did not indicate a problem with the model fit. However, when I run the loo function on the model, I get this:
> loo1 <- brms::loo(base_model)
Warning message:
Found 9925 observations with a pareto_k > 0.7 in model 'base_model'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for problematic observations.
> loo1
Computed from 4000 by 70831 log-likelihood matrix
Estimate SE
elpd_loo -100031.1 132.2
p_loo 27778.4 86.1
looic 200062.2 264.4
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 28137 39.7% 186
(0.5, 0.7] (ok) 32769 46.3% 72
(0.7, 1] (bad) 9818 13.9% 16
(1, Inf) (very bad) 107 0.2% 6
See help('pareto-k-diagnostic') for details.
I understand that the output indicates the model might be misspecified. The problem is that when I set moment_match = TRUE in the loo function, it returns:
Error: Backend ‘rstan’ is required for this method.
Does this mean I would have had to set backend = "rstan" in the brm call? But then I could not have used within-chain parallelization. I could not work out how the problem was fixed from the answer Paul Buerkner provided in this post: https://discourse.mc-stan.org/t/difficulty-comparing-models-with-loo/17271/22
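In case it helps to be concrete, this is the workaround I was considering (a hedged sketch only; I have not verified that update() accepts a backend switch in brms 2.15.0, and the model name here is just my baseline model from above):

```r
# Sketch: refit with the rstan backend solely for the moment-matching step.
# Threading would not be available here, so this refit will be slow.
base_model_rstan <- update(base_model,
                           backend = "rstan",
                           recompile = TRUE,
                           cores = 4)

# Moment matching needs all parameters saved, which I already did via
# save_pars(all = TRUE) in the original brm() call.
loo1_mm <- brms::loo(base_model_rstan, moment_match = TRUE)
```

Is something along these lines what was meant in that thread, or is there a way to do moment matching without leaving the cmdstanr backend?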
Unfortunately, I cannot provide the data to reproduce this example as it is private. The models were fitted with brms 2.15.0 under R 4.0.2 on CentOS Linux 7.
Lastly, from what I understood reading this post https://discourse.mc-stan.org/t/difficulty-comparing-models-with-loo/17271/22 and these case studies (https://avehtari.github.io/modelselection/rats_kcv.html, https://avehtari.github.io/modelselection/roaches.html), should I go with k-fold cross-validation instead? Or is 14% of observations being problematic perhaps not that serious?
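If k-fold CV is the way to go, this is roughly what I had in mind (a sketch with placeholder settings; K, the grouping variable, and chains are my guesses, not something from the case studies):

```r
# Sketch: 10-fold CV with folds grouped by individual, so that the same ID
# never appears in both the training and the held-out set.
kf_base <- brms::kfold(base_model,
                       K = 10,
                       folds = "grouped",
                       group = "ID",
                       chains = 2)

# Comparison would then use the kfold objects in place of loo objects,
# e.g. against a second model's k-fold result (kf_model2, not shown here):
# loo::loo_compare(kf_base, kf_model2)
```

Would grouping by ID be the right choice here, or should the folds be grouped by env_ID instead?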
Thank you very much for your help.