I am comparing two models with slightly different parameterizations using loo and am wondering how to interpret the results when the model with higher ELPD also has more elevated Pareto k values.
Output of model 'Model_1':
Computed from 4000 by 6299 log-likelihood matrix
Estimate SE
elpd_loo -14585.8 84.9
p_loo 2215.2 29.9
looic 29171.6 169.8
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 6231 98.9% 41
(0.5, 0.7] (ok) 63 1.0% 27
(0.7, 1] (bad) 5 0.1% 30
(1, Inf) (very bad) 0 0.0% <NA>
See help('pareto-k-diagnostic') for details.
Output of model 'Model_2':
Computed from 4000 by 6299 log-likelihood matrix
Estimate SE
elpd_loo -14131.4 87.7
p_loo 2740.7 32.6
looic 28262.8 175.5
------
Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(-Inf, 0.5] (good) 6113 97.0% 69
(0.5, 0.7] (ok) 171 2.7% 33
(0.7, 1] (bad) 14 0.2% 25
(1, Inf) (very bad) 1 0.0% 23
See help('pareto-k-diagnostic') for details.
Model comparisons:
elpd_diff se_diff
Model_2 0.0 0.0
Model_1 -454.4 39.7
Warning messages:
1: Found 5 observations with a pareto_k > 0.7 in model 'Model_1'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for problematic observations.
2: Found 15 observations with a pareto_k > 0.7 in model 'Model_2'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for problematic observations.
Does this suggest both models are mis-specified and neither should be used? Or is Model_2 still a better fit despite the high k’s?
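The warnings above point to the moment_match argument; a minimal sketch of acting on them, assuming the models were fit with brms (fit1 and fit2 are hypothetical names for the two fits, and moment matching in brms generally requires the models to have been fitted with save_pars = save_pars(all = TRUE)):

```r
library(brms)

# Re-run loo with moment matching so the importance-sampling approximation
# is corrected for the observations flagged with pareto_k > 0.7.
loo1 <- loo(fit1, moment_match = TRUE)  # fit1, fit2 are hypothetical fit objects
loo2 <- loo(fit2, moment_match = TRUE)

# Repeat the comparison with the corrected loo objects.
loo_compare(loo1, loo2)
```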
@avehtari could provide a better answer than I, but in general high Pareto k values in either model mean that you shouldn’t trust LOO’s model comparison. But it is possible for either or both models to be properly specified despite high Pareto k values. You might find this post helpful:
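If it helps, the loo package also has helpers for locating the problematic observations; a small sketch, reusing the hypothetical loo1 object from above:

```r
library(loo)

# Indices of observations with Pareto k above 0.7 in Model_1.
bad_idx <- pareto_k_ids(loo1, threshold = 0.7)
bad_idx

# Their k values, e.g. to inspect which data points are hard to predict.
pareto_k_values(loo1)[bad_idx]
```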
I don’t know tidybayes well enough to answer authoritatively. If you are saving any transformed parameters or generated quantities in your model, make sure not to count them.
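As a rough sketch of what I mean, assuming a brms or rstan fit and using tidybayes::get_variables (the exclusion pattern below is only illustrative; adapt it to whatever transformed parameters or generated quantities you actually save):

```r
library(tidybayes)

vars <- get_variables(fit1)  # all saved variable names (fit1 is hypothetical)

# Drop Stan internals and anything that is a generated quantity or
# transformed parameter rather than a sampled parameter.
params <- vars[!grepl("^(lp__|log_lik|y_rep)", vars)]
length(params)
```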
Without yet knowing the number of parameters, just from the fact that p_loo is 35-43% of the number of observations we can infer that the models are likely very flexible. If p_loo is higher than the number of parameters, then the model is certainly misspecified; but if p_loo is lower than the number of parameters, I would expect that both models are simply flexible, which could mean, e.g., a hierarchical model with group-specific parameters but not many observations for some groups. If p_loo is less than the number of parameters and other model-checking diagnostics don't indicate bad misspecification, then the difference between the models is so big that you can say Model_2 has better predictive performance.
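As a concrete illustration of that comparison, p_loo can be read off the loo object's estimates matrix and set against the parameter count (loo1 and n_params are hypothetical placeholders here):

```r
# p_loo for Model_1, from the estimates matrix printed above.
p_loo_1 <- loo1$estimates["p_loo", "Estimate"]

n_params <- length(params)  # parameter count from the sketch above (hypothetical)

# p_loo well below the number of parameters is consistent with a flexible
# hierarchical model; p_loo above it would point to misspecification.
p_loo_1 < n_params
```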
Using the output from get_variables, it looks like the models are estimating ~31,000 parameters.
They are hierarchical models of 250 subjects measured over time, with several categorical and continuous predictors per side (left and right), with varying slopes and intercepts as well as the nu and sigma parameters of a student_t distribution.
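For concreteness, a hypothetical brms specification along those lines might look like the sketch below (all variable names are invented and the real models may differ substantially):

```r
library(brms)

# Varying intercepts and slopes by subject, with the distributional
# parameters sigma and nu of the Student-t likelihood also modelled.
f <- bf(
  outcome ~ time * side + group + (1 + time | subject),
  sigma   ~ side + (1 | subject),
  nu      ~ 1
)

fit <- brm(f, data = dat, family = student(), chains = 4, cores = 4)  # dat is hypothetical
```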