Pareto k values versus ELPD differences

I am comparing two models with slightly different parameterizations using loo and am wondering how to interpret the results when the model with higher ELPD also has more elevated Pareto k values.

Output of model 'Model_1':

Computed from 4000 by 6299 log-likelihood matrix

         Estimate    SE
elpd_loo -14585.8  84.9
p_loo      2215.2  29.9
looic     29171.6 169.8
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     6231  98.9%   41        
 (0.5, 0.7]   (ok)         63   1.0%   27        
   (0.7, 1]   (bad)         5   0.1%   30        
   (1, Inf)   (very bad)    0   0.0%   <NA>      
See help('pareto-k-diagnostic') for details.

Output of model 'Model_2':

Computed from 4000 by 6299 log-likelihood matrix

         Estimate    SE
elpd_loo -14131.4  87.7
p_loo      2740.7  32.6
looic     28262.8 175.5
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     6113  97.0%   69        
 (0.5, 0.7]   (ok)        171   2.7%   33        
   (0.7, 1]   (bad)        14   0.2%   25        
   (1, Inf)   (very bad)    1   0.0%   23        
See help('pareto-k-diagnostic') for details.

Model comparisons:
        elpd_diff se_diff
Model_2    0.0       0.0
Model_1 -454.4      39.7
Warning messages:
1: Found 5 observations with a pareto_k > 0.7 in model 'Model_1'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for problematic observations.  
2: Found 15 observations with a pareto_k > 0.7 in model 'Model_2'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for problematic observations. 
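
A minimal sketch of how the warnings' suggestion could be applied, assuming these are brmsfit objects (the warning wording suggests brms) named as in the comparison table above:

loo_1 <- loo(Model_1, moment_match = TRUE)  # re-run PSIS-LOO, moment matching the flagged observations
loo_2 <- loo(Model_2, moment_match = TRUE)
loo_compare(loo_1, loo_2)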

Does this suggest both models are mis-specified and neither should be used? Or is Model_2 still a better fit despite the high k’s?

@avehtari could provide a better answer than I, but in general high Pareto k values in either model mean that you shouldn’t trust LOO’s model comparison. But it is possible for either or both models to be properly specified despite high Pareto k values. You might find this post helpful:

Thank you!

Is ‘length(get_variables(model))’ (from tidybayes) a reasonable approach to estimating the number of parameters?

I don’t know tidybayes well enough to answer authoritatively. If you are saving any transformed parameters or generated quantities in your model, make sure not to count them.
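
A rough sketch of that kind of count (the name patterns excluded here are only examples; which ones apply depends on what your model actually saves):

library(tidybayes)
vars <- get_variables(model)                                # all variable names stored in the draws
param_vars <- vars[!grepl("^(log_lik|y_rep|lp__)", vars)]   # drop generated quantities etc.; patterns are illustrative
length(param_vars)                                          # rough parameter count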

This post is also now part of the loo documentation: LOO package glossary — loo-glossary • loo

Without yet knowing the number of parameters, just from knowing that p_loo is 35-43% of the number of observations, we can infer that the models are likely quite flexible. If p_loo is higher than the number of parameters, then the model is certainly misspecified; but if p_loo is lower than the number of parameters, I would expect that both models are just flexible, which could mean, e.g., a hierarchical model with group-specific parameters but not many observations for some groups. If p_loo is less than the number of parameters and other model-checking diagnostics don't indicate bad misspecification, then the difference between the models is so big that you can say that Model_2 has better predictive performance.
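
For reference, plugging in the numbers from the output above:

c(Model_1 = 2215.2, Model_2 = 2740.7) / 6299   # p_loo as a fraction of n: roughly 0.35 and 0.44
454.4 / 39.7                                   # elpd_diff in units of se_diff: roughly 11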

Check also my post from yesterday: Model selection of nonlinear flexible hierarchical model with loo - #2 by avehtari

Thank you both, @avehtari and @jsocolar!

Using the output from get_variables, it looks like the models are estimating ~31,000 parameters.

They are hierarchical models of 250 subjects over time, with several categorical and continuous predictors per side (left and right), with varying slopes and intercepts as well as nu and sigma parameters of a student_t distribution.
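
Roughly this shape in brms, as a purely illustrative sketch (the predictor and variable names here are hypothetical, not the actual formula):

library(brms)
form <- bf(
  y ~ time * side + cat_pred + cont_pred + (1 + time | subject),  # varying intercepts and slopes per subject
  sigma ~ (1 | subject),  # subject-specific scale of the student_t likelihood
  nu ~ 1                  # degrees of freedom of the student_t likelihood
)
# fit <- brm(form, data = dat, family = student())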