Problems with model comparison via loo and kfold

Used Packages

Brms 2.14.8
loo 2.4.1

I want to compare two ordinal probit models (cumulative model with probit link). I encountered errors for the more flexible one and unfortunately wasn’t able to solve them by myself, maybe some here having some thoughts. All the high pareto k’ seems to be values that could be seen as “outliers” or specifically Responses that vary from the average Responses. Assume Participant x - he always(!) responds with 6 when he sees a previous learned word and with 1 when he sees a new word (previous not learned) and if this Participant says to just one new word “2” then this appears to be a high pareto k. Due to having a memory experiment, I would definetly expect that some of the values are more variable.

We have N = 24 Participants, doing the experiment 4 times ( depending on time and condition)

The first model is a more restriced one. Assuming the latent variable havinga variance of 1 in all groups (disc = 1)

First Model Brms Syntax
evsdt <- brm(
  bf(Response ~ 1 + item*condition*time + (1 + item * condition * time | ID_T1T2 )),
  data = data_hypnomemory, family = cumulative("probit"),
  iter = 4000, inits = 0, save_pars = save_pars(all= TRUE),
  control = list(adapt_delta = 0.95), file = "evsdt_file"

evsdt_loo <- loo(evsdt, moment_match = TRUE)

Computed from 8000 by 3840 log-likelihood matrix

         Estimate    SE
elpd_loo  -2634.0  70.7
p_loo       127.9   6.6
looic      5268.0 141.4
Monte Carlo SE of elpd_loo is 0.2.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     3825  99.6%   1259      
 (0.5, 0.7]   (ok)         15   0.4%   550       
   (0.7, 1]   (bad)         0   0.0%   <NA>      
   (1, Inf)   (very bad)    0   0.0%   <NA>      

All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details

The second one allows the variance of the latent variable to vary between groups.

2. Model Brms Syntax
uvsdt <- brm(
  bf(Response ~ 1 + item * condition * time + (1 + item * condition * time | ID_T1T2 ),
     disc ~ 0 + old + old:condition2 + old:time2 + old:condition2:time2 + (0 + old + old:condition2 + old:time2 + old:condition2:time2 | ID_T1T2 )),
  data = data_hypnomemory, family = cumulative("probit"),
  iter = 4000, inits = 0, save_pars = save_pars(all = TRUE),
  control = list(adapt_delta = 0.99, max_treedepth = 15),cores =6,
  file = 'uvsdt_file'


uvsdt_loo <- loo(uvsdt,cores = 4)

We get the following Result:

Found 12 observations with a pareto_k > 0.7 in model 'uvsdt'. It is recommended to set 'moment_match = TRUE' in order to perform moment matching for problematic observations.

Computed from 8000 by 3840 log-likelihood matrix
Estimate    SE
elpd_loo  -2529.7  68.2
p_loo       157.8   8.3
looic      5059.4 136.5
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     3793  98.8%   646       
 (0.5, 0.7]   (ok)         35   0.9%   328       
   (0.7, 1]   (bad)        12   0.3%   63        
   (1, Inf)   (very bad)    0   0.0%   <NA>      
See help('pareto-k-diagnostic') for details.

uvsdt_loo <- loo(uvsdt,moment_match = TRUE)

I get the following Error

  Error in validate_ll(log_ratios) : All input values must be finite.
    Error: Moment matching failed. Perhaps you did not set 'save_pars = save_pars(all = TRUE)' when fitting your model?

I tried to use kfold with the model and then the following Error appeared:

[1] “Error in sampler$call_sampler(args_list[[i]]) : Initialization failed.”
[1] “error occurred during calling the sampler; sampling not done”
Start sampling
Error: The model does not contain posterior samples.

Maybe I can use the given information to decide which of the two models is better?
I have an hierarchical model so I dont know how many parameters I do have?

(Intercept[1;2;3;4;5]+ item+ condition + time + item:condition + condition:time + item:condition:time + disc_item + disc_item:condition + disc_item:time + disc_item:condition:time)*25? Because N = 24 and the population level effects?

Hopefully someone takes the time to read this and replies.
With kind regards,
Dominic :)

With the loo_compare I get the following result

       elpd_diff se_diff
uvsdt2    0.0       0.0 
evsdt  -103.7      15.6 

I have read your Papers @avehtari and it is difficult to interprete the result because of the pareto k warning of the uvsdt2 model. I would argue the uvsdt2 model is the one with better predicting performance and the high pareto k’s are due to outliers, I would not expect to be predicted well when left out.
I tested the pp_checks for the hole Y-values and the means of the certain conditions and for all the Subjects and all seemed like the model is not misspecified.
I tried to plot LOO-PIT but due to having to continous outcome varibale, i stopped analyzing further.

Both models are nested. The EVSDT has all disc-parameters of the latent variable fixed to 1.

I look forward to your ideas and implications.


even though I believe that you will find that uvsdt2 will be the better model, relatively speaking, it’s still annoying that you can’t run with moment_match = TRUE. Could it be that you are stumbling over the same error as here?


Yeah I had also read the post and thought the same! I tried to refit the model several times, but moment_match = TRUE was unfortunately still only available for the evsdt model.

Perhaps @paul.buerkner can weigh in here :)

1 Like

Can you try reloo = TRUE instead of moment_match? It will require 12 refits but will allow for an accurate loo_compare to be done. (Just to note that I have not managed to get moment_match in brms to work on any model I have tested it on so far, so I always use reloo, if needed).


Since the difference is that big compared to se_diff, it is very unlikely that the error due to 12 most influential observations (that are too difficult for importance sampling as indicated by khat) would affect the order of the comparison. Although I think here the difference is likely to be big enough, unfortunately there is no generic rule how big the error can be for those pointwise elpds that have high khats.


I tried it once to use reloo = TRUE unfortunately it didn’t changed the high k-values.
Tried it again now with the same result, maybe the influential observation can’t be resolved.

Many thanks for your answers so far!

If you have performed reloo, you can trust the results of loo_compare. The high pareto_k values are still reported but they no longer affect the leave-one-out cross-validated comparison.


@andymilne I did not know that! Thank you for the hint. I will research on this feature!

From the brms help:

Warnings about Pareto k estimates indicate observations for which the approximation to LOO is problematic (this is described in detail in Vehtari, Gelman, and Gabry (2017) and the loo package documentation). If there are J observations with k estimates above k_threshold , then reloo will refit the original model J times, each time leaving out one of the J problematic observations. The pointwise contributions of these observations to the total ELPD are then computed directly and substituted for the previous estimates from these J observations that are stored in the original loo object.