Up-to-date advice on LOO and high diagnostic values


I’m after a relatively jargon-free guide for what to do about LOO warnings about pareto diagnostic values.

Sorry if this is all covered elsewhere (if so… I would very much appreciate a link!)

For my specific case, I have pretty good results overall:

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     2490  99.5%   472       
 (0.5, 0.7]   (ok)          9   0.4%   163       
   (0.7, 1]   (bad)         4   0.2%   94        
 (1, Inf)   (very bad)    0   0.0%   <NA>

I have extracted the pareto statistics and plotted my data with the high-pareto points highlights. I found this quite informative and I can see that they are outliers and as such will lead to problems.

Is there an agreed-upon way to proceed?

I have seen some talk of moment matching, although that appears to be a bit fiddly to implement when using cmdstanr?

Is it valid to remove these points as outliers and refit the model? This seems questionable. I am hoping to use LOO model comparison on three different models and I assume they will all have slightly different high pareto points.

I could try improving my model, but I don’t really have any good ideas about how to proceed in this case. Perhaps using a distribution with heavier tails might make these outlier points less influential. Or a mixture model that allows some small % chance of an outlier to be generated (this feels like a bit increase in complexity!)


I’m no loo expert, but with 99.5% good Pareto k diagnostic values, I wouldn’t worry about those bad ones.

This is very much my current approach.

is there anything I can cite to back this up?

Have you seen Cross-validation FAQ? Happy to get feedback whether you think it’s a relatively jargon-full or -free.

Agreed-uopn way is to proceed with care

Only if you can verify that the values for these observations are clearly misrecorded and you think there is no need to model the process that sometimes causes erranous values. For example, if you would have data with human ages above 120 years, or human heights over 3m, you probably don’t have a good model for the process causing these or it is not interesting to model those.

If the values are rare but feasible, then it would be better to include them, and either improve the model or improve the LOO-CV computation.

It would help if you would show the complete LOO output, as you have now removed the information on how many posterior draws did you use for computation (and some other useful information). The new diagnostic recommendations in Pareto Smoothed Importance Sampling paper and implemented in the latest loo package version (and you have the older version now) don’t have anymore that ok range, and the threshold for bad depends on the number of posterior draws. If you did you have more than 2200 posterior draws, then the threshold is 0.7. If the 4 khat values are just slighlty larger than 0.7, you can try also sampling more posterior draws as the khat’s are estimated from finite number of draws, and the estimates can be just by chance over the threshold. Also if the elpd_loo doesn’t change when you increase the number of posterior draws, it is more likely that the estimates are stable. If the 4 khat values above 0.7 are close to 1, you need to worry more when comparing models with small difference in elpd_loo. If you show the results from the model comparison, I can also comment on what I think is small or big difference (discussed also in CV-FAQ).

If you tell what your current model and data are, we might be able to provide suggestions.

There is a reason to worry if those .5% values are close to 1. If those .5% values are close to 0.7 it is possible that the high values are just due to Monte Carlo error and it’s more likely that improved computation doesn’t change anything. There can’t be a simple rule and threshold for khat and percentage of bad values, and some care and thought is needed.

CV-FAQ and references mentioned there are the best information I know. I’m happy to get pointers to possible papers I have missed there.

Hello and thank you for the excellent reply.

I think the issue of “too technical” depends very much on the target audience. I find myself often in a limbo land - my background is in mathematics, but my research is in psychology and most psychologists aren’t very technical. I am currently trying to tackle some reviewer comments about a methods paper that I’ve written. I really want to avoid over complicating the paper, as that will confuse the main message.

The main message: many experiments in psychology make use of repeated trials over time. Most people assume these are independent repetitions, but this is clearly not generally the case. If we use a relatively simple asymptotic regression model, we can provide a better overview of the data. Depending on the research question, the model’s estimated asymptote may well be a more theoretically interesting quantity than the empirical mean.

I have fit these models using Stan, but I assume I could use a frequentist method too.

The current version of my model is:

// multi-level asymptotic regression


  real compute_mu_asym(
    real t, real t0, int kf, int kr, int ll,
    real a, real b, real r, matrix u) {

    // this function computes mu.
    real mu;

    // used to store fixed + random effects
    real A, B, Rho; 

    A =             a + u[3*(kr-1) + 1, ll];
    B =             b + u[3*(kr-1) + 2, ll];
    Rho = inv_logit(r + u[3*(kr-1) + 3, ll]);

    mu =  A - (A - B) .* (1-Rho)^(t./t0);



data {
  int<lower=0> N; // total number of data points
  array[N] real t; // trial number
  array[N] real y; // dep. var. 
  // condition information
  int<lower = 1> Kf; // number of conditions (fixed)
  array[N] int<lower = 1, upper = Kf> Xf; // condition info (fixed)
  int<lower = 1> Kr; // number of conditions (random)
  array[N] int<lower = 1, upper = Kr> Xr; // condition info (random)
  int<lower = 0> J; // number of participants
  array[N] int<lower = 1, upper = J> L; // participant id
  // now input priors 
  real pr_a_mu;
  real<lower=0> pr_a_sd;
  real pr_b_mu;
  real<lower=0> pr_b_sd;
  real<lower=0> pr_r_sd;
  real<lower=0> pr_sig_lambda; // sd of residual variance
  real<lower=0> pr_sig_u_lambda; // sd of random effect variance
  // hyper param
  real<lower=0> t0;

parameters {
  // asymptote
  array[Kf] real a;
  // intial value
  array[Kf] real b;
  // rate of change
  array[Kf] real r;
  // standard deviation
  real<lower = 0> sigma;

  // random effects
  vector<lower=0>[3*Kr] sigma_u;
  // declare L_u to be the Choleski factor of a 3*Kx3*K correlation matrix
  cholesky_factor_corr[3*Kr] L_u;
  // random effect matrix
  matrix[3*Kr,J] z_u; 

transformed parameters{ 

  // fixed effect of rate of change (not actually used in model)
  array[Kf] real rho;
  rho = inv_logit(r);

  // this transform random effects so that they have the correlation
  // matrix specified by the correlation matrix above
  matrix[3*Kr, J] u;
  u = diag_pre_multiply(sigma_u, L_u) * z_u;


model {

  // define priors for fixed effects
  a ~ normal(pr_a_mu, pr_a_sd);
  b ~ normal(pr_b_mu, pr_b_sd);
  r ~ normal(0, pr_r_sd);
  // priors for random effects
  sigma_u ~ exponential(pr_sig_u_lambda);
  L_u ~ lkj_corr_cholesky(1.5); // LKJ prior for the correlation matrix
  to_vector(z_u) ~ normal(0,2);

  // residuals
  sigma ~ exponential(pr_sig_lambda);
  // fit model
  array[N] real mu; 
  for (ii in 1:N) {

    mu[ii] = compute_mu_asym(
      t[ii], t0, Xf[ii], Xr[ii], L[ii], 
      a[Xf[ii]], b[Xf[ii]], r[Xf[ii]], u);

  y ~ normal(mu, sigma);


generated quantities {
  real pr_a = normal_rng(pr_a_mu, pr_a_sd);
  real pr_b = normal_rng(pr_b_mu, pr_b_sd);
  real pr_r = normal_rng(0, pr_r_sd);

  vector[N] log_lik;

  for (ii in 1:N) {

    real mu;

    mu = compute_mu_asym(
      t[ii], t0, Xf[ii], Xr[ii], L[ii], 
      a[Xf[ii]], b[Xf[ii]], r[Xf[ii]], u);
    log_lik[ii] = normal_lpdf(y[ii] | mu, sigma);

I have updated the loo package, and my output now looks like:

Computed from 4000 by 12511 log-likelihood matrix.

         Estimate    SE
elpd_loo  -2112.7 105.9
p_loo       308.2   8.6
looic      4225.4 211.7
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.2, 1.5]).

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     12507 100.0%  69      
   (0.7, 1]   (bad)          4   0.0%  <NA>    
   (1, Inf)   (very bad)     0   0.0%  <NA>    

The four values listed as bad are 0.824 0.857 0.706 0.789

In one of the examples I am looking at these four points are all the initial trial. I can understand the intuition as to why they are problematic for estimating loo… if we removed them from the data set then the estimate of the “initial value” parameter (b) would likely change quite a bit.

(in the plot, each column is data from a different person. I have picked out the 4 problematic points in blue. Note that there are 47 participants in total in the dataset,

Do you know of any good examples of published papers in which these methods are being used? It would be valuable to see how other researchers are communicating all these methods and details to their communities.

I’m happy to share data and more of my processing code if that would be helpful.

If possible, I would love to make this paper a good example of “best practice” [at least, as much as my abilities will allow me!