Up-to-date advice on LOO and high diagnostic values

Alasdair_Clarke · May 16, 2024, 8:57am

Hello,

I’m after a relatively jargon-free guide for what to do about LOO warnings about pareto diagnostic values.

Sorry if this is all covered elsewhere (if so… I would very much appreciate a link!)

For my specific case, I have pretty good results overall:

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     2490  99.5%   472       
 (0.5, 0.7]   (ok)          9   0.4%   163       
   (0.7, 1]   (bad)         4   0.2%   94        
 (1, Inf)   (very bad)    0   0.0%   <NA>

I have extracted the pareto statistics and plotted my data with the high-pareto points highlights. I found this quite informative and I can see that they are outliers and as such will lead to problems.

Is there an agreed-upon way to proceed?

I have seen some talk of moment matching, although that appears to be a bit fiddly to implement when using cmdstanr?

Is it valid to remove these points as outliers and refit the model? This seems questionable. I am hoping to use LOO model comparison on three different models and I assume they will all have slightly different high pareto points.

I could try improving my model, but I don’t really have any good ideas about how to proceed in this case. Perhaps using a distribution with heavier tails might make these outlier points less influential. Or a mixture model that allows some small % chance of an outlier to be generated (this feels like a bit increase in complexity!)

Thanks

mhollanders · May 16, 2024, 11:13am

I’m no loo expert, but with 99.5% good Pareto k diagnostic values, I wouldn’t worry about those bad ones.

Alasdair_Clarke · May 16, 2024, 11:16am

This is very much my current approach.

is there anything I can cite to back this up?

avehtari · May 17, 2024, 7:29am

Have you seen Cross-validation FAQ? Happy to get feedback whether you think it’s a relatively jargon-full or -free.

Agreed-uopn way is to proceed with care

Only if you can verify that the values for these observations are clearly misrecorded and you think there is no need to model the process that sometimes causes erranous values. For example, if you would have data with human ages above 120 years, or human heights over 3m, you probably don’t have a good model for the process causing these or it is not interesting to model those.

If the values are rare but feasible, then it would be better to include them, and either improve the model or improve the LOO-CV computation.

It would help if you would show the complete LOO output, as you have now removed the information on how many posterior draws did you use for computation (and some other useful information). The new diagnostic recommendations in Pareto Smoothed Importance Sampling paper and implemented in the latest loo package version (and you have the older version now) don’t have anymore that ok range, and the threshold for bad depends on the number of posterior draws. If you did you have more than 2200 posterior draws, then the threshold is 0.7. If the 4 khat values are just slighlty larger than 0.7, you can try also sampling more posterior draws as the khat’s are estimated from finite number of draws, and the estimates can be just by chance over the threshold. Also if the elpd_loo doesn’t change when you increase the number of posterior draws, it is more likely that the estimates are stable. If the 4 khat values above 0.7 are close to 1, you need to worry more when comparing models with small difference in elpd_loo. If you show the results from the model comparison, I can also comment on what I think is small or big difference (discussed also in CV-FAQ).

If you tell what your current model and data are, we might be able to provide suggestions.

There is a reason to worry if those .5% values are close to 1. If those .5% values are close to 0.7 it is possible that the high values are just due to Monte Carlo error and it’s more likely that improved computation doesn’t change anything. There can’t be a simple rule and threshold for khat and percentage of bad values, and some care and thought is needed.

CV-FAQ and references mentioned there are the best information I know. I’m happy to get pointers to possible papers I have missed there.

Alasdair_Clarke · May 17, 2024, 9:06am

Hello and thank you for the excellent reply.

I think the issue of “too technical” depends very much on the target audience. I find myself often in a limbo land - my background is in mathematics, but my research is in psychology and most psychologists aren’t very technical. I am currently trying to tackle some reviewer comments about a methods paper that I’ve written. I really want to avoid over complicating the paper, as that will confuse the main message.

The main message: many experiments in psychology make use of repeated trials over time. Most people assume these are independent repetitions, but this is clearly not generally the case. If we use a relatively simple asymptotic regression model, we can provide a better overview of the data. Depending on the research question, the model’s estimated asymptote may well be a more theoretically interesting quantity than the empirical mean.

I have fit these models using Stan, but I assume I could use a frequentist method too.

The current version of my model is:

// multi-level asymptotic regression

functions{

  real compute_mu_asym(
    real t, real t0, int kf, int kr, int ll,
    real a, real b, real r, matrix u) {

    // this function computes mu.
    real mu;

    // used to store fixed + random effects
    real A, B, Rho; 

    A =             a + u[3*(kr-1) + 1, ll];
    B =             b + u[3*(kr-1) + 2, ll];
    Rho = inv_logit(r + u[3*(kr-1) + 3, ll]);

    mu =  A - (A - B) .* (1-Rho)^(t./t0);

    return(mu);

  }
}

data {
  int<lower=0> N; // total number of data points
  array[N] real t; // trial number
  array[N] real y; // dep. var. 
  // condition information
  int<lower = 1> Kf; // number of conditions (fixed)
  array[N] int<lower = 1, upper = Kf> Xf; // condition info (fixed)
  int<lower = 1> Kr; // number of conditions (random)
  array[N] int<lower = 1, upper = Kr> Xr; // condition info (random)
  int<lower = 0> J; // number of participants
  array[N] int<lower = 1, upper = J> L; // participant id
  // now input priors 
  real pr_a_mu;
  real<lower=0> pr_a_sd;
  real pr_b_mu;
  real<lower=0> pr_b_sd;
  real<lower=0> pr_r_sd;
  real<lower=0> pr_sig_lambda; // sd of residual variance
  real<lower=0> pr_sig_u_lambda; // sd of random effect variance
  // hyper param
  real<lower=0> t0;
}

parameters {
  
  // asymptote
  array[Kf] real a;
  // intial value
  array[Kf] real b;
  // rate of change
  array[Kf] real r;
  // standard deviation
  real<lower = 0> sigma;

  // random effects
  vector<lower=0>[3*Kr] sigma_u;
  // declare L_u to be the Choleski factor of a 3*Kx3*K correlation matrix
  cholesky_factor_corr[3*Kr] L_u;
  // random effect matrix
  matrix[3*Kr,J] z_u; 
}

transformed parameters{ 

  // fixed effect of rate of change (not actually used in model)
  array[Kf] real rho;
  rho = inv_logit(r);

  // this transform random effects so that they have the correlation
  // matrix specified by the correlation matrix above
  matrix[3*Kr, J] u;
  u = diag_pre_multiply(sigma_u, L_u) * z_u;

}

model {

  // define priors for fixed effects
  a ~ normal(pr_a_mu, pr_a_sd);
  b ~ normal(pr_b_mu, pr_b_sd);
  r ~ normal(0, pr_r_sd);
  
  // priors for random effects
  sigma_u ~ exponential(pr_sig_u_lambda);
  L_u ~ lkj_corr_cholesky(1.5); // LKJ prior for the correlation matrix
  to_vector(z_u) ~ normal(0,2);

  // residuals
  sigma ~ exponential(pr_sig_lambda);
  
  /////////////////////////
  // fit model
  /////////////////////////
  array[N] real mu; 
 
  for (ii in 1:N) {

    mu[ii] = compute_mu_asym(
      t[ii], t0, Xf[ii], Xr[ii], L[ii], 
      a[Xf[ii]], b[Xf[ii]], r[Xf[ii]], u);
  
  }

  y ~ normal(mu, sigma);

}

generated quantities {
  
  real pr_a = normal_rng(pr_a_mu, pr_a_sd);
  real pr_b = normal_rng(pr_b_mu, pr_b_sd);
  real pr_r = normal_rng(0, pr_r_sd);

  vector[N] log_lik;

  for (ii in 1:N) {

    real mu;

    mu = compute_mu_asym(
      t[ii], t0, Xf[ii], Xr[ii], L[ii], 
      a[Xf[ii]], b[Xf[ii]], r[Xf[ii]], u);
  
    log_lik[ii] = normal_lpdf(y[ii] | mu, sigma);
    
  } 
}

I have updated the loo package, and my output now looks like:

Computed from 4000 by 12511 log-likelihood matrix.

         Estimate    SE
elpd_loo  -2112.7 105.9
p_loo       308.2   8.6
looic      4225.4 211.7
------
MCSE of elpd_loo is NA.
MCSE and ESS estimates assume MCMC draws (r_eff in [0.2, 1.5]).

Pareto k diagnostic values:
                         Count Pct.    Min. ESS
(-Inf, 0.7]   (good)     12507 100.0%  69      
   (0.7, 1]   (bad)          4   0.0%  <NA>    
   (1, Inf)   (very bad)     0   0.0%  <NA>

The four values listed as bad are 0.824 0.857 0.706 0.789

In one of the examples I am looking at these four points are all the initial trial. I can understand the intuition as to why they are problematic for estimating loo… if we removed them from the data set then the estimate of the “initial value” parameter (b) would likely change quite a bit.

(in the plot, each column is data from a different person. I have picked out the 4 problematic points in blue. Note that there are 47 participants in total in the dataset,

Do you know of any good examples of published papers in which these methods are being used? It would be valuable to see how other researchers are communicating all these methods and details to their communities.

I’m happy to share data and more of my processing code if that would be helpful.

If possible, I would love to make this paper a good example of “best practice” [at least, as much as my abilities will allow me!

Alasdair_Clarke · May 21, 2024, 8:59am

Hello,

does anybody have any further advice?

I’m hoping to get this project wrapped up soon and everything appears to be so tantelising close to working!

From reading up on this some more, i could try moment-matching (I’ve had not had luck getting this to work with cmdstanr, but that is likely due to my own incompetance!).

Another option is to do 10-fold validation?

Thanks

(I am always impressed with how helpful this community is)

avehtari · May 22, 2024, 6:41pm

.706 could by chance due to Monte Carlo variation, others less likely

That’s a good analysis of the situation.

And nice plot. Based on these, if it’s not a problem that the first trial looks quite different then several following ones and that feature is not modelled well then, I think it is fine to continue with these 4, unless you are worried that they influence elpd, too much.

Can you clarify what you mean by “these methods”? I know many papers using PSIS-LOO, many mentioning Pareto k diagnostics, many using moment matching, reloo, or K-fold-CV to verify that high k’s are not a problem, but I don’t know papers discussing why proceeding with a few high k values is ok. Your explanation and plot would be excellent in a paper for justifying why these few high k values are not a problem, and that these all are initial trials which seem to behave differently anyway.

I think you are on a good track, and if you like I can have a quick look and provide comments before you submit.

Sorry to keep you in suspense while being busy with other things. I hope my answer didn’t come too late

Third option would be to remove these 4 observations, fit with everything else, and then compute the log predictive density for these 4 left out observations, and if the values would match with PSIS-LOO, then you would not need to worry. I think this would be probably the simplest option. And you could still tell about these 4 in the same way as above. See Holdout validation and K-fold cross-validation of Stan programs with the loo package • loo

Alasdair_Clarke · May 23, 2024, 8:17am

Thank you for taking the time to respond.

I will certainly send a copy of the paper to you to have a quick look over before we resubmit.

A little bit of background: Our paper is a methods paper aiming at a mainstream (cognitive) psychology audience. The point we are making isn’t particularly novel: if you carry out an experiment with multiple trials per person, then these trials aren’t really independent. In a lot of cases, people’s performance improves over time. If we model this improvement (using a simple asymptotic regression method), we should end up with more meaningful measurements compared to just taking the mean.

The peer reviews have been overall very positive. The main thing I need to add on the modelling side is model comparison as the reviewers would like some quantitative stats to back out our qualitative claims.

I was hoping that LOO would “just work” and I fear that a long diversion into technical matters will distract from the central message of the paper. Most psychologists still stick to ANOVA + p-values and will be lost if there is a lengthy section on psis, pareto k statistics, etc (and, as I’m sure it is clear, my own understanding of some of these issues is a little murky!)

Topic		Replies	Views
Loo_compare in the presence of high pareto-k brms loo	6	406	June 29, 2025
Interpret pareto k diagnostic Modeling rstan , fitting-issues , loo	3	1726	August 3, 2023
Good PP check and R square but large Pareto k values Modeling performance , loo	10	2161	September 2, 2020
Pareto k warns of misspecified model on model recovery when computing PSIS-LOO-CV Modeling loo	18	903	August 10, 2022
A quick note what I infer from p_loo and Pareto k values Modeling loo	35	15951	August 21, 2022

Up-to-date advice on LOO and high diagnostic values

Related topics