Interpret Pareto k diagnostic

I am new to Stan. I made 3 different Stan models, fit each to the same dataset of 1380 observations, and ranked them using loo::loo_compare(), but I am struggling with some of the diagnostics.

For two of the models I get a small percentage of Pareto k values that are not good (bad or very bad). I have read other posts on the matter and some documentation (LOO package glossary — loo-glossary • loo), but the interpretation seems very context/model specific from what I have seen/understood.
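
For reference, a minimal sketch of how the flagged observations can be listed (assuming loo_1, loo_2 and loo_3 are the objects returned by loo::loo() for each model, and using the usual 0.7 threshold):

library(loo)

# Counts and proportions of good / ok / bad / very bad k values
pareto_k_table(loo_2)

# Indices of the observations whose Pareto k exceeds 0.7
high_k_idx <- pareto_k_ids(loo_2, threshold = 0.7)
high_k_idx

# All k values plotted against observation index
plot(loo_2, diagnostic = "k")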

Can the models with bad/very bad Pareto k values still be used and compared to other models, or do they absolutely need to be changed?

All three models are basically the same model with increasing complexity (adding some parameters).

Model information:

Model 1 (2 parameters); loo output: loo_1

Model 2 (157 parameters); loo output: loo_2

Model 3 (311 parameters); loo output: loo_3

Here is the Stan code for model 3 as an example:

data { 
  // Size integer
  int<lower = 1> n;
  int<lower = 1> n_unique; // 154 unique groups
  // Vector data
  vector[n] bf;
  vector[n] bp;
  vector[n] ap;
  int unique_id[n];
}

transformed data {
  vector[n] log_bf = log(bf);
  vector[n] log_bp = log(bp);
  vector[n] log_ap = log(ap);
}

parameters {
  real mu_alp;  // 1 parameter
  vector[n_unique] alp; // 154 unique alp, 154 parameters
  real<lower = 0> sd_alp; // 1 parameter
  real<lower = 0> sigma;  // 1 parameter
  vector[n_unique] h_j; // 154 unique h_j, 154 parameters
  // total of 311 parameters
}

model {
  vector[n] mu;
  vector[n] unique_factor;

  // Priors:
  mu_alp ~ normal(-5,2);
  alp ~ normal(mu_alp, sd_alp);
  sd_alp ~ exponential(3);
  sigma ~ exponential(5);
  h_j ~ normal(-4,2);

  // Computing 

   unique_factor = alp[unique_id] + log_ap;
   
   mu = unique_factor + log_bp - log1p_exp(h_j[unique_id] + unique_factor + log_bp);

  log_bf ~ normal(mu, sigma);

}

generated quantities {
    vector[n] unique_factor;
    vector[n] log_bf_hat;
    vector[n] log_lik;
    vector[n] y_rep;
    real<lower = 0, upper = 1> Rsq_3;


  unique_factor = alp[unique_id] + log_ap;

  log_bf_hat = unique_factor + log_bp - log1p_exp(h_j[unique_id] + unique_factor + log_bp);
      
  for (i in 1:n) {

    log_lik[i] = normal_lpdf(log_bf[i] | log_bf_hat[i], sigma);

    y_rep[i] = normal_rng(log_bf_hat[i], sigma);

  }

  Rsq_3 = variance(log_bf_hat) / (variance(log_bf_hat) + square(sigma));

}
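
In case it matters, here is a sketch of how the log_lik matrix above is turned into the loo objects that go into loo_compare() (this assumes the models were fit with rstan and the fit for model 3 is called fit_3; the other two are analogous):

library(loo)

# Pointwise log-likelihood from the generated quantities block,
# kept as an iterations x chains x observations array
log_lik_3 <- extract_log_lik(fit_3, parameter_name = "log_lik", merge_chains = FALSE)

# Relative effective sample sizes, used by PSIS for the k diagnostics
r_eff_3 <- relative_eff(exp(log_lik_3))

loo_3 <- loo(log_lik_3, r_eff = r_eff_3)
print(loo_3)

# After doing the same for models 1 and 2:
loo_compare(loo_1, loo_2, loo_3)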

@avehtari just posted his Cross-validation FAQ, which contains a section on what to do if you have too many high Pareto k values.

The problem is that the approximation behind LOO is poor when the Pareto k values are high. The FAQ gives some possible actions, but you can always just do the cross-validation manually (though brute-force leave-one-out is usually prohibitively expensive).

Thank you for your reply @Bob_Carpenter !

I had seen the recommendations on what to do when there are too many high Pareto k values in other documentation, and the advice to look into p_loo for further information on the underlying problem. I had not seen the loo moment matching though, and am now looking into it!
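
If it helps anyone else reading this, here is a sketch of what I understand the call to look like (assuming an rstan fit object fit_3; if I understand the docs correctly, rstan's loo() method has a moment_match argument so the problematic observations are corrected without refitting):

# PSIS-LOO recomputed with moment matching for the high Pareto k observations
loo_3_mm <- rstan::loo(fit_3, pars = "log_lik", moment_match = TRUE)

# Check whether the bad / very bad counts have gone down
loo::pareto_k_table(loo_3_mm)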

Something else I have not found is how many high Pareto k values is too many. Is there any rule of thumb, perhaps something related to the number of observations? Is even one high Pareto k value too many, meaning the model needs reworking?

I can’t seem to find anything regarding an actual threshold for the number of high Pareto k values.

Thank you again!

As often, the answer is that it depends. As discussed in the FAQ, you may get high Pareto k values even when the model is correct; in that case the model doesn’t need reworking, but a different CV computation may be needed (as in the Roaches cross-validation demo). Also, since high Pareto k values indicate over-optimism, if a model with many high k values is already much worse than another model, there is no need to improve the computation for the worse model (e.g. the binomial model has many high k values, but it is so much worse, and also rejected by PPC, that there is no need to improve its CV computation). When high k values are due to a misspecified model or outliers, it depends on the use case how important it is to improve the model, as sometimes the conclusions would not change with a better model. If in doubt, just use K-fold-CV.
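
For reference, a rough sketch of what a manual K-fold-CV could look like with the loo helpers (the fold assignment keeps all observations from the same unique_id in the same fold; stan_model_3, make_training_data() and the "log_lik_heldout" generated quantity are hypothetical placeholders for the refitting step):

library(loo)
library(rstan)

K <- 10
n_draws <- 4000   # total posterior draws per refit (assumption)

# Assign the 154 groups to K folds so that a whole group is held out at once
fold <- kfold_split_grouped(K = K, x = unique_id)

log_lik_heldout <- matrix(NA, nrow = n_draws, ncol = n)

for (k in 1:K) {
  holdout <- fold == k
  # Refit on the training folds only (data preparation not shown), with a
  # generated quantities block that computes log_lik for the held-out rows
  fit_k <- sampling(stan_model_3, data = make_training_data(!holdout))
  log_lik_heldout[, holdout] <- extract_log_lik(fit_k, "log_lik_heldout")
}

# Pointwise K-fold elpd, which can then be compared with the PSIS-LOO results
elpd_kfold <- elpd(log_lik_heldout)
print(elpd_kfold)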