A quick note on what I infer from p_loo and Pareto k values

This is a quick note on the different reasons why Pareto k values can be large. I'll try to extend this into a longer explanation and a case study. Any feedback on this draft is appreciated.

  • If all Pareto k values are small, the model is likely to be ok (although there can still be better models)
  • If there are high Pareto k values
    • If p_loo << the number of parameters p, then the model is likely to be misspecified. PPC is likely to detect the problem, too. Try using an overdispersed model, or add more structural information (nonlinearity, mixture model, etc.).
    • If p_loo > the number of parameters p, then the model is likely to be badly misspecified. If the number of parameters p << n, then PPC is likely to detect the problem, too. Case example: https://rawgit.com/avehtari/modelselection_tutorial/master/roaches.html
    • If p_loo > the number of parameters p, then the model is likely to be badly misspecified. If the number of parameters p is relatively large compared to the number of observations, p > n/5 (more accurately we should count the number of observations influencing each parameter, as in hierarchical models some groups may have small n and some groups large n), it is possible that PPC doesn't detect the problem. Case example: Recommendations for what to do when k exceeds 0.5 in the loo package?
    • If p_loo < the number of parameters p and the number of parameters p is relatively large compared to the number of observations, p > n/5, it is likely that the model is so flexible or the population prior is so weak that it's difficult to predict the left-out observation even if the model is the true one. Case examples are the simulated 8 schools in https://arxiv.org/abs/1507.04544, and Gaussian processes and spatial models with short correlation lengths.
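To make the quantities in the bullets above concrete, here is a minimal numpy sketch (Python, not the loo package itself) of how p_loo is defined from a pointwise log-likelihood matrix: p_loo = lpd − elpd_loo. It uses raw self-normalized importance ratios; the real loo package additionally Pareto-smooths the weights (PSIS), which this sketch omits.

```python
import numpy as np

def logsumexp(x, axis=0):
    # numerically stable log(sum(exp(x)))
    m = np.max(x, axis=axis)
    return m + np.log(np.sum(np.exp(x - m), axis=axis))

def naive_loo(log_lik):
    """log_lik: (S, n) array of pointwise log-likelihood draws
    (S posterior draws, n observations)."""
    S = log_lik.shape[0]
    # lpd: log pointwise predictive density under the full posterior
    lpd = np.sum(logsumexp(log_lik, axis=0) - np.log(S))
    # raw log importance ratios for leave-one-out: proportional to 1 / p(y_i | theta_s)
    log_r = -log_lik
    log_w = log_r - logsumexp(log_r, axis=0)   # self-normalized log weights
    # elpd_loo: importance-weighted leave-one-out predictive density
    elpd_loo = np.sum(logsumexp(log_w + log_lik, axis=0))
    p_loo = lpd - elpd_loo                     # effective number of parameters
    return elpd_loo, p_loo
```

Comparing p_loo from this definition against the counted number of parameters is what the heuristics above refer to; in practice use loo::loo() in R, since the unsmoothed weights in this sketch are exactly what fails when Pareto k is large.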


A quick question: How do you define the number of parameters in hierarchical models? Does the number of parameters in this setting depend on whether the data situation is sparse (very few parameters due to pooling) or dense (many parameters)?

It would be nice to discuss this bit specifically in the case study, given how ubiquitously hierarchical models are used.


There are infinitely many different hierarchical models, so I don’t have a generic answer which would work always.

For most hierarchical models with high Pareto k values, except for the last point above about flexible, well-specified models, you can start by counting all the parameters in the parameters block, and then refine if necessary.

For the more refined flexibility or weak prior case, different parameters can be influenced by different sets of observations, directly through the likelihood or indirectly through the joint prior. This would really require a longer explanation, but here are two short examples (quickly written in sloppy language):

  1. Simple hierarchical model with individuals, groups, and a prior for the groups. If the prior for the group parameters is weak, and some group has only one or a few individuals per local parameter, then each individual is highly influential, and the posteriors with and without one observation can be so different that Pareto k will be high.
  2. In Gaussian latent variable models (e.g. GPs), each observation has a latent variable, and the total number of parameters (latent parameters plus prior parameters) is p > n. If the prior is weak (e.g. a GP with an exp_quad covariance function and a short lengthscale), and a specific observation is far away from the others, then the corresponding latent value is influenced mostly by that observation. On the other hand, a group of latent values close to each other correlate more and get information from other observations, too. Thus in some parts of the covariate space loo may work well, and in other parts it may fail and we observe high k values.

If the above makes any sense, I'll write it up later in less sloppy language, and if it doesn't make any sense, I'll try with more examples.

I was thinking of three different models. Which model would you like to see? If you have a model with high Pareto k values, I can analyse it and add it to the case study.



I am mostly interested in the first case you describe, as this is what I work with a lot of the time. Usually I am dealing with longitudinal models where I model data from individual subjects over time. Each subject has its own baseline value and very likely a different progression over time. So the simplest case is a varying intercept, varying slope model.

What I also encounter from time to time is a mixed data situation in the following sense: Suppose trial 1 is conducted such that I have few subjects, but very detailed information on them. Let's say this trial informs the subject-specific intercepts and slopes well. Then at a later stage I conduct trial 2, where I have many more subjects but collect only very little information beyond the baseline. Ideally I want to model the trial 1 and trial 2 data jointly, but the difficulty is that for trial 2 subjects I end up having very little information on the slope due to sparseness; still, due to the larger sample size, I have more information to squeeze out of the data in terms of how covariates may affect the population parameters.

This situation is a drastic simplification of what I have in mind. Still, any discussion along these lines would be great. Let me try to come up with a simulation example.

Another point to discuss in such a case study could be the break-down of loo and when one has to turn to cross-validation… which pulls in another risk, namely the high variance of CV (as I understood from your paper assessing the different model selection techniques).


A relevant question is then also whether you want to estimate the predictive performance for the next time point or for a new individual. In those cases loo is not the answer (although it can be a useful answer). It's clear that there is a need for more case studies.

Are trial 1 and trial 2 equally important when you consider one value as a summary of the predictive performance?

Any example I can work on and make public would be great.

loo (and waic) and k-fold-cv have similar variance, with loo (and waic) having slightly smaller variance for stable models (and Bayesian models are usually very stable compared to, e.g., maximum likelihood), so there is no need to avoid k-fold-cv if you are happy with loo. The high variance of loo, k-fold-cv, and information criteria alike is a risk if you have a large number of models. If you have a small number of models and use the error estimates, then the procedure is conservative (more likely to indicate no significant difference), which I think is not risky behavior.

Hi Aki, these suggestions are really helpful! I have some further questions which I am not sure I should be asking here, but I thought they are relevant to the loo package…

  1. I am not sure how to refine the number of parameters when comparing p_loo and p (to diagnose the importance sampling)…?

  2. Also, in my model, p_loo < p (using the number of parameters defined in the parameters block: ~70 for the total pooling model and 467 for the partial pooling model). But since p < N/5 ≈ 787, it seems not to belong to any of the categories above.
    I am trying to compare two regression models: total pooling and partial pooling. The total pooling model has no 'bad' k values, and the partial pooling model has several (<1%) 'bad' k values (there are outliers in the data). There are 3934 samples with 33 covariates, and the partial pooling model has 5 groups. For these two models, partial pooling has a slightly (?) higher elpd, but I am not sure if it makes sense to compare them if one of the approximations is not trustworthy (see below).

The results of LOO:
1. Total pooling model:

> LOO_trial8_DE ##Total pooling
Computed from 2000 by 3934 log-likelihood matrix

         Estimate    SE
elpd_loo  -2772.0  69.2
p_loo        56.6   3.3
looic      5543.9 138.4

Pareto k diagnostic values:
                         Count  Pct 
(-Inf, 0.5]   (good)     3932  99.9%
 (0.5, 0.7]   (ok)          2   0.1%
   (0.7, 1]   (bad)         0   0.0%
   (1, Inf)   (very bad)    0   0.0%

All Pareto k estimates are ok (k < 0.7)
See help('pareto-k-diagnostic') for details.

2. Partial pooling model:

> LOO_trial8_DE_hier ## partial pooling
Computed from 2000 by 3934 log-likelihood matrix

         Estimate    SE
elpd_loo  -2737.6  69.5
p_loo       181.0  10.7
looic      5475.3 139.0

Pareto k diagnostic values:
                         Count  Pct 
(-Inf, 0.5]   (good)     3881  98.7%
 (0.5, 0.7]   (ok)         38   1.0%
   (0.7, 1]   (bad)        14   0.4%
   (1, Inf)   (very bad)    1   0.0%
See help('pareto-k-diagnostic') for details.

3. Comparing the two models:

> loo::compare(LOO_trial8_DE, LOO_trial8_DE_hier)
elpd_diff        se 
     34.3      16.1 

Any insights/suggestions would be appreciated, thank you!

The question and LOO results are quite clear. The total pooling result is fine, with small k's. The partial pooling model is problematic because of the high k values. As p_loo = 181 is quite large compared to p = 467, we can assume that the model is flexible and it is just difficult to predict some left-out observations, so the full posterior and the loo posteriors are too different for PSIS-LOO. Are there different numbers of observations in the different groups? With that many high k's, elpd_loo is likely to be optimistic, and thus the true difference is likely to be even smaller; since the difference in the comparison is not significant now, it's likely that better computation wouldn't change the result.
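For intuition on the comparison output above: elpd_diff and se come from the pointwise elpd_loo differences, paired by observation. A small numpy sketch (Python, just to illustrate the arithmetic, using loo's second-minus-first sign convention):

```python
import numpy as np

def compare_elpd(pointwise_a, pointwise_b):
    """pointwise_a, pointwise_b: pointwise elpd_loo values for two models
    fitted to the same n observations. Returns (elpd_diff, se): the sum of
    the paired differences (model b minus model a) and the SE of that sum."""
    diff = np.asarray(pointwise_b) - np.asarray(pointwise_a)
    n = diff.size
    elpd_diff = np.sum(diff)
    se = np.sqrt(n * np.var(diff, ddof=1))  # sd of the differences, scaled to the sum
    return elpd_diff, se
```

Because the differences are paired by observation, this se is usually much smaller than the se of either model's elpd_loo alone; but as noted above, high khats can make the pointwise values themselves optimistic.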

I also recommend doing posterior predictive checking and LOO-PIT (see, e.g., https://arxiv.org/abs/1709.01449) so that you can be more confident that the high k's for the hierarchical model are just due to having such a flexible model that PSIS-LOO starts to fail (waic would fail, too).
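For continuous data, a LOO-PIT value is the importance-weighted probability that a posterior predictive replicate falls at or below the observed value. A minimal numpy sketch (Python; with raw unsmoothed weights, whereas bayesplot/loo use PSIS-smoothed ones):

```python
import numpy as np

def loo_pit(y, y_rep, log_lik):
    """y: (n,) observations; y_rep: (S, n) posterior predictive draws;
    log_lik: (S, n) pointwise log-likelihood draws."""
    log_r = -log_lik                       # raw log importance ratios for LOO
    log_r = log_r - np.max(log_r, axis=0)  # stabilize before exponentiating
    w = np.exp(log_r)
    w = w / np.sum(w, axis=0)              # self-normalized weights per observation
    # weighted empirical probability that a replicate is <= the observed value
    return np.sum(w * (y_rep <= y[None, :]), axis=0)
```

If the model is well calibrated, these values are approximately uniform on [0, 1]; a U-shaped histogram indicates predictive tails that are too short, and a hump-shaped one tails that are too long.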

Thank you for the suggestions!

The data has 5 groups in total, with sample sizes ranging from 552 to 1005 per group (552, 624, 770, 983, 1005).

There are indeed some outliers, which I would expect to be hard to predict from the rest of the data. I am not sure why they are so influential for the posterior, though. Would you suggest using more informative priors (only for the partial pooling model)?

Thank you for the suggestions! The posterior predictive check looks okay (I think): some outliers were not predicted accurately; no significant difference was observed between partial pooling and total pooling (although it is hard to tell from a visualization of ~4000 data points… that's why I am turning to LOO-CV to quantify it).

The visualization paper was very helpful! I am new to LOO-PIT, and not totally sure I am doing it correctly, but my plot looks like this (top two are total pooling, bottom two are partial pooling):

and pointwise differences in ELPD:

and khat (left: total pooling, right: partial pooling; the last group, the 3300th to 4000th observations, is more heterogeneous):

Any insights would be appreciated. Thank you so much again!

(On the other hand, I am not 100% sure the hierarchical model I am specifying is correct. I might start another thread for the model specification soon… here)

With that many observations, some large khats can also be due to Monte Carlo variability. I noticed that you used just 2000 draws, and you don't mention the effective sample size. You could try more chains and/or longer chains.


Oh, these look really bad! If you have made them correctly, then the conditional distribution has too-short tails, and the right-hand tail is even shorter.

Thank you! I fit the same model to simulated data, which is less noisy, and the PPC and posterior estimates look okay. But I am still getting a short right-hand tail:

I tried to look up the description of PIT, but am still not sure what I am doing wrong with the model and how to improve it. If I understand it correctly, the short tail on the right side indicates that even the sample (i) with the highest P(y_i | y_{-i}) is still not predicted correctly by this model (well, that doesn't sound right to me…)??
Thank you for the help in advance!

Can you post the model and the code for the simulated data?

I am simulating data with many zeros (for both the explanatory and response variables), and doing the regression for two parameters, q and \mu.

rhs.stan (3.3 KB)
Data_generation.stan (1.1 KB)

modeling.R (2.4 KB)
PPC.pdf (11.7 KB)
PIT.pdf (5.4 KB)

Thank you so much again for the help!!!

Thanks for the code, that clarified a lot. Discrete data is a bit more difficult for PIT. The current code assumes a continuous distribution. It works for count data with many distinct values, but a ZIP with many zeros brings up the usual problem with count data. See Czado, Gneiting, and Held (2009). Predictive Model Assessment for Count Data. https://onlinelibrary.wiley.com/doi/full/10.1111/j.1541-0420.2009.01191.x (sorry, paywalled).
Although with many zeros it's likely that even the version for count data may have problems. In this case LOO-PPC would be easier to interpret than LOO-PIT.
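For reference, the randomized PIT for discrete data from the Czado, Gneiting, and Held paper draws each PIT value uniformly within the step of the predictive CDF at the observed count, which restores uniformity under a calibrated model. A small numpy sketch (Python; the function name and interface are illustrative, not from any package):

```python
import numpy as np

def randomized_pit_discrete(cdf_at_y, cdf_at_y_minus_1, rng=None):
    """cdf_at_y[i] = F_i(y_i) and cdf_at_y_minus_1[i] = F_i(y_i - 1)
    (use 0 below the support), where F_i is the predictive CDF for
    observation i. Returns u_i = F_i(y_i - 1) + v_i * (F_i(y_i) - F_i(y_i - 1))
    with v_i ~ Uniform(0, 1)."""
    rng = np.random.default_rng() if rng is None else rng
    lo = np.asarray(cdf_at_y_minus_1, dtype=float)
    hi = np.asarray(cdf_at_y, dtype=float)
    return lo + rng.uniform(size=lo.shape) * (hi - lo)
```

The "nonrandomized yet uniform" version in the paper averages over v instead of sampling it, which removes the extra Monte Carlo noise from the histogram.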

@jonah I made an issue in rstantools. See the above paper, p. 1255: "a nonrandomized yet uniform version of the PIT histogram".


If I understood correctly, you mean that for count data the LOO-PIT may have problems? For instance, I use two models: 1) a deflation at zero (somewhat the opposite of zero inflation) and 2) a truncated distribution on [-3,3] + deflation at zero, because I have count data in the range [-3,3] excluding 0. My loo and waic outputs for both models are these:

Deflated at 0

    Computed from 6000 by 132 log-likelihood matrix

         Estimate   SE
elpd_loo   -247.0  8.3
p_loo        17.1  2.8
looic       494.0 16.7
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     129   97.7%   1455      
 (0.5, 0.7]   (ok)         2    1.5%   717       
   (0.7, 1]   (bad)        1    0.8%   29        
   (1, Inf)   (very bad)   0    0.0%   <NA>      
See help('pareto-k-diagnostic') for details.
Warning message:
Some Pareto k diagnostic values are too high. See help('pareto-k-diagnostic') for details.

Computed from 6000 by 132 log-likelihood matrix

      Estimate   SE
elpd_waic   -246.0  8.1
p_waic        16.1  2.4
waic         492.0 16.1
Warning messages:
1: 8 (6.1%) p_waic estimates greater than 0.4. We recommend trying loo instead. 
2: 8 (6.1%) p_waic estimates greater than 0.4. We recommend trying loo instead. 

Truncated at [-3,3] + Deflated at 0

Computed from 6000 by 132 log-likelihood matrix

         Estimate   SE
elpd_loo   -190.0 10.5
p_loo        19.0  2.3
looic       380.0 20.9
Monte Carlo SE of elpd_loo is 0.1.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

Computed from 6000 by 132 log-likelihood matrix

          Estimate   SE
elpd_waic   -189.7 10.4
p_waic        18.7  2.3
waic         379.3 20.9
Warning messages:
1: 15 (11.4%) p_waic estimates greater than 0.4. We recommend trying loo instead. 
2: 15 (11.4%) p_waic estimates greater than 0.4. We recommend trying loo instead. 


Following this, I have also implemented the LOO-PPC, which is easier to interpret (if I understood correctly). For both models, the LOO-PPC plots are presented below for a subset of 3/4 of the total observations with prob = 0.95 (95% posterior intervals):

Deflated at 0:

Truncated + Deflated at 0:


Thanks for the results and plots. This gives me more motivation to fix LOO-PIT for discrete data more quickly (pinging @jonah). Your LOO-PIT plots are wrong, and I know how we should fix this.

It's not clear what your models are, but the truncated one is clearly better based on loo and the last plots. Without further details, it's not clear whether there could be an even better model.


As to the number of parameters p, I am not sure what you mean by the "parameters block" (I am using brms instead of raw Stan). I can get a list of parameters using the parnames() function in brms. Can its length be used as the number of parameters?

If yes, then I have a case where p_loo is around 800 and p is around 1200. If n is the total number of data points, then n/5 = 47508. So it doesn't fit any of your categories. In this case, how should I check whether my model is specified correctly or not?

This is the loo result:

Thank you!


Unfortunately, the help text for parnames doesn't say whether it also includes transformed parameters. @paul.buerkner, can you clarify?

Can you also show the brms formula you are using? Maybe you have a hierarchical model, and the observations with high khats are in groups with only a few observations?


Thanks for your reply! This is the command I used to fit the model:

There are 237541 data points in data2fit.all. The data points are from 157 genes and 1513 cells from 8 cell types. For this model, there are 157*8 = 1256 group levels. sc.prop is the only predictor in the model. I chose the overdispersion and zero-inflation parameters of the ZINB distribution to be the same across all levels.

A few questions:

(1) It seems that each data point has a khat. Forgive my limited knowledge of how khat is estimated. Does a high khat mean the model fits badly for the given data point? I assume "fits badly" here means the model is not able to generate a value similar to the observed value during a posterior predictive check?
(2) In my case, the loo result shows 15 observations with k greater than 0.7. That seems a small number compared to the total number of data points. Can I ignore this, just going by the number of data points with khat > 0.7? Does a small proportion indicate a good fit?
(3) This may be a stupid question. When I fit a model using variational inference, I often get warnings about high Pareto khat. When I fit a model using MCMC sampling, I usually don't see this information. I understand that loo() is a generic function for models fitted with any method. But can I ignore the loo() result if my model fitted by MCMC sampling shows good convergence (no divergences)?

parnames contains a lot of things (including some generated quantities but excluding some parameters), and I wouldn't use it to compare against p_loo.

It means that the observation is highly influential, and the leave-one-out posterior obtained by leaving that observation out is so different from the full posterior that importance-sampling-approximated LOO fails (waic would fail, too). See the Pareto smoothed importance sampling paper for more. An observation can be highly influential for different reasons; see the Pareto k estimates section of the LOO Glossary.
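To illustrate where khat comes from: it is the estimated shape parameter of a generalized Pareto distribution fitted to the upper tail of the importance ratios. Below is a rough scipy sketch (Python); the loo package uses a more careful estimator and then smooths the tail weights, which this sketch does not do.

```python
import numpy as np
from scipy.stats import genpareto

def khat_of_ratios(log_ratios, tail_frac=0.2):
    """Fit a generalized Pareto distribution to the exceedances of the
    largest importance ratios and return its shape parameter (khat).
    Large khat (> 0.7) means the ratios are so heavy-tailed that the
    importance sampling estimate is unreliable."""
    r = np.exp(np.asarray(log_ratios) - np.max(log_ratios))  # scale-free ratios
    m = max(5, int(tail_frac * r.size))                      # tail sample size
    tail = np.sort(r)[-m:]
    cutoff = tail[0]
    # maximum-likelihood GPD fit to tail exceedances, location fixed at 0
    c, loc, scale = genpareto.fit(tail - cutoff, floc=0.0)
    return c
```

With light-tailed ratios the fitted shape is near 0 and the LOO estimate is reliable; with heavy-tailed ratios the shape grows past 0.7 and a few draws dominate the weights.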

With khat > 0.7 the error can be negligible or very large, each with some probability, but as you can't know how large the error can be or what that probability is, what to do depends on how you are going to use the model. I recommend looking at those 15 observations and checking whether you can see why they are highly influential. This will also improve your understanding of the phenomenon, the data, and the model.

A small proportion means that only a small proportion of the observations is highly influential, and there can be different reasons why observations are highly influential; see the link above.

That is likely, as variational inference is less accurate than MCMC.

The purpose of convergence diagnostics is different from the purpose of loo(). In addition to checking divergences, it is recommended to use other convergence diagnostics as well (e.g. Rhat and ESS, and also look at the MCSE for quantities of interest). If the convergence diagnostics don't alert you to possible issues, then loo() can be used as part of model checking, assessment of predictive performance, and model comparison.