Posterior Predictives look good, but PSIS-LOO and WAIC are bad

I am working on a model with a Beta likelihood and varying intercepts (non-centered parameterization), which I intend to use to extrapolate back in time rather than for forward prediction.

The posterior predictives look good: the credible intervals are relatively narrow and the median estimates align nicely with the observed data. But almost all of the Pareto k estimates are over 0.7, and all of the WAIC diagnostic values are over 0.4.

If I remove the varying intercepts, the credible intervals on the predictions widen dramatically, essentially filling the response scale (0 to 0.8), and the median estimates are way off for a number of the groups. Yet in this case both PSIS-LOO and WAIC suggest the model fits much better.

Is this more likely a model problem or an interpretation problem?

With bad Pareto k estimates, you cannot trust PSIS-LOO or WAIC to identify the model with better predictive performance. More details about your model and the output of LOO are needed before advising further, except to say that you might be able to get robust LOO-CV by moment matching or (if the model doesn't take too long to fit) by actually running exact leave-one-out cross-validation, refitting once per held-out observation. The simplest way to do this depends on which interface/package you are using to fit the model.
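As a hedged sketch of the moment-matching route: assuming you have an rstan `stanfit` object `fit` whose `generated quantities` block computes a `log_lik` vector (one entry per observation), recent versions of rstan expose moment matching directly through their `loo()` method:

```r
# Sketch, not a drop-in solution: assumes `fit` is a stanfit whose generated
# quantities block defines a vector `log_lik` with one element per observation,
# and an rstan version recent enough to support the moment_match argument.
library(rstan)

loo_plain <- loo(fit)                       # ordinary PSIS-LOO; may warn about high k
loo_mm    <- loo(fit, moment_match = TRUE)  # adjusts problematic importance weights
                                            # without refitting the whole model
print(loo_mm)                               # check whether the k diagnostics improved
```

If many k values remain above 0.7 even after moment matching, fall back to exact LOO (one refit per observation) or K-fold-CV.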

Finally, don’t confuse lack of misspecification (i.e. capturing the true generative process), predictive performance (the ability to predict new or held-out observations), and the Pareto k diagnostic.

  • Sometimes the correctly specified model (i.e. one that captures the true generative process) can yield worse predictive performance than a simpler model. Edit: this distinction can sometimes matter a lot, depending on whether your goal is prediction or accurate uncertainty quantification around particular covariate(s).
  • Bad Pareto k doesn’t necessarily mean that the model is bad, either in the sense of poor prediction or misspecification. In some scenarios it can be indicative of misspecification, but you need to consider more information to conclude whether that is the case (see here A quick note what I infer from p_loo and Pareto k values). The only thing that high Pareto k means in all situations is that you probably shouldn’t trust the PSIS-LOO or WAIC computations to correctly determine which model yields the best predictive performance, as assessed by genuine leave-one-out cross-validation.

@jsocolar’s answer is very good. For further comments, could you show the loo output for both models and also tell us the number of parameters in each model? The loo package glossary (loo-glossary • loo) gives some guidance, but with that additional information we can help with the interpretation.


Thank you @jsocolar and @avehtari !

I’m using RStan for these models.

The full output for the model with good PPC’s is:

Computed from 8000 by 62 log-likelihood matrix

         Estimate  SE
elpd_loo    151.0 2.1
p_loo        79.9 7.3
looic      -302.1 4.2
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)      0     0.0%   <NA>      
 (0.5, 0.7]   (ok)        1     1.6%   399       
   (0.7, 1]   (bad)      42    67.7%   33        
   (1, Inf)   (very bad) 19    30.6%   6         
See help('pareto-k-diagnostic') for details.

My parameters are as follows:

parameters {
        real beta;                  // Population-level effect
        real<lower=0> kappa;        // Precision parameter
        vector<lower=0>[1] sd_grp;  // Group standard deviation
        vector[62] mu_grp[1];       // Standardized group means
}
transformed parameters {
        vector[62] grp_int = sd_grp[1] * mu_grp[1];  // Group intercepts
}

For the model with poor PPC:

Computed from 8000 by 62 log-likelihood matrix

         Estimate    SE
elpd_loo      0.3   0.1
p_loo      2019.0 300.9
looic        -0.6   0.2
Monte Carlo SE of elpd_loo is 0.2.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     55    88.7%   671       
 (0.5, 0.7]   (ok)        7    11.3%   1420      
   (0.7, 1]   (bad)       0     0.0%   <NA>      
   (1, Inf)   (very bad)  0     0.0%   <NA>      

All Pareto k estimates are ok (k < 0.7).
See help('pareto-k-diagnostic') for details.

The parameters for that one:

parameters {
        real alpha;
        real beta1;           // Population-level effect for group
        real beta2;           // Population-level effect for x
        real<lower=0> kappa;  // Precision parameter
}

So there are 62 observations, a separate parameter for each observation, plus some more, and p_loo > 62. The interpretation is that your model is too flexible for PSIS-LOO (or WAIC) to be usable. That p_loo > 62 also hints that the group means vary so much that it is very difficult to predict anything useful for a new (or left-out) group. There is also a possibility that the model is misspecified and there are some serious outliers. You should check the distribution of the means of mu_grp. It would be better to use K-fold-CV for this model.
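A minimal K-fold sketch with rstan and the loo package might look like the following. The `holdout` data field, and the requirement that your Stan program skip held-out points in its likelihood while still filling `log_lik` for every point, are assumptions about your setup, not something your posted code already does:

```r
# Sketch of 10-fold CV. Assumes: `mod` is a compiled stanmodel, `standata` is
# your data list with N observations, and the Stan program excludes points
# with holdout == 1 from the likelihood but computes log_lik for all points.
library(loo)          # kfold_split_random(), extract_log_lik()
library(matrixStats)  # logSumExp()

K <- 10
fold <- kfold_split_random(K = K, N = standata$N)  # random fold assignment
elpd_i <- numeric(standata$N)

for (k in seq_len(K)) {
  data_k <- standata
  data_k$holdout <- as.integer(fold == k)          # mark fold k as held out
  fit_k <- rstan::sampling(mod, data = data_k, refresh = 0)
  ll <- extract_log_lik(fit_k)                     # draws x N matrix
  idx <- which(fold == k)
  # elpd for each held-out point: log of the mean predictive density over draws
  elpd_i[idx] <- apply(ll[, idx, drop = FALSE], 2,
                       function(x) logSumExp(x) - log(length(x)))
}

elpd_kfold <- sum(elpd_i)  # compare against the other model's K-fold elpd
```

Computing the same quantity for both models puts them on a common scale that does not rely on the PSIS approximation at all.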

The interpretation here is that your model is very badly misspecified and the data distribution has much thicker tails than the model allows (there are outliers). There is no need to use K-fold-CV for this one, as it's already obvious that the model is very bad.

The next step forward would be to tell us more about your data and model. You might also get some ideas for the workflow from [2011.01808] Bayesian Workflow.

And here’s a vignette, Bayesian data analysis - roaches cross-validation demo, that illustrates a badly misspecified model (Section 2, Poisson) that can be improved by adding a parameter for each observation (Section 4, Poisson model with “random effects”), but then the PSIS approximation in PSIS-LOO fails; and a more robust model (Section 3, negative-binomial) that corresponds to the Poisson model with the random effects integrated out analytically. Your PPC-failing model is like the plain Poisson; your model that passes PPC but gets high khat values is like the Poisson with random effects.

Thanks, @avehtari !

There are definitely some outliers in the data. I’ll give K-fold-CV a try.