Good PP check and R square but large Pareto k values

Recently, I encountered a problem when using PSIS and Pareto \hat{k} for model diagnostics and selection. After reading a few threads and notes, I am still confused, so here is the full story.

Here is a visualization of the lasrosas.corn data from the agridat package.


I started with a simple linear model Y = X\beta + e (model 1) and then added a random term, Y = X\beta + Zu + e (model 2, random regression coefficients), where \mathrm{var}(u) = \Sigma_u \otimes \Sigma_w, \Sigma_u = \operatorname{diag}(\sigma_1, \sigma_2, \sigma_3)\,\mathrm{LKJcorr}(1)\,\operatorname{diag}(\sigma_1, \sigma_2, \sigma_3), and \Sigma_w is a predefined matrix with two parameters.

Model 2 is better than model 1 and this can be seen from LOO and PP check.

Bayes R^2 for models 1 and 2 (median, Q2.5, Q97.5):

Model 1: 0.007883228, 0.001704781, 0.0171827
Model 2: 0.9717016, 0.9741885, 0.9764307
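For reference, these intervals can be computed roughly as follows (a sketch; fit1 is a placeholder name for the brms fit of model 1):

```r
# Sketch: bayes_R2(..., summary = FALSE) returns posterior draws
# of R^2, which can be summarised with any quantiles.
r2_draws <- bayes_R2(fit1, summary = FALSE)
quantile(r2_draws, probs = c(0.025, 0.5, 0.975))
```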

LOO for model 1

Computed from 4000 by 1674 log-likelihood matrix

         Estimate   SE
elpd_loo  -7803.0 12.9
p_loo         3.2  0.1
looic     15606.0 25.8
Monte Carlo SE of elpd_loo is 0.0.

All Pareto k estimates are good (k < 0.5).
See help('pareto-k-diagnostic') for details.

and for model 2:

Computed from 2400 by 1674 log-likelihood matrix

     Estimate    SE
elpd_loo  -4948.9 136.0
p_loo       338.9  41.2
looic      9897.8 272.1
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                     Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     1551  92.7%   233       
 (0.5, 0.7]   (ok)        118   7.0%   69        
   (0.7, 1]   (bad)         3   0.2%   31        
   (1, Inf)   (very bad)    2   0.1%   2         
See help('pareto-k-diagnostic') for details.

Even though model 2 is better than model 1, it has more bad k values and a terrible LOO-PIT plot.

Then I ran a model 3, in which I treated the observations with \hat{k} > 0.7 as outliers and added an indicator covariate (1 for these points, 0 otherwise).
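The flagged indices can be extracted with the loo package (a sketch; fit2 and data.randU are placeholder names for the model 2 fit and the data):

```r
library(loo)

# Sketch: compute PSIS-LOO for model 2
loo2 <- loo(fit2)

# Indices of observations with smoothed Pareto k above 0.7
bad_idx <- pareto_k_ids(loo2, threshold = 0.7)

# 0/1 indicator used as the extra covariate in model 3
data.randU$outlier <- as.integer(seq_len(nrow(data.randU)) %in% bad_idx)
```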

R^2 of model 3: 0.9777519, 0.9797304, 0.9814732
and LOO:

 Computed from 2000 by 1674 log-likelihood matrix
             Estimate    SE
    elpd_loo  -4791.8  53.3
    p_loo       402.7  26.8
    looic      9583.7 106.5
    Monte Carlo SE of elpd_loo is NA.

    Pareto k diagnostic values:
                             Count Pct.    Min. n_eff
    (-Inf, 0.5]   (good)     1490  89.0%   163       
     (0.5, 0.7]   (ok)        166   9.9%   102       
       (0.7, 1]   (bad)        15   0.9%   43        
       (1, Inf)   (very bad)    3   0.2%   3    

R^2 increases slightly, but the PP check doesn't change much. For the PSIS diagnostic, I get more bad values in model 3.

So it looks like the better the model is, the more bad k values there are.

After that, I ambitiously ran a very complex model 4, which might be overfitting, checked the same criteria, and got the following results:

R^2 of model 4: 0.9992851, 0.9995073, 0.9996659

Computed from 4000 by 1674 log-likelihood matrix
         Estimate   SE
elpd_loo  -4506.7 31.3
p_loo       974.0 11.1
looic      9013.4 62.5
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)      86    5.1%   450       
 (0.5, 0.7]   (ok)       718   42.9%   132       
   (0.7, 1]   (bad)      834   49.8%   17        
   (1, Inf)   (very bad)  36    2.2%   4         
See help('pareto-k-diagnostic') for details.

Model 4 has the smallest looic value and the highest R^2, but also the most bad k values.

Now I am confused. These model diagnostic tools and selection methods seem to contradict each other. If I focus on only one selection criterion, such as R^2 or looic, model 4 is the best. However, the PSIS \hat{k} diagnostic tells me that this model is misspecified, as it has the most bad values. Then again, it is still possible that model 4 is good but very flexible, and PSIS simply failed on it.

I believe model 2 is better in terms of parsimony and the balance among these criteria, but I can't convince myself about its bad \hat{k} values and terrible LOO-PIT plot. Might it be too flexible as well?

Here are my questions:

  1. Does that mean model 2 is misspecified?
  2. If model 2 is not misspecified but flexible, because p_loo = 338.9 < p (the number of parameters) and p_loo > 1674/5 = 334.8, how do I explain these bad k values and the LOO-PIT plot, or should I just ignore them?
  3. Does running a longer chain help reduce the bad k values?
  4. Should I use model 4 and conclude that the PSIS \hat{k} diagnostic fails in this scenario?
  5. k-fold CV might be good for comparison, but it does not address the problem.
  6. Is there a simple and fast way to sample directly from p(\theta \mid y_{-i})?

Thank you all for reading my post and providing comments.

R version 3.6.3 (2020-02-29)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
other attached packages:
 [1] rstantools_2.1.1      loo_2.3.1        bayesplot_1.7.2      brms_2.13.5          rstan_2.19.3

28/08, a comment: I think I know why LOO and LOO-PIT failed on this problem: the data are correlated with their neighbouring points. LOO assumes that the data are independent and that each point is not affected by its neighbours. However, for these data and model 2, we assume autocorrelation with parameter \rho. To address this, I might have to run a moving-window CV, which leaves out a cluster of correlated data points for testing. This is just my guess; other comments are welcome. Thank you.


This is wrong. 1) Looking at the data, the linear model is the wrong model, and there are not really “outliers”. 2) If you did have outliers, you should change the model so that any observation can be an outlier and let the data influence the posterior. Don’t use PSIS-LOO Pareto k values as an outlier detector for the next model.

Based on the results you show, it looks like the more flexible the model is, the more it overfits, and the overfitting is indicated by high Pareto k values.

These measures are not independent. If there are many high Pareto k values, as in the case of model 4, then elpd_loo (or looic) can’t be trusted. And even if there were no high Pareto k values, R^2 can’t be trusted when p_loo is high relative to the total number of parameters or the number of observations, as in models 2-4. So there is no contradiction here, but you need to take into account when the diagnostics tell you that some other measure can’t be used.

When you don’t get high Pareto k values, then LOO does the balancing automatically and you can directly compare elpd_loo values with considering the amount of pasimony yourself.

Looking at the plotted data and the model equation, it’s clear that the model is misspecified. The LOO-PIT plot confirms this (there are some high Pareto k values, but if the LOO-PIT plot looks bad with high Pareto k values, it would also look bad without them).

It’s both misspecified and flexible. High Pareto k values are probably mostly due to the flexibility. Don’t ignore them.

Sometimes. The PSIS paper has illustrations showing that sometimes the importance ratio saturates far in the tail of the posterior, so a larger number of draws can eventually reach that region and the behavior of the estimator improves. You could, for example, double the chain length, and you might get a few Pareto k values to become smaller, but it’s unlikely that you would get them all below 0.7 in this case.
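Doubling the chain length could look like this with brms (a sketch; fit2 is a placeholder for the original fit with iter = 2000):

```r
# Sketch: update() re-runs the same model with more iterations,
# reusing the compiled Stan model.
fit2_long <- update(fit2, iter = 4000, warmup = 1000)
loo(fit2_long)  # check whether any Pareto k values improved
```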

You didn’t tell what is model 4. You should do the model checking with some other approach, e.g. k-fold-CV.

Doesn’t address which problem? It would give you reliable cross-validation in cases where PSIS fails, and it would be better than a non-cross-validated R^2.
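A sketch of k-fold CV with brms (fit2 and fit4 are placeholder names for the model fits):

```r
# Sketch: 10-fold CV refits the model 10 times, so it doesn't
# rely on importance sampling and has no Pareto k values to fail.
kf2 <- kfold(fit2, K = 10)
kf4 <- kfold(fit4, K = 10)
loo_compare(kf2, kf4)
```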

The simple way is to run Stan on data with the ith observation removed. The fast way is to use PSIS-LOO, which has a diagnostic that tells you when the fast way fails.
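With brms the two can be combined (a sketch; fit2 is a placeholder for the model fit):

```r
# Sketch: reloo = TRUE refits the model once per observation with
# Pareto k > 0.7, leaving that observation out, and keeps the fast
# PSIS-LOO estimate everywhere else.
loo2_exact <- loo(fit2, reloo = TRUE)
```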

No. LOO-PIT showed the model misspecification correctly; no failure there. PSIS-LOO had some high Pareto k values for models 1-3 and many for model 4, but that has nothing to do with correlated data.

No. This is a false, twisted statement making the rounds on the internet. If the data were independent, so that you could not learn anything about the value of the function from neighboring observations, you could not learn anything at all. See CV-FAQ question 9 for more.

Looking at the data you show, you should use a non-linear model such as splines or a Gaussian process.
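A minimal sketch of what such models could look like in brms for the lasrosas.corn data (variable names yield and nitro follow agridat; priors and other details omitted):

```r
library(brms)
library(agridat)
data(lasrosas.corn)

# Sketch: a penalized spline and a Gaussian process for the
# yield-vs-nitrogen relationship; both relax the linearity assumption.
fit_spline <- brm(yield ~ s(nitro), data = lasrosas.corn)
fit_gp     <- brm(yield ~ gp(nitro), data = lasrosas.corn)

loo_compare(loo(fit_spline), loo(fit_gp))
```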


Dear @avehtari, thank you so much for replying to my inquiry and addressing the questions.

Apologies in advance that I didn’t explain the data and models clearly in the initial post.
At this stage, we are using a random regression coefficient (RRC) model to fit the corn yield data lasrosas.corn from the agridat package. We convert it to a 93\times18 grid with 1674 units. What we are doing is finding the optimal nitrogen level N for each unit. However, in the data file, only one treatment is observed at each unit, so it is not possible to find the optimal solution per unit (it is possible to find the average optimal solution across the field, but not for each unit). We therefore adopted the RRC model suggested by H. Piepho et al. in their equations (1)-(4).
Mathematically, the model is y_i = \beta_0+\beta_1N_i+\beta_2N_i^2+ (\beta_{0i}+\beta_{1i}N_i+\beta_{2i}N_i^2)+e_i for each unit i=1,\ldots,n, where N_i is the nitrogen level, entering the yield quadratically. For model 1.1 (different from the linear model 1), I used \mathrm{var}(\beta)=\Sigma_\beta\otimes I, but for model 2 \mathrm{var}(\beta)=\Sigma_\beta\otimes \Sigma_w, where \Sigma_w is a spatial covariance matrix.
However, both of these models are misspecified according to the LOO-PIT plot and Pareto k. We would like to try a Gaussian mixture model later on.

I used PSIS-LOO Pareto k and found a few highly influential data points, among which two are suspicious outliers visible in the scatter plot (one in purple at the bottom-left corner and one in green on the right side). Instead of deleting them (because some packages can’t deal with NA values), I added an extra random term, say Z_1u_1, to the model. I thought it would fix the issue and augment the model, but it didn’t, and it brought more high \hat{k} values.

Model 4 is model 2 plus s(long_i,lat_i)+replicate_i, i.e. a thin-plate regression surface and a replication term.

k-fold CV is good for model comparison, but the high Pareto k issue still exists as long as the model is misspecified.

  1. Here comes another question: I don’t understand why the wrong model 1 has better Pareto k performance and a smaller p_loo value.

  2. Here is another interesting issue with the LOO-PIT plot: with two different LOO-PIT code snippets, the plots for the same model and data are different:

### code 1 (loo >= 2.0 + bayesplot)
psis_result <- psis(-loglik_randU)
ppc_loo_pit_overlay(
  y = data.randU$Y,
  yrep = y_rep_randU,
  lw = weights(psis_result)
)
### code 2 (older loo API)
psis_result <- psislw(-loglik_randU)
pit1.randU <- loo_pit(object = y_rep_randU,
                      y = data.randU$Y,
                      lw = psis_result$lw_smooth)
unifs.randU <- matrix(runif(length(pit1.randU) * 100), nrow = 100)
ppc_dens_overlay(pit1.randU, unifs.randU)

Did I misunderstand something in the code? I used code 1 to re-plot the LOO-PIT for model 2, and the result looks okay to me. Can I then trust model 2 even though there are 5 high Pareto \hat{k} values and p_loo = 338.9 \approx 1674/5 = 334.8?

Please accept my gratitude. Thank you.

Out of curiosity where is this misunderstanding making rounds on the internet?

Hi @jonah, thanks for your interest in this thread.

I found it from this paper The spatial leave-pair-out cross-validation method for reliable AUC estimation of spatial classifiers that says “Second, spatial autocorrelation causes standard CV methods to produce optimistically biased prediction performance estimates for spatial data. This is caused by the fact that leave-one-out and K-fold rely on the assumption that the data is independent and identically distributed (i.i.d.) —an assumption violated by spatial data where close instances tend to be more similar than ones distant from each other” in the top paragraph on page 732.

However, I didn’t find any similar statements in other publications. Then I came across the page Cross validation of time series data explains CV. It says “However, classical cross-validation techniques such as KFold and ShuffleSplit assume the samples are independent and identically distributed, and would result in unreasonable correlation between training and testing instances (yielding poor estimates of generalisation error) on time series data.”

This makes more sense: it is the training and testing data sets that should be independent, not the data itself.

Please feel free to point out any mistakes that I made here in the thread. I appreciate your time and help.


Thanks for the details.

This is true for many spatial models and we have a short paper and case study about it (I think we focus on temporal correlation but same goes for spatial correlation):

It’s this part that is not quite right because the data doesn’t have to be i.i.d., it just needs to be conditionally independent given the model parameters. The challenge with some spatial and temporal models is that there is residual dependence even after conditioning on the parameters. But for many of the most common cases of this there are ways to get around the problem, which we discuss in the case study and paper linked above.

Hi @jonah, thanks for the reply and the links to the paper. I have read it, and the approximate LOO-CV works exactly the same as the exact LOO-CV in my case study. Good job; I hope it will be published in the near future. I thought PSIS-LOO-CV didn’t work well for my model because the data is spatially correlated, but in fact it is all about the model. It might be a misstatement in that paper.

However, when I tried the following code for LOO-PIT, the results were different. Code 1 uses the bayesplot package and code 2 is from the paper “Visualization in Bayesian workflow”.

The result from ppc_loo_pit_overlay looks promising and indicates that model 2 is not misspecified, and p_loo is relatively small compared to the number of parameters and observations. Here are my two questions:

  1. Is there a paper discussing p_loo, such as what this thread “A quick note what I infer from p_loo and Pareto k values” does?
  2. Can I trust model 2 even though it has 5 high Pareto \hat{k} values?

Thanks again for your help.

Are those overlaid lines in the data plot in your first post from this model? I’m suspicious of this model type.

Yes, I can see them now, too.

Adding an extra term that is influenced by just one data point makes that data point very influential, and a highly influential data point can make \hat{k} high even if the model is correctly specified. Solutions are to integrate over the random effects, analytically or e.g. with a Laplace approximation, or to try Student’s t as an observation model.

It seems that, in addition to the misspecification, you also possibly have a very flexible model that PSIS can’t handle.

A simpler model’s posterior changes less when an observation is removed. Pareto k is a measure of influence; high influence can come either from bad misspecification or from a very flexible model. The other models are more complex and thus have higher estimated complexity (p_loo), and PSIS fails more often because the posterior is sensitive to the removal of individual observations.

Sorry, I didn’t have time to figure this out.

As Jonah said, it’s important to say “conditionally independent”, and even that is not always required; it would be better to say exchangeable (see, e.g., Chapter 5 in BDA3).

Another mistake is to confuse the dependency structure with the prediction task. People are just fine doing LOO or K-fold CV for univariate or bivariate non-linear regression, where it is also assumed that nearby observations are likely to be similar. Switching to the time or spatial case doesn’t change the similarity assumption about nearby observations, but it often changes the prediction task, as the future observation is not necessarily exchangeable with the past observations (i.e., a future observation has a time larger than all past observations). If the prediction task is to predict the future, or regions outside the current regions, then that prediction task changes how we want to simulate predictions in cross-validation. To repeat: time series or spatial data does not violate anything in cross-validation just because close instances tend to be more similar than distant ones.

For the usual CV, the training and test data sets should be exchangeable (no need for independence), and if the prediction task is to predict for a time after all past observations, then the time information is the additional information breaking the exchangeability.

No paper, but the CV-FAQ has a bit more, and I guess we’ll include this discussion in our next book.


Dear @avehtari, thank you again for taking the time to reply to this thread and address my questions.

Yes, here is the PP check plot for model 2, and it seems promising. However, around the first peak the model still misses some observations.

Yes, you are right. By adding an extra term to model 2, model 3 got more high \hat{k} values.

That’s all right. You have helped me a lot. I will keep it in mind and try to figure it out.

Exchangeable, not independent. Thank you for making that clear to me; this principle is not well explained in the references.

I appreciate the information and advice you have shared.


The term “exchangeable” is used in Bayesian theory; it’s a weaker condition than independence, but also more descriptive, I think. The non-Bayesian cross-validation literature is unlikely to use the term exchangeability, which makes things more complicated.


Dear @avehtari, thanks for the quick reply. I didn’t know that the concept is mainly used in Bayesian analysis. I encountered this issue while reading “Visualization in Bayesian workflow” and applying it to my data set. Thank you for pointing it out and distinguishing Bayesian and non-Bayesian CV. I very much appreciate your help.