Recently, I encountered a problem with using PSIS and Pareto \hat{k} for model diagnostics and selection. After reading a few threads and notes:
 A quick note what I infer from p_loo and Pareto k values,
 Recommendations for what to do when k exceeds 0.5 in the loo package?,
 Improve model with some observations Pareto >0.7,
 Bayesian data analysis  roaches crossvalidation demo,
 16 What to do if I have many high Pareto k’s?

Pareto K for outlier detection,
I am still confused about the results of my models.
Here is a visualization of the data lasrosas.corn
from agridat
package.
I started with a simple linear model Y=X\beta+e (model 1) and then added a random term Y=X\beta+Zu+e (model 2, random regression coefficients), where var(u)=\Sigma_u\otimes\Sigma_w, \Sigma_u=\begin{bmatrix}\sigma_1 & & \\ & \sigma_2 & \\ & & \sigma_3\end{bmatrix}LKJcorr(1)\begin{bmatrix}\sigma_1 & & \\ & \sigma_2 & \\ & & \sigma_3\end{bmatrix}, \Sigma_w is a predifined matrix with two parameters.
Model 2 is better than model 1 and this can be seen from LOO and PP check.
Bayes R^2 for model 1 and 2 with median, Q2.5 and Q97.5:
Model1 0.007883228, 0.001704781, 0.0171827
Model2 0.9717016, 0.9741885, 0.9764307
LOO for model 1
Computed from 4000 by 1674 loglikelihood matrix
Estimate SE
elpd_loo 7803.0 12.9
p_loo 3.2 0.1
looic 15606.0 25.8

Monte Carlo SE of elpd_loo is 0.0.
All Pareto k estimates are good (k < 0.5).
See help('paretokdiagnostic') for details.
and model 2.
Computed from 2400 by 1674 loglikelihood matrix
Estimate SE
elpd_loo 4948.9 136.0
p_loo 338.9 41.2
looic 9897.8 272.1

Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(Inf, 0.5] (good) 1551 92.7% 233
(0.5, 0.7] (ok) 118 7.0% 69
(0.7, 1] (bad) 3 0.2% 31
(1, Inf) (very bad) 2 0.1% 2
See help('paretokdiagnostic') for details.
Even though model 2 is better than model 1, it has more bad values and terrible LOOPIT plot.
Then I run a model 3 where I put bad values which have larger \hat{k}>0.7 as outliers in the model and indicate them with 1 and otherwise 0.
R^2 of Model3 0.9777519 0.9797304 0.9814732
and LOO
Computed from 2000 by 1674 loglikelihood matrix
Estimate SE
elpd_loo 4791.8 53.3
p_loo 402.7 26.8
looic 9583.7 106.5

Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(Inf, 0.5] (good) 1490 89.0% 163
(0.5, 0.7] (ok) 166 9.9% 102
(0.7, 1] (bad) 15 0.9% 43
(1, Inf) (very bad) 3 0.2% 3
R^2 is slightly increased, but the PP check doesn’t change too much. For the PSIS diagnostic, I have more bad values in model 3.
Then it looks like the better the model is, the more bad values there are.
After that, I am ambitious and run a very complex model 4, which might be overfitting and checked the same criteria, and got the following results:
Model4 0.9992851 0.9995073 0.9996659
Computed from 4000 by 1674 loglikelihood matrix
Estimate SE
elpd_loo 4506.7 31.3
p_loo 974.0 11.1
looic 9013.4 62.5

Monte Carlo SE of elpd_loo is NA.
Pareto k diagnostic values:
Count Pct. Min. n_eff
(Inf, 0.5] (good) 86 5.1% 450
(0.5, 0.7] (ok) 718 42.9% 132
(0.7, 1] (bad) 834 49.8% 17
(1, Inf) (very bad) 36 2.2% 4
See help('paretokdiagnostic') for details.
Model 4 has the smallest looic
value and highest R^2, but also has more bad values.
Now I am confused. These model diagnostic tools and selection methods have contradictions to each other. If I only focus on one selection criterion, such R^2 or looic
model 4 is the best. However, PSISk tells me that this model is misspecified as there are more bad values. Well, it is still possible that model 4 is good but very flexible, and PSIS failed on it.
I believe model 2 is better in terms of parsimony and balance among these criteria, but couldn’t convince myself about these bad \hat{k} values and the terrible LOOPIT plot. Might be flexible as well?
Here are my questions:
 Does that mean model 2 is misspecified?
 If model 2 is not misspecified but flexible, because p_loo = 338.9 < p, and p > 1674/5=334.8, how do I explain these bad k values and LOOPIT plot, or just ignore them?
 Does running a longer chain help with reducing bad k values?
 Should I use model 4 and conclude PSISk fail in this scenario?
 kfold CV might be good for comparison, but it does not address the problem.
 Sampling directly from p(\theta^s\mid y_{i}), is there a simple and faster way to do it?
Thank you all for reading my post and providing comments.
R version 3.6.3 (20200229)
Platform: x86_64w64mingw32/x64 (64bit)
Running under: Windows 10 x64 (build 18363)
other attached packages:
[1] rstantools_2.1.1 loo_2.3.1 bayesplot_1.7.2 brms_2.13.5 rstan_2.19.3
28/08 a comment: I probably know why LOO and LOOPIT failed on this problem. That’s because the data is correlated to its neighbouring points. For LOO, the assumption is that the data are independent and each point is not affected by its neighbours. However, for the data and model 2, we assume that it is autocorrelated with parameter \rho. To address this problem, I might have to run a moving window CV, which omits a cluster of correlated data points for testing. This is just my guess. Other comments are welcome. Thank you.