High Pareto-k values for the same observations across different models: Can I still use loo to compare these models?


The PSIS approximation to LOO in the loo R package flags the observations where the approximation fails. Are models still comparable if they fail in “the same” way?

I have a moderately large panel data set (cross-section, time series), and I use approximate LOO to compare different model specifications. I don’t want to make predictions for t+1 or predict a completely new cross-section; I just want to assess and compare different models, so I think using LOO is OK for what I want. (Thank you, @avehtari, for the great tutorial in Helsinki. That made some of this very clear to me.) Other leave-something-out and CV approaches would probably also be computationally infeasible in my case…

Usually I get 1.5% to 4% “bad” or “very bad” Pareto k diagnostic values for reasonably complex models (“very bad” values make up about a tenth of the Pareto k values above 0.7). I remembered Jonah Gabry saying at StanCon that Ben Goodrich found that high Pareto k values can sometimes indicate errors in the data, so I checked which observations have bad Pareto k values. At first glance, all of the compared models fail in more or less the same way (depending a bit on their complexity). It’s very plausible that my data are just very weird in places.
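For concreteness, the check of whether the models fail on “the same” observations could be sketched roughly as below. This is only an illustrative Python sketch, not the loo package’s API: the helper names `flagged` and `overlap_summary` are hypothetical, the k values are made up, and in practice the per-observation Pareto k values would be extracted from each model’s PSIS-LOO result. The 0.7 threshold is the one used above.

```python
# Hypothetical helpers (not part of any library): compare which observations
# are flagged by the Pareto k diagnostic in two models.

def flagged(k_values, threshold=0.7):
    """Return the set of 0-based observation indices with Pareto k > threshold."""
    return {i for i, k in enumerate(k_values) if k > threshold}


def overlap_summary(k_a, k_b, threshold=0.7):
    """Summarise how much the flagged observations of two models overlap."""
    a = flagged(k_a, threshold)
    b = flagged(k_b, threshold)
    union = a | b
    return {
        "both": sorted(a & b),      # flagged in both models
        "only_a": sorted(a - b),    # flagged only in model A
        "only_b": sorted(b - a),    # flagged only in model B
        # Jaccard index: 1.0 means identical flagged sets
        "jaccard": len(a & b) / len(union) if union else 1.0,
    }


# Made-up Pareto k values for six observations, for illustration only.
k_model_a = [0.10, 0.82, 0.35, 1.20, 0.55, 0.71]
k_model_b = [0.15, 0.90, 0.75, 1.05, 0.40, 0.30]

summary = overlap_summary(k_model_a, k_model_b)
print(summary)
```

A Jaccard index near 1 would mean the models flag nearly the same observations. That only describes where the approximation fails; it does not by itself say whether the comparison is trustworthy.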

I know that the best way to deal with this would be to somehow model the issues in my data, and I certainly will come back to that problem. But I’d love some input to know if I can proceed for now.

I’ve tried tighter (and more informative) priors and more post-warmup iterations (with thinning, because the stanfit objects get too big otherwise). This helps a bit, but not to the point where all Pareto k values are below 0.7.

Many thanks in advance!

No. If some Pareto k values are “very bad”, then the ELPD estimator does not have a finite variance. So, the difference between two of them is unlikely to have a finite variance either.
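As a rough sketch of the reasoning behind this (my reading, stated as an assumption rather than as part of the answer above): the diagnostic k is the estimated shape parameter of a generalized Pareto distribution fitted to the upper tail of the importance ratios $r$. For such a tail with $k > 0$, moments exist only up to order $1/k$:

```latex
\mathbb{E}\left[ r^{m} \right] < \infty
\quad \Longleftrightarrow \quad
m < \frac{1}{k}, \qquad k > 0 .
```

So the raw importance ratios have a finite variance ($m = 2$) only when $k < 1/2$ and a finite mean ($m = 1$) only when $k < 1$. The “very bad” category corresponds to $k > 1$, where not even the mean of the ratios exists.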

Thank you, @bgoodri!