I’ve coded eight closely related non-linear hierarchical models with weak priors in Stan, fitted them to two similar data sets, and calculated each model’s WAIC.

For both data sets, the ordering of the models by WAIC was the same. But I also got warning messages from the loo package that 95.0% of the p_waic estimates were greater than 0.4. I tried PSIS-LOO instead, but nearly all Pareto k diagnostic values were bad or very bad.
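For context on what that warning measures: the pointwise p_waic for observation i is the posterior variance of its log-likelihood, and values above 0.4 indicate that the variance-based correction is unreliable for that observation. The loo package itself is R; the following is just a stdlib-Python sketch of the computation, with made-up simulated draws standing in for a real log-likelihood matrix:

```python
import math
import random

def pointwise_waic(log_lik):
    """log_lik: S posterior draws x n observations of pointwise log-likelihoods."""
    S = len(log_lik)       # number of posterior draws
    n = len(log_lik[0])    # number of observations
    lppd_i, p_waic_i = [], []
    for i in range(n):
        col = [log_lik[s][i] for s in range(S)]
        # lppd_i: log of the posterior-mean likelihood (stable log-mean-exp)
        m = max(col)
        lppd_i.append(m + math.log(sum(math.exp(c - m) for c in col) / S))
        # p_waic_i: posterior variance of the log-likelihood for observation i
        mean = sum(col) / S
        p_waic_i.append(sum((c - mean) ** 2 for c in col) / (S - 1))
    elpd_waic = sum(lppd_i) - sum(p_waic_i)
    return elpd_waic, p_waic_i

random.seed(1)
# fake draws: 200 posterior draws, 10 observations
ll = [[-1.0 + 0.1 * random.gauss(0, 1) for _ in range(10)] for _ in range(200)]
elpd, p_i = pointwise_waic(ll)
flagged = sum(p > 0.4 for p in p_i)  # loo warns when this fraction is large
```

With well-behaved draws like these, no observation is flagged; a warning about 95% of observations means the log-likelihood of almost every point varies a great deal across the posterior.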

The answer to this question might be obvious, but I’ll ask it anyway. Can I trust the WAIC values for model comparison, given that the models are closely related and the rankings were practically the same for both data sets, despite the warning messages?

I should also mention that I calculated WAIC/PSIS-LOO for a similar model fitted to another data set, got the same warning messages, then performed an actual 12-fold cross-validation, and got essentially the same results from all three techniques.
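For concreteness, the k-fold comparison amounts to partitioning the observations, refitting on each training set, and summing the held-out log predictive densities. A minimal stdlib-Python sketch of the fold construction (the model refits themselves are omitted, since they depend on the Stan program):

```python
import random

def kfold_indices(n, k, seed=0):
    """Randomly partition n observation indices into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(120, 12)
# For each fold: refit the model on the other 11 folds, then sum the
# log predictive density of the held-out fold; the total is elpd_kfold.
```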

I don’t know of any circumstances where you can trust WAIC over PSIS-LOO. And the Pareto k warnings are telling you not to trust PSIS-LOO. It is unusual (and not good) to get that warning for almost every observation. Do you have observation-specific parameters?

I have seen models with observation-specific parameters produce many Pareto k warnings even though actually leaving one observation out has little effect on the other observations’ parameters or on the common parameters. Still, PSIS-LOO is telling you that leaving out any observation would have a non-negligible effect on the posterior distribution, particularly on the margins of the posterior that pertain specifically to the left-out observation. So I am not sure how much PSIS-LOO (or the worse information criteria) can help you compare these models.
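To make concrete why observation-specific parameters cause this: PSIS-LOO approximates each leave-one-out posterior by reweighting the full-data draws with importance ratios proportional to 1 / p(y_i | θ^(s)). When one observation strongly informs the posterior, a few draws dominate those ratios, which is exactly the heavy right tail that the Pareto k statistic flags. A stdlib-Python sketch of the raw (unsmoothed) ratios, with made-up log-likelihood draws:

```python
import math

def loo_weights(log_lik_i):
    """Normalized raw importance ratios for leaving out one observation.

    log_lik_i: posterior draws of log p(y_i | theta).
    A single weight near 1 signals a heavy tail (high Pareto k).
    """
    # r^(s) = 1 / p(y_i | theta^(s)), computed stably on the log scale
    log_r = [-ll for ll in log_lik_i]
    m = max(log_r)
    w = [math.exp(lr - m) for lr in log_r]
    total = sum(w)
    return [x / total for x in w]

# A well-behaved observation: log-likelihood varies little across draws,
# so the weights are nearly uniform
stable = loo_weights([-1.0, -1.1, -0.9, -1.05, -0.95])
# An influential observation: one draw fits it far worse than the rest,
# so a single importance ratio dominates (heavy tail, high Pareto k)
unstable = loo_weights([-0.5, -0.6, -0.4, -0.55, -12.0])
```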

Keep in mind that PSIS-LOO, WAIC, DIC, AIC, etc. are all approximations to a fundamentally incomputable quantity that exactly measures predictive accuracy, and each of those approximations requires different assumptions to be reasonably accurate.
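Concretely, the common target being approximated is the expected log pointwise predictive density for new data $\tilde{y}$, which involves the unknown true data-generating distribution $p_t$ and so can never be computed exactly:

$$
\mathrm{elpd} = \sum_{i=1}^{n} \int p_t(\tilde{y}_i)\, \log p(\tilde{y}_i \mid y)\, d\tilde{y}_i
$$

PSIS-LOO, WAIC, and the rest estimate this from the observed data alone, each with its own bias and failure modes.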

A common necessary assumption is that the data are roughly IID, so that, for example, looking at a subset of the data gives a reasonable quantification of what a draw from the full data would do. In many models, however, such as the one Ben mentions, this assumption is not valid, and all of the predictive performance estimators should be suspect.

One of the great features of PSIS-LOO is that it has a self-diagnostic that can identify the failure of some of these assumptions, which makes it much more useful in practice than many of the other estimators. And in practice, if we can’t trust PSIS-LOO then we probably can’t trust any of the others, like WAIC.
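That self-diagnostic is the estimated Pareto shape k of the tail of the importance ratios. The cutoffs below are the conventional ones reported by the loo package (newer versions adjust the threshold with sample size, so treat these as a rough guide); a stdlib-Python sketch of the classification:

```python
def pareto_k_category(k):
    """Classify a Pareto k diagnostic using the conventional cutoffs."""
    if k <= 0.5:
        return "good"       # PSIS estimate reliable
    elif k <= 0.7:
        return "ok"         # usable, but less accurate
    elif k <= 1.0:
        return "bad"        # PSIS estimate unreliable for this point
    else:
        return "very bad"   # importance ratios have infinite variance

ks = [0.3, 0.65, 0.9, 1.4]  # made-up diagnostic values for illustration
summary = {cat: sum(pareto_k_category(k) == cat for k in ks)
           for cat in ("good", "ok", "bad", "very bad")}
```

In the situation described in the question, nearly every observation lands in "bad" or "very bad", which is the diagnostic saying that neither PSIS-LOO nor WAIC should be trusted.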