WAIC unreliable?

CFS · May 23, 2017, 2:55pm

I’ve coded eight closely related non-linear hierarchical models with weak priors in Stan, fitted them to two similar data sets, and calculated each model’s WAIC.

For both data sets, the order of the models sorted by WAIC was the same. But I also got warning messages from the loo package that there were 95.0% p_waic estimates greater than 0.4. I tried PSIS-LOO, but nearly all Pareto k diagnostic values were bad or very bad.
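For readers unfamiliar with the warning: p_waic is estimated pointwise as the variance of each observation's log-likelihood across posterior draws, and the loo package flags observations where that variance exceeds 0.4. Below is a minimal sketch of that calculation in plain Python, not the loo implementation itself. The function names and the nested-list `log_lik[s][i]` layout (draw `s`, observation `i`) are assumptions made for illustration.

```python
import math
from statistics import variance


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def waic_summary(log_lik, threshold=0.4):
    """Summarise WAIC from log_lik[s][i] (posterior draw s, observation i).

    The 0.4 default mirrors the warning quoted in the post; the layout of
    log_lik and all names here are illustrative assumptions.
    """
    n_obs = len(log_lik[0])
    elpd = 0.0
    p_waic = 0.0
    n_high = 0
    for i in range(n_obs):
        column = [draw[i] for draw in log_lik]
        lppd_i = log_mean_exp(column)   # log pointwise predictive density
        p_waic_i = variance(column)     # sample variance across draws
        elpd += lppd_i - p_waic_i
        p_waic += p_waic_i
        n_high += p_waic_i > threshold
    return {
        "elpd_waic": elpd,
        "p_waic": p_waic,
        "waic": -2.0 * elpd,
        "frac_high": n_high / n_obs,
    }


# Toy input: two draws, two observations (made-up numbers).
toy = [[-1.0, -1.0], [-1.0, -3.0]]
result = waic_summary(toy)
```

A large `frac_high` means the variance-based penalty is unstable for most observations, which is the situation the warning describes.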

The answer to this question might be obvious, but I’ll ask it anyway. Can I trust the WAIC values for model evaluation, considering that the models are closely related and the results were practically the same for both data sets, despite the warning messages?

I should also say that I tried to calculate the WAIC/PSIS-LOO of a similar model fitted to another data set, got the same warning messages, performed an actual 12-fold cross-validation, and got basically the same results from each technique.

bgoodri · May 23, 2017, 3:16pm

I don’t know of any circumstances where you can trust WAIC over PSIS-LOO. And the Pareto K warnings are telling you to not trust PSIS-LOO. It is unusual (and not good) to get that warning for almost every observation. Do you have observation-specific parameters?

CFS · May 23, 2017, 3:22pm

Yes, several parameters for each observation.

bgoodri · May 23, 2017, 3:29pm

I have seen models with observation-specific parameters give many Pareto K warnings, and then when you actually leave one observation out, it doesn’t have much of an effect on anyone else’s parameters or the common parameters. Still, PSIS-LOO is telling you that if you were to leave out any observation, it would have a non-negligible effect on the posterior distribution, particularly the margins of the posterior distribution that pertain specifically to the left out observation. So, I am not sure how much PSIS-LOO (or worse information criteria) can help you compare these models.

betanalpha · May 23, 2017, 7:19pm

Keep in mind that PSIS-LOO, WAIC, DIC, AIC, etc. are all approximations to a fundamentally incalculable number that exactly quantifies predictive accuracy. And each of those approximations requires different assumptions to be reasonably accurate.

A common assumption that’s necessary is that the data are roughly IID so that, for example, looking at a subset of the data gives a reasonable quantification of what a draw from the full data would do. In many models, however, such as the one that Ben mentions, this assumption is not valid and all of the predictive performance estimators should be suspect.

One of the great features of PSIS-LOO is that it has a self-diagnostic that can identify the failure of some of these assumptions, which makes it much more useful in practice than many of the other estimators. And in practice if we can’t trust PSIS-LOO then we probably can’t trust any of the others like WAIC.
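To make the self-diagnostic concrete, here is a small sketch that bins already-computed Pareto k estimates into the "good / ok / bad / very bad" labels mentioned earlier in the thread. The cutoffs 0.5, 0.7 and 1 are the fixed thresholds the loo package reported at the time, taken here as an assumption; newer loo releases use a threshold that depends on the number of posterior draws. The function names and the example k values are hypothetical.

```python
from collections import Counter


def classify_pareto_k(k):
    """Label one Pareto k estimate using fixed cutoffs (assumed: 0.5, 0.7, 1)."""
    if k <= 0.5:
        return "good"
    if k <= 0.7:
        return "ok"
    if k <= 1.0:
        return "bad"
    return "very bad"


def summarise_pareto_k(k_values):
    """Count how many observations fall into each diagnostic category."""
    counts = Counter(classify_pareto_k(k) for k in k_values)
    return {label: counts.get(label, 0)
            for label in ("good", "ok", "bad", "very bad")}


# Made-up k values for illustration only.
example_k = [0.3, 0.6, 0.9, 1.4, 1.1]
summary = summarise_pareto_k(example_k)
```

When most observations land in the last two bins, as reported at the top of the thread, the importance-sampling approximation behind PSIS-LOO is not reliable for those observations and refitting (exact leave-one-out or K-fold cross-validation) is the safer check.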
