Hi all, I am writing to ask a general question about Bayesian model comparison.
In general, we may use the log pseudo marginal likelihood (LPML) and WAIC as criteria to evaluate and compare goodness of fit. However, I have run into a problem in both simulation studies and real data analysis.
In simulation studies, when I checked the L_2 distance between the posterior predictive distributions and the true distributions, I found that Model A had a lower deviation, yet both LPML and WAIC favored the competitor, Model B. So which model is "better" on this simulated data set?
Similarly, in a real-data classification example, Model C achieves a higher AUC while both LPML and WAIC favor the competitor, Model D.
How should I judge which model is more suitable for this kind of data?
Note: all LPML and WAIC values are computed with the loo package.
@avehtari should be able to provide a qualitative assessment of the pros and cons in this case.
They estimate the same criterion (the expected log pointwise predictive density). The PSIS-LOO computation is in general more accurate and has better diagnostics than WAIC, so computing both of them doesn't provide any more information than computing just elpd_loo with PSIS-LOO.
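To see that WAIC and LOO target the same quantity, here is a minimal numpy sketch (not the loo package's implementation). The toy normal model and the log-likelihood matrix `ll` are invented for illustration, and the LOO estimate uses plain importance sampling with weights 1/p(y_i|theta_s) rather than the Pareto smoothing that PSIS-LOO adds:

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)

# Toy setup: n observations, S posterior draws of a normal mean.
n, S = 50, 4000
y = rng.normal(0.0, 1.0, size=n)
mu_draws = rng.normal(y.mean(), 1.0 / np.sqrt(n), size=S)

# Pointwise log-likelihood matrix, shape (S, n): ll[s, i] = log p(y_i | theta_s).
ll = -0.5 * np.log(2 * np.pi) - 0.5 * (y[None, :] - mu_draws[:, None]) ** 2

# In-sample log pointwise predictive density.
lpd = logsumexp(ll, axis=0) - np.log(S)

# WAIC: lpd minus the effective-parameter penalty (posterior variance of ll).
elpd_waic = np.sum(lpd - ll.var(axis=0, ddof=1))

# Naive importance-sampling LOO with weights 1 / p(y_i | theta_s);
# PSIS-LOO stabilizes these weights by Pareto smoothing.
elpd_loo = np.sum(-(logsumexp(-ll, axis=0) - np.log(S)))

# Both penalize relative to the in-sample lpd and estimate the same elpd.
print(elpd_waic, elpd_loo, lpd.sum())
```

Both estimates come out slightly below the in-sample lpd and close to each other; the practical difference is in the accuracy and the diagnostics of the weight smoothing, not in what is being estimated.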
In general, there is no theoretical reason that two different utilities/losses would provide the same ranking. You don't provide enough information to say much more, but 1) the L2 distance is less sensitive to differences in the tails (which is not good if you care about the tails), and 2) the differences can be so small that the difference in ranking doesn't matter.
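A small numeric illustration of point 1 (the densities and parameter values below are made up for the example): take a "true" N(0, 1) density and compare a mean-shifted approximation N(0.5, 1) against a thin-tailed approximation N(0, 0.7^2). The L2 distance ranks the thin-tailed one as closer, while the KL divergence (the expected log-score loss) ranks the shifted one as better, because the log score punishes bad tails:

```python
import numpy as np

# Fine uniform grid for simple numerical integration.
x = np.linspace(-12.0, 12.0, 200001)
dx = x[1] - x[0]

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

p = normal_pdf(x, 0.0, 1.0)        # "true" density
q_shift = normal_pdf(x, 0.5, 1.0)  # shifted mean, correct tails
q_thin = normal_pdf(x, 0.0, 0.7)   # correct mean, too-thin tails

def l2(p, q):
    # L2 distance between two densities.
    return np.sqrt(np.sum((p - q) ** 2) * dx)

def kl(p, q):
    # KL(p || q) = E_p[log p - log q]: the expected log-score loss of q.
    return np.sum(p * (np.log(p) - np.log(q))) * dx

print(l2(p, q_shift), l2(p, q_thin))  # L2 ranks the thin-tailed q as closer
print(kl(p, q_shift), kl(p, q_thin))  # log score ranks the shifted q as better
```

The two criteria disagree on the ranking here by construction: the thin-tailed density is close to the true one in squared-error terms near the mode, but its log density is badly wrong far out in the tails, and KL weights that heavily.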
AUC considers only the ranking of the predicted probabilities and doesn't care at all whether those probabilities themselves are well calibrated, while the log score penalizes heavily if the predicted probabilities are overconfident.
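A toy sketch of this point (the labels and probability vectors are invented for illustration): two models whose predicted probabilities have the identical ranking, so their AUCs are exactly equal, but the overconfident one is punished much harder by the log score on the one misordered observation:

```python
import math
from itertools import product

# Six observations; the label at index 3 breaks a perfect ordering.
y = [0, 0, 1, 0, 1, 1]
p_calibrated = [0.10, 0.30, 0.55, 0.60, 0.80, 0.90]
p_overconfident = [0.001, 0.010, 0.950, 0.970, 0.990, 0.999]

def auc(y, p):
    # Fraction of (positive, negative) pairs ranked correctly.
    pos = [p[i] for i in range(len(y)) if y[i] == 1]
    neg = [p[i] for i in range(len(y)) if y[i] == 0]
    wins = sum((pi > ni) + 0.5 * (pi == ni) for pi, ni in product(pos, neg))
    return wins / (len(pos) * len(neg))

def mean_log_score(y, p):
    # Average log predictive density of the observed labels (higher is better).
    return sum(math.log(pi if yi == 1 else 1 - pi)
               for yi, pi in zip(y, p)) / len(y)

print(auc(y, p_calibrated), auc(y, p_overconfident))  # identical AUC
print(mean_log_score(y, p_calibrated),
      mean_log_score(y, p_overconfident))             # overconfident is worse
```

The overconfident model assigns probability 0.97 to the negative observation at index 3, so its log score takes a roughly -log(0.03) hit there, while AUC cannot distinguish the two models at all.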
If you care about the whole distribution, then the log score is the right choice (see, e.g., Bernardo & Smith, 1994). If, in addition to modeling the (conditional) distribution of the data, you also have some decision task or need to act based on the model, then you can consider other utilities/losses. This can be useful when your model is misspecified (so that the log score says it's bad), but that misspecification doesn't influence your task-specific utility/loss. AUC is a strange one, as it corresponds to an average over different ratios of losses for right and wrong decisions, which doesn't correspond to any real-life binary prediction/action task. If you tell me more about your application and what kind of decisions will be made using the models, I can comment more on useful utilities/losses.