What size/specificity of effects are detected by LOO, WAIC, etc. comparisons?

Sometimes I have experienced that what look like quite robust effects don’t seem to show up as meaningfully different in terms of LOO and WAIC estimates in model comparison (and the effects might also replicate in separate studies I conduct). Is there any information about how precise posterior estimates need to be, or how large effects need to be, in order to show up as reliably different LOO or WAIC values in comparison to a model that does not include the effect of interest? Is this even a meaningful question?

I was thinking that it might be interesting to simulate some data with different sample sizes and underlying effect sizes and compare posterior distributions of effect size estimates with corresponding WAIC/LOO comparisons for models with or without the key comparison of interest included in the regression formula.

Of course I know that WAIC and LOO are not effect size measures, but I would think there should be a correspondence between them: my intuition is that larger and more precisely estimated posterior distributions for effects should make larger WAIC/LOO differences between models with and without the key effect of interest more likely.

Before I do this, is there some reason that this is just a silly/wrong way of thinking about things!?


It is a meaningful question, but there is no generic answer and it’s best to do a simulation (or for some simple models you can do analytical approximations).

Yes. The results depend on your model and signal-to-noise-ratio.

There’s an example of such a simulation in Bayesian data analysis - beta blocker cross-validation demo, which is part of my Model assesment, selection and inference after selection | avehtari.github.io


Thanks @avehtari - indeed I figured there would be no generic answer, but good to know the question is at least sensible! I’ll check out your resources; it is great to find an overview of your approach to look at.

In the next few weeks I will probably make some simulations of simple group comparisons with different effects and post the results here, in case anyone is interested for reference.


In case you are interested @avehtari, I ran some simulations just looking at a comparison between 2 groups, with normally distributed data and equal standard deviations. The true effect size, as Cohen’s d (the difference in means/the pooled standard deviation), varied from 0 to 1 in steps of .25.

I simulated data at each effect size 300 times, and then each data set was split into different sample sizes from 50 to 300 participants per group, in steps of 50, to see how the ability to reach certain conclusions changes with the sample size. I then ran regressions on all the resulting data sets.
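
For reference, the data-generating step described above could be sketched roughly as follows, using only the Python standard library. The function name `simulate_two_groups`, the seed, and the choice to form the smaller samples as the first n observations of each group are assumptions for illustration; the actual simulations may have been set up differently.

```python
import random

def simulate_two_groups(d, n_max, rng):
    """Draw n_max observations per group from unit-variance normals.

    With both standard deviations equal to 1, the difference in means
    equals the true Cohen's d.
    """
    control = [rng.gauss(0.0, 1.0) for _ in range(n_max)]
    treated = [rng.gauss(d, 1.0) for _ in range(n_max)]
    return control, treated

rng = random.Random(2024)  # arbitrary seed, for reproducibility only
effect_sizes = [0.0, 0.25, 0.5, 0.75, 1.0]
sample_sizes = list(range(50, 301, 50))  # 50, 100, ..., 300 per group

# One replicate per effect size here; the post describes 300 replicates.
datasets = {}
for d in effect_sizes:
    control, treated = simulate_two_groups(d, max(sample_sizes), rng)
    for n in sample_sizes:
        # Nested subsets: the first n participants of each group (assumption).
        datasets[(d, n)] = (control[:n], treated[:n])
```

Each entry of `datasets` can then be passed to whatever regression fitting routine is used, once with and once without group as a predictor.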

For loo comparisons, I simply compared 2 models - estimating the outcomes with vs. without group as a variable in the regression formulae.

I thought it might be informative to consider how often one reaches different decision criteria for deciding that you might have a difference between groups, basically like a power analysis. Of course, a whole range of criteria could be used so I just chose some examples of what might be done.

I see it written a fair bit that you want the difference in ELPD to be at least about 3x the SE of the difference. I know there is no single number for any such conclusion, but I just take this as a starting point for the examples. Plot 1 here shows the ELPD difference divided by the SE of the difference, for each effect size and sample size (each color in each cluster of dots goes from the smallest to largest sample size, from left to right). You can see that the LOO assessments can more confidently identify the added value of including group in the regression with greater sample sizes and larger underlying effects.
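
To make the ratio concrete, here is a minimal sketch of how the ELPD difference and its SE can be computed from pointwise log predictive densities. Note the swapped-in technique: to stay self-contained, it uses brute-force leave-one-out with plug-in maximum-likelihood estimates, not Bayesian PSIS-LOO on posterior draws, so the numbers will only approximate what a Bayesian LOO comparison would report. The function names, the seed, and the chosen d and n are hypothetical. The SE is taken as the square root of n times the sample variance of the pointwise differences, which is the usual paired-difference estimate.

```python
import math
import random

def normal_logpdf(y, mu, sigma):
    """Log density of Normal(mu, sigma) at y."""
    z = (y - mu) / sigma
    return -0.5 * math.log(2.0 * math.pi) - math.log(sigma) - 0.5 * z * z

def plugin_loo(data, use_group):
    """Pointwise leave-one-out log predictive densities (plug-in, not PSIS-LOO).

    data is a list of (value, group_label) pairs with labels 0 or 1.
    """
    out = []
    for i, (y, g) in enumerate(data):
        train = data[:i] + data[i + 1:]  # refit without observation i
        if use_group:
            means = {}
            for label in (0, 1):
                vals = [v for v, lab in train if lab == label]
                means[label] = sum(vals) / len(vals)
            resid = [v - means[lab] for v, lab in train]
            mu = means[g]
        else:
            mu = sum(v for v, _ in train) / len(train)
            resid = [v - mu for v, _ in train]
        sigma = math.sqrt(sum(r * r for r in resid) / len(resid))
        out.append(normal_logpdf(y, mu, sigma))
    return out

def elpd_diff_summary(lpd_a, lpd_b):
    """Return (elpd_diff, se_diff, ratio) for model A minus model B."""
    diffs = [a - b for a, b in zip(lpd_a, lpd_b)]
    n = len(diffs)
    elpd_diff = sum(diffs)
    mean = elpd_diff / n
    var = sum((x - mean) ** 2 for x in diffs) / (n - 1)
    se_diff = math.sqrt(n * var)
    return elpd_diff, se_diff, elpd_diff / se_diff

rng = random.Random(1)  # arbitrary seed
n_per_group, d = 100, 1.0
data = [(rng.gauss(0.0, 1.0), 0) for _ in range(n_per_group)]
data += [(rng.gauss(d, 1.0), 1) for _ in range(n_per_group)]

elpd_diff, se_diff, ratio = elpd_diff_summary(
    plugin_loo(data, use_group=True), plugin_loo(data, use_group=False)
)
print(f"elpd_diff={elpd_diff:.1f}, se_diff={se_diff:.1f}, ratio={ratio:.2f}")
```

The 3x rule of thumb then amounts to checking whether `ratio` exceeds 3, and repeating this over many simulated data sets gives the proportion of times that threshold is reached.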

We can also plot this like a power analysis where we assess the proportion of times across effect sizes and sample sizes that we reach the conclusion that group adds predictive value, and that the ELPD diff/SE of the diff exceeds 3:

We can also compare this to a simple metric, such as assessing the proportion of times the lower bound of the HDI for the effect size exceeds 0 (less stringent than a ROPE at some smallest effect size considered meaningful):

In this circumstance and with the decision criteria set up as I have described them (which is by no means the only or best option), we see that both the LOO approach and the HDI-excluding-0 approach are sensitive to both the sample size and effect size, as would be expected. The LOO decision criterion seems more conservative in that it generally has lower power to reach a conclusion about there being a positive difference between groups, but this also means there are fewer false positives than with the HDI > 0 approach. From the Plot 1 ‘null.better’ panel you can see that the LOO criterion also quite confidently rules out the added value of the group parameter when the effect size is 0, but there could be a fair number of false negatives when the effect is real but small.
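
The HDI-based criterion used in this comparison can be sketched as follows. This is a minimal illustration assuming posterior draws of the effect size are available as a plain list; the draws below are stand-ins generated from a normal distribution, not output from a fitted model, and the function name `hdi` and the 95% mass are assumptions.

```python
import math
import random

def hdi(draws, mass=0.95):
    """Narrowest interval containing roughly `mass` of the draws."""
    s = sorted(draws)
    n = len(s)
    k = math.ceil(mass * n)  # number of draws the interval must contain
    # Slide a window of k consecutive sorted draws and keep the narrowest.
    best = min(range(n - k + 1), key=lambda i: s[i + k - 1] - s[i])
    return s[best], s[best + k - 1]

rng = random.Random(7)  # arbitrary seed
# Hypothetical posterior draws for the effect size (stand-in values).
effect_draws = [rng.gauss(0.5, 0.1) for _ in range(4000)]
null_draws = [rng.gauss(0.0, 1.0) for _ in range(4000)]

lower, upper = hdi(effect_draws)
null_lower, null_upper = hdi(null_draws)

# Decision rule from the post: the lower bound of the HDI exceeds 0.
effect_detected = lower > 0
null_detected = null_lower > 0
```

Applying the same rule to every simulated data set and averaging `effect_detected` within each effect size and sample size gives the power-style proportions discussed above.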


That’s old news. See CV-FAQ #15, paper Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison and video.

This is partially because it focuses on predictive performance and not on the latent parameter, and partially because it makes only a very weak assumption about the future data distribution. Alternative approaches are less conservative if you are willing to make more assumptions and focus on the latent parameter (which may be difficult depending on the posterior correlations).

Anyway, it’s good to run these kinds of simulations specifically for the models and types of data you expect to have, to calibrate your expectations of what can be done.


Thanks for these links, I’ll check them out.

Haha I thought that might be the case - just wanted to do a quick check to confirm that the general power approach I was trying would work at all!

Do you mean alternative cross-validation approaches, or different approaches like the posterior distribution for parameters and so on?