In case you are interested @avehtari, I ran some simulations looking at a simple comparison between 2 groups, with normally distributed data and equal standard deviations. The true effect size, as Cohen’s d (the difference in means divided by the pooled standard deviation), varied from 0 to 1 in steps of .25.

I simulated data at each effect size 300 times, and then each data set was split into different sample sizes from 50 to 300 participants per group, in steps of 50, to see how the ability to reach certain conclusions changes with the sample size. I then ran regressions on all the resulting data sets.
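For anyone who wants to play with this setup, here is a rough Python sketch of the data-generating step (not the code I actually ran). It assumes a group SD of 1, so the mean difference equals d, and it assumes that "split" means reusing the first n participants per group from the largest data set. The helper names `simulate_groups` and `cohens_d` are made up for illustration.

```python
import math
import random
import statistics

def simulate_groups(d, n_per_group, rng):
    """Draw two normal groups with SD 1 whose true means differ by d."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treated = [rng.gauss(d, 1.0) for _ in range(n_per_group)]
    return control, treated

def cohens_d(control, treated):
    """Difference in means divided by the pooled standard deviation."""
    n1, n2 = len(control), len(treated)
    v1, v2 = statistics.variance(control), statistics.variance(treated)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (statistics.mean(treated) - statistics.mean(control)) / pooled_sd

rng = random.Random(1)
effect_sizes = [0.0, 0.25, 0.5, 0.75, 1.0]   # true Cohen's d values
sample_sizes = range(50, 301, 50)            # participants per group

# One simulated data set at d = 0.5, generated at the largest sample size.
control, treated = simulate_groups(0.5, max(sample_sizes), rng)

# Assumed reading of "split": reuse the first n participants per group.
subsets = {n: (control[:n], treated[:n]) for n in sample_sizes}
estimates = {n: cohens_d(c, t) for n, (c, t) in subsets.items()}
```

In the full design this would be repeated 300 times for each value in `effect_sizes`.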

For loo comparisons, I simply compared 2 models - estimating the outcomes with vs. without group as a variable in the regression formulae.

I thought it might be informative to consider how often different decision criteria lead you to conclude that there is a difference between groups, basically like a power analysis. Of course, a whole range of criteria could be used, so I just chose some examples of what might be done.

I see it written a fair bit that you want the difference in ELPD to be at least about 3x the SE of the difference. I know there is no single number for any such conclusion, but I just take this as a starting point for the examples. Plot 1 here shows the ELPD difference divided by the SE of the difference, for each effect size and sample size (within each cluster of dots, the colors go from the smallest to the largest sample size, from left to right). You can see that the LOO assessments more confidently identify the added value of including group in the regression with greater sample sizes and larger underlying effects.
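As a sketch of the quantity on the y-axis: given pointwise ELPD values for the two models (e.g. as returned by a LOO routine), the difference is the sum of the pointwise differences and its SE is sqrt(n) times their standard deviation. The function name and the toy numbers below are made up for illustration, and the sign convention (model with group minus model without) is an assumption.

```python
import math
import statistics

def elpd_diff_ratio(pointwise_full, pointwise_null):
    """Return (elpd_diff, se_diff, ratio) from pointwise ELPD values.

    Positive values favour the model that includes group.
    """
    diffs = [f - n for f, n in zip(pointwise_full, pointwise_null)]
    elpd_diff = sum(diffs)
    # SE of the summed difference: sqrt(n) * sample SD of pointwise differences.
    se_diff = math.sqrt(len(diffs)) * statistics.stdev(diffs)
    return elpd_diff, se_diff, elpd_diff / se_diff

# Toy pointwise values, purely illustrative.
diff, se, ratio = elpd_diff_ratio([-1.0, -1.0, -1.0], [-2.0, -3.0, -4.0])
meets_criterion = ratio > 3
```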

We can also plot this like a power analysis where we assess the proportion of times across effect sizes and sample sizes that we reach the conclusion that group adds predictive value, and that the ELPD diff/SE of the diff exceeds 3:
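The "power" here is just the proportion of simulated data sets in each effect size by sample size cell whose ratio exceeds the threshold. A minimal sketch, assuming the results are stored as `(effect_size, n_per_group, ratio)` tuples (a made-up layout, with made-up numbers):

```python
from collections import defaultdict

def power_by_condition(results, threshold=3.0):
    """Proportion of runs per (effect_size, n) cell with ratio > threshold."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for effect_size, n_per_group, ratio in results:
        key = (effect_size, n_per_group)
        totals[key] += 1
        hits[key] += ratio > threshold  # True counts as 1
    return {key: hits[key] / totals[key] for key in totals}

# Illustrative values only, not my simulation output.
results = [
    (0.5, 100, 3.5), (0.5, 100, 2.0), (0.5, 100, 4.1), (0.5, 100, 1.0),
    (0.0, 100, -0.5),
]
power = power_by_condition(results)
```

Because the ratio is signed (model with group minus model without), a large negative ratio does not count as a hit.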

We can also compare this to a simple metric, such as assessing the proportion of times the lower bound of the HDI for the effect size exceeds 0 (less stringent than a ROPE at some smallest effect size considered meaningful):
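For the HDI criterion, one common way to get the interval from posterior draws is to take the narrowest window that contains the requested share of the sorted draws. The sketch below does that with the standard library only. The `hdi` helper is hypothetical, and the random draws are stand-ins for real posterior draws of the effect size.

```python
import math
import random

def hdi(draws, prob=0.95):
    """Narrowest interval covering about `prob` of the draws (needs >= 2 draws)."""
    x = sorted(draws)
    n = len(x)
    # Number of index steps between the two endpoints of each candidate window.
    span = min(math.floor(prob * n), n - 1)
    start = min(range(n - span), key=lambda i: x[i + span] - x[i])
    return x[start], x[start + span]

rng = random.Random(2)
# Stand-in "posterior" draws: one clearly positive effect, one centred on 0.
positive_effect = [rng.gauss(0.5, 0.1) for _ in range(4000)]
null_effect = [rng.gauss(0.0, 0.1) for _ in range(4000)]

lower_pos, upper_pos = hdi(positive_effect)
lower_null, upper_null = hdi(null_effect)
concludes_positive = lower_pos > 0      # criterion met
concludes_null_case = lower_null > 0    # criterion not met
```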

In this circumstance and with the decision criteria set up as I have described them (which is by no means the only or best option), we see that both the LOO approach and the HDI-excluding-0 approach are sensitive to both the sample size and the effect size, as would be expected. The LOO decision criterion seems more conservative in that it generally has lower power to reach a conclusion about there being a positive difference between groups, but this also means there are fewer false positives than with the HDI > 0 approach. From the Figure 1 ‘null.better’ panel you can see that the LOO criterion also quite confidently rules out the added value of the group parameter when the effect size is 0, but there could be a fair number of false negatives when the effect is real but small.