In this case, m1 is actually the better model, and what you say is correct: it outperforms m2 by more than 3 times the se_diff, if that is the criterion you set for calling a model “better”. That the numbers are small does not matter, because the absolute values of elpd are not meaningful in themselves; similarly, we cannot judge whether a difference is “small” or “large” merely by looking at elpd_diff. This is why we need to consider se_diff, just as you already suggested.
EDIT: for people checking up on this in the future, @avehtari pointed out (below) that this difference can in fact be considered insignificant because the numbers are small, and that absolute numbers can be interpreted. I am leaving the point above as is so as not to break the conversation flow, but what I said there was not 100% correct.
Yes, this is correct.
Yes, your interpretation is correct again. Did you maybe fit different models / data here? An elpd_diff this small combined with an se_diff this large could indicate that the two models fit the data equally well, and that the absolute values of elpd are large, so their standard errors are large and hence so is se_diff. Again, this is not meaningful by itself, as the absolute value of elpd does not tell us much without comparing it to something.
If you haven’t done so already, consider having a look at the loo glossary for more information about elpd.
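To make the relationship between elpd_diff and se_diff more concrete, here is a minimal base-R sketch of how the two quantities follow from pointwise elpd values. The data are simulated and all names (`difficulty`, `elpd_m1`, `elpd_m2`, `compare_pointwise`) are hypothetical; the formula mirrors, as far as I know, what the loo package does (sum of pointwise differences, and sqrt(n) times their standard deviation), so treat it as an illustration rather than a replacement for `loo_compare()`.

```r
# Simulated (hypothetical) pointwise elpd values for two models
# evaluated on the same 200 observations.
set.seed(1)
n <- 200
difficulty <- rnorm(n, mean = -3, sd = 2)    # component shared by both models
elpd_m1 <- difficulty + rnorm(n, sd = 0.5)   # model-specific variation
elpd_m2 <- difficulty + rnorm(n, sd = 0.5)   # same expected fit as m1

# elpd_diff: sum of pointwise differences
# se_diff:   sqrt(n) * sd of the pointwise differences
compare_pointwise <- function(a, b) {
  d <- a - b
  c(elpd_diff = sum(d), se_diff = sqrt(length(d)) * sd(d))
}

res <- compare_pointwise(elpd_m1, elpd_m2)
print(res)

# Ratio people often look at (e.g. "more than 3 times the se_diff")
print(abs(res[["elpd_diff"]]) / res[["se_diff"]])
```

Because both simulated models have the same expected fit, elpd_diff here is just noise on the scale of se_diff, which is the "two models fit equally well" situation described above.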
Sorry, I was not very clear in my wording above: I meant “different models / data” compared to examples 1 and 2. For different data or response variables, the elpd values can change drastically, which would explain why se_diff (and elpd?) might be so much bigger here than in example 1 or 2…
Either way, your interpretations are correct I think :)
Thanks for correcting this, I added an EDIT note to my answer above. Just for my own understanding, because I find this surprising: when would you consider elpd_diff values small or large, then? As I mentioned above, my understanding was that the absolute value does not matter.
I don’t know why y is different in the last row, since I used the same models for all of these pp_check plots.
This is the full m2 model, family = hurdle_lognormal. Received treatment hours is the y variable, and there’s a lot of zero inflation.
fit = brm(bf(received_treatment_hours ~ predictor1 + … + predictor9 + (1 | region), hu ~ predictor1 + … + predictor9 + (1 | region)), data = data, family = hurdle_lognormal(), cores = 3, chains = 3)
Interestingly, the predictions I am interested in are consistent between the models and also similar to the split analysis (lognormal model and binomial model). What’s also interesting is that the hierarchical structure did not give high se_diff values when comparing the lognormal/binomial models in the split analysis, but they became significant with the hurdle models.
Do you mean “they (se_diff) became high with hurdle models”? Are you certain that the observations are in the same order for both models? That one plot raises doubt, and different orders could explain the high se_diff.
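To illustrate why the ordering matters, here is a small base-R sketch with simulated pointwise elpd values (all names and numbers are hypothetical, and the se_diff formula is the usual sqrt(n) times the standard deviation of the pointwise differences). If one model's observations are in a different order, the pairing of pointwise values breaks: elpd_diff stays the same, but se_diff can grow a lot.

```r
# Simulated (hypothetical) pointwise elpd values for two similar models.
set.seed(2)
n <- 500
difficulty <- rnorm(n, mean = -3, sd = 2)         # shared per-observation part
elpd_a <- difficulty + rnorm(n, sd = 0.2)
elpd_b <- difficulty - 0.1 + rnorm(n, sd = 0.2)   # slightly worse on average

# se_diff as sqrt(n) * sd of the pointwise differences
se_diff <- function(a, b) sqrt(length(a)) * sd(a - b)

# Same values for model b, but with the observations in a different order
perm <- sample(n)

se_aligned  <- se_diff(elpd_a, elpd_b)
se_shuffled <- se_diff(elpd_a, elpd_b[perm])

# The sum of differences does not depend on the order, only the pairing does
elpd_diff_aligned  <- sum(elpd_a - elpd_b)
elpd_diff_shuffled <- sum(elpd_a - elpd_b[perm])

print(c(aligned = se_aligned, shuffled = se_shuffled))
print(c(aligned = elpd_diff_aligned, shuffled = elpd_diff_shuffled))
```

With real fits, one way to check this would be to confirm that the response values stored with each model are identical and in the same row order before comparing.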