Statistical significance of model comparison using ELPD

Hello,

I’d like to compare two models that are quite similar to each other, differing only in one interaction term. Specifically, I’d like to determine whether one of the models is significantly better, or whether they are comparable and there is no meaningful difference between them. I have read a few posts on this topic and would like to make sure I understand it correctly and that my approach is the right one. I’m adding the links below.

I have used Bambi and PyMC to specify and fit the models, and then ArviZ’s compare to get the ELPD, ELPD difference, and corresponding SE values. Here is an example table from my data:
[ArviZ compare table: the ELPD difference between the two models is 29.15 with a dSE of 8.83]
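For reference, this is roughly how I produce the table (ArviZ’s bundled example models stand in here for my two Bambi/PyMC fits; column names can vary slightly between ArviZ versions):

```python
import arviz as az

# Two bundled example models stand in for the fits with and without the interaction
idata_a = az.load_arviz_data("centered_eight")
idata_b = az.load_arviz_data("non_centered_eight")

# PSIS-LOO is the default; the resulting DataFrame contains
# elpd_loo, se, elpd_diff, dse and weight for each model
cmp = az.compare({"model_a": idata_a, "model_b": idata_b})
print(cmp)
```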

My understanding is the following:

  1. If the ELPD difference is less than 4, then the models are comparable and thus not significantly different. This is not my case, as the ELPD difference is 29.15.
  2. If I want to compare them further, I can take the dSE and multiply it by the z-value that corresponds to the p-value at which I want to test the difference (as in wgoette’s response). In this case that would be 8.83*1.96 = 17.3, which is less than 29.15, so there is a significant difference between the models at the 5% significance level with a two-tailed test. I would multiply by 1.645 instead if I wanted to check only one side, which arguably is the only direction that makes sense here (is it correct to assume a one-tailed test?). I have written the arithmetic out as code just after this list.
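Written out as code (numbers taken from my table above; scipy is only used for the normal CDF):

```python
from scipy import stats

elpd_diff = 29.15   # ELPD difference from az.compare
dse = 8.83          # standard error of that difference (dSE)

z = elpd_diff / dse                          # ≈ 3.30
p_two_sided = 2 * (1 - stats.norm.cdf(z))    # ≈ 0.001
p_one_sided = 1 - stats.norm.cdf(z)          # ≈ 0.0005

# Equivalent check against the threshold from point 2:
print(1.96 * dse)    # ≈ 17.3, which is below 29.15
```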

Is this the way to go?

Considering that the function also provides weight, is it also possible to use that? From the docs:

weight: Relative weight for each model. This can be loosely interpreted as the probability of each model (among the compared model) given the data. By default the uncertainty in the weights estimation is considered using Bayesian bootstrap

Would it be possible to claim that the models are comparable/significantly different based on their probability? If yes, what would be the threshold?

Lastly, what really puzzles me: why can’t I simply look at the distributions of the two ELPDs and check whether they overlap, for instance whether the mean of one is contained within the SE of the other (which would be the case here)?

Thank you for your help and tips.

Sources:

  1. Model checking & comparison using loo vs loo_compare
  2. Cross-validation FAQ • loo
  3. Cross-validation FAQ • loo

See [2008.10296] Uncertainty in Bayesian Leave-One-Out Cross-Validation Based Model Comparison for details on when the normal approximation is valid.

It’s better to report the difference and dSE, as they contain more information than a statement of significance at some arbitrary threshold.
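For intuition, the dSE reported by compare is (roughly) the standard error of the paired pointwise differences, not a combination of the two models’ individual SEs, which is why checking whether the two elpd ± SE intervals overlap is not the relevant comparison. A sketch of that computation, again using ArviZ’s bundled example models as stand-ins for the two fits:

```python
import numpy as np
import arviz as az

idata_a = az.load_arviz_data("centered_eight")
idata_b = az.load_arviz_data("non_centered_eight")

loo_a = az.loo(idata_a, pointwise=True)
loo_b = az.loo(idata_b, pointwise=True)

# Per-observation elpd differences (the pointwise attribute is loo_i in
# current ArviZ releases; the name may differ in other versions)
diff = loo_a.loo_i.values - loo_b.loo_i.values

elpd_diff = diff.sum()                        # paired elpd difference
dse = np.sqrt(diff.size * diff.var(ddof=1))   # SE of the paired difference

# The pointwise elpd values of two similar models are strongly correlated,
# so this paired SE is typically much smaller than anything you could read
# off the two separate elpd ± SE intervals.
```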

In addition to reporting the difference and dSE, it makes sense to report the probability that model A is better than model B if you have just two models. If you have more than two models, see also Efficient estimation and correction of selection-induced bias with order statistics | Statistics and Computing
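Under the normal approximation for the difference, that probability can be read off directly from the elpd_diff and dse that compare reports; a sketch using the numbers from the table above:

```python
from scipy.stats import norm

elpd_diff = 29.15   # difference reported by az.compare (better model minus the other)
dse = 8.83          # standard error of that difference

# Probability that the higher-ranked model has the better expected
# predictive performance, assuming diff ~ Normal(elpd_diff, dse)
p_better = norm.cdf(elpd_diff / dse)   # ≈ 0.9995
```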

These weights are not probabilities, but they can be used to do model selection, too. They can be helpful if you have more than two models.

See the first 20 minutes of BDA 2023 Lecture 9.1 Model selection and hypothesis testing part 1 and the above-mentioned paper. Maybe I should add this to the CV-FAQ, too.


None of Bambi, PyMC or ArviZ are Stan projects, but we can answer general stats questions.

I don’t know about Bambi or ArviZ, but PyMC has its own very active Discourse, which might be a better place to ask about Bambi and PyMC: