Interpreting elpd_diff - loo package

Benambridge · August 21, 2017, 9:15am

Hi everyone, I have (hopefully) a straightforward question: when comparing the predictive ability of two models with loo, how big does elpd_diff have to be relative to its SE in order to conclude that one model is “better” than the other? A poster in the Psychological Methods Discussion Group on Facebook suggested 1 SE as a rule of thumb. Does this sound reasonable? Is there a reference I can cite in my paper to justify the use of this (or anything else) as a rule of thumb?

Thanks
Ben

avehtari · August 21, 2017, 10:43am

The question is straightforward, but unfortunately the answer is not.

Short answer: A difference of 1 SE is definitely too small. If n is not small, there is no bad misspecification of the observation model (i.e., you have done model checking), and there are no khats > 0.7, then as a rule of thumb I would say that a difference of 5 SE starts to be on the safe side.

A bit longer answer: Firstly, all PSIS khats need to be below 0.7, so that Monte Carlo error is not dominating (the forthcoming version of the loo package will also provide an estimate of this Monte Carlo error). Secondly, we know that the SE estimate for a single model is optimistic with small n and in case of model misspecification (Grandvalet and Bengio, 2004). Grandvalet and Bengio (2004) show theoretically that the true SE is less than 2 times the estimate. There is no similar result for model comparison, but we could assume it would be similar (we are researching this). The problem is further complicated because the uncertainty in the comparison is not necessarily well described by a normal distribution with some SE; especially for small n it would be better to take skewness and kurtosis into account, but that is not so easy. We are researching ways to improve the SE estimate and the calibration of loo estimates. While you wait for new research results (and a better reference to cite), I would suggest using 5 x SE, where I picked 5 as 2 x 2.5: 2.5 corresponds to a 99% interval, and 2 is the upper limit of the error given by Grandvalet & Bengio (2004).
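The arithmetic behind the 5 x SE suggestion can be sketched in a few lines of base R. This is only an illustration: the helper name `exceeds_rule_of_thumb` and the example numbers are hypothetical, and the check covers the ratio only, not the khat or model-checking conditions mentioned above.

```r
# Sketch of the 5 x SE rule of thumb (helper name and inputs are hypothetical).
z_99 <- qnorm(0.995)             # two-sided 99% normal quantile; the post rounds it to 2.5
se_inflation <- 2                # upper bound on SE underestimation cited in the post
threshold <- se_inflation * 2.5  # the multiplier suggested in the post

# TRUE when |elpd_diff| is larger than `multiplier` standard errors.
exceeds_rule_of_thumb <- function(elpd_diff, se_diff, multiplier = threshold) {
  stopifnot(se_diff > 0)
  abs(elpd_diff) / se_diff > multiplier
}

exceeds_rule_of_thumb(-12.5, 5)  # ratio of 2.5 SE
exceeds_rule_of_thumb(-30, 4)    # ratio of 7.5 SE
```

The multiplier is left as an argument because the post presents 5 as a provisional choice, not as a fixed constant.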

Instead of the difference and SE, you could also compute Bayesian stacking weights (arXiv:1704.02030, “Using stacking to average Bayesian predictive distributions”; soon available in the loo package). If the weight of a model is 0, it is worse than the models with positive weight.
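To give a feel for what a stacking weight is, here is a minimal base R sketch for the two-model case. It is not the loo package's implementation: the function name and the toy inputs are hypothetical, and it assumes you already have pointwise LOO log predictive densities for both models on the same observations. The weight maximizes the summed log score of the mixture of the two predictive distributions.

```r
# Sketch: stacking weight for two models (hypothetical helper, not the loo API).
# lpd_1, lpd_2: pointwise LOO log predictive densities, one entry per observation.
stacking_weight_two <- function(lpd_1, lpd_2) {
  stopifnot(length(lpd_1) == length(lpd_2))
  # Summed log score of the mixture w * p1 + (1 - w) * p2.
  objective <- function(w) sum(log(w * exp(lpd_1) + (1 - w) * exp(lpd_2)))
  # The objective is concave in w, so a 1-D search over [0, 1] is enough.
  optimize(objective, interval = c(0, 1), maximum = TRUE)$maximum
}

# Toy input: model 1 is better at every observation, so its weight ends up near 1.
lpd_1 <- c(-1.0, -1.2, -0.8, -1.1)
lpd_2 <- lpd_1 - 0.5
stacking_weight_two(lpd_1, lpd_2)
```

With very negative log densities `exp()` can underflow, so a real implementation would work on a stabilized scale; this sketch ignores that.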

Aki

Benambridge · August 21, 2017, 11:32am

Thanks very much, Aki. This is a shame as, even in my best case scenario, the elpd_diff for the theoretically-preferred model is only around 2.5 SEs; but I will just have to be conservative and conclude that there is no solid evidence for this model!

Thanks again for the reply, and for developing these very useful tools!
Ben

avehtari · August 21, 2017, 11:43am

I would also say that, based on loo, there is no solid evidence in this case that your theoretically preferred model would provide better predictions, but note also that there is no solid evidence that it would provide worse predictions than your alternative.

Overall loo is not good for detecting very small differences between models (and the same holds for WAIC, etc.). To detect small differences it is possible to add more assumptions about the future data, but then you need to check those assumptions. See more, e.g. http://dx.doi.org/10.1214/12-SS102 and http://link.springer.com/article/10.1007/s11222-016-9649-y

Aki

Benambridge · September 19, 2017, 6:52pm

Hi Aki/everyone

A quick follow-up on this if I may? I was getting some surprising results with loo (i.e., not getting differences between models when I expected to). So, as a sanity check, I ran essentially the same analysis - with a predictor I “know” to be significant - two different ways (R syntax pasted below)

(a) As the only fixed effect in a standard Bayesian mixed-effects model
(b) Using model comparison with loo. i.e., comparing the model above to a random-effects only model

Method (a) suggested a large and reliable effect of the predictor of interest (“entrenchment”): M=0.62 SD=0.04
Method (b) suggested no evidence for this effect: elpd_diff = -2.6, SE = 2.8

I take Aki’s point that loo is not good for detecting small differences between models, but given the results of (a), this looks like a very large difference. Am I doing something wrong?

Syntax follows…
Thanks
Ben

Method (a) - Estimate direct from model

Sanity=glimmer(Un~ (1 + Entrenchment |PID) + (1|Verb) + Entrenchment, data=BOTH, family=gaussian, prefix=c("b_","v_"), default_prior="dnorm(0,1)", iter=10000, adapt_delta = 0.99)
Sanity_M=map2stan(Sanity$f, data=Sanity$d)
precis(Sanity_M)

Method (b) - Model comparison

Sanity_Baseline=glimmer(Un~ (1 + Entrenchment|PID) + (1|Verb), data=BOTH, family=gaussian, prefix=c("b_","v_"), default_prior="dnorm(0,1)", iter=10000, adapt_delta = 0.99)
Sanity_Baseline_M=map2stan(Sanity_Baseline$f, data=Sanity_Baseline$d)
Sanity_LOO = loo(WAIC(Sanity_M, pointwise=TRUE, loglik=TRUE))
Sanity_Baseline_LOO = loo(WAIC(Sanity_Baseline_M, pointwise=TRUE, loglik=TRUE))
loo::compare(Sanity_LOO, Sanity_Baseline_LOO)

avehtari · September 19, 2017, 7:07pm

I don’t understand what WAIC is doing here.

Benambridge · September 19, 2017, 7:23pm

Hi Aki - thanks for getting back to me so quickly! I must admit I don’t really understand exactly what this code is doing (I think I based it on something I read here - https://github.com/rmcelreath/rethinking/issues/33), but as far as I understand it, it’s supposed to be getting the pointwise values that loo compares. If it’s not getting the right values, that would explain why I’m getting surprising results out of loo.

What should it be instead? Taking it out gives an error message:

Sanity_LOO = loo(Sanity_M, pointwise=TRUE, loglik=TRUE)
Error in UseMethod("loo") :
no applicable method for 'loo' applied to an object of class "map2stan"

As does trying to run compare on the models directly:

loo::compare(Sanity_M, Sanity_Baseline_M)
Error in loo::compare(Sanity_M, Sanity_Baseline_M) :
All inputs should have class 'loo'.

Thanks
Ben

bgoodri · September 19, 2017, 7:27pm

loo(Sanity_M@stanfit)

Benambridge · September 19, 2017, 7:30pm

Sorry - like this? I’m still getting an error message:

Sanity_LOO = loo(Sanity_M@stanfit)
Error in UseMethod("loo") :
no applicable method for 'loo' applied to an object of class "stanfit"

avehtari · September 19, 2017, 7:31pm

I also don’t know what map2stan is. Clearly it is something other than rstan or rstanarm.

bgoodri · September 19, 2017, 7:31pm

Maybe loo::loo(Sanity_M@stanfit).

bgoodri · September 19, 2017, 7:34pm

Oh wait, that is wrong. I think you can just do LOO(Sanity_M, pointwise = TRUE).

Benambridge · September 19, 2017, 7:35pm

It’s the rethinking package, which (for the benefit of beginners like me) translates lme4 syntax into rstan syntax. The drawback is that the models that come out don’t seem quite like normal rstan/rstanarm objects, so I don’t know how to get them into the right format for loo.

Benambridge · September 19, 2017, 7:36pm

Thanks, Ben, but this says

Sanity_LOO = loo(Sanity_M, pointwise = TRUE)
Error in UseMethod("loo") :
no applicable method for 'loo' applied to an object of class "map2stan"

bgoodri · September 19, 2017, 7:37pm

With all-caps LOO so that it calls the version in rethinking

avehtari · September 19, 2017, 7:37pm

Doesn’t rstanarm accept lme4 syntax?

Benambridge · September 19, 2017, 7:37pm

Sanity_LOO = loo::loo(Sanity_M@stanfit)
Error in UseMethod("loo") :
no applicable method for 'loo' applied to an object of class "stanfit"

Benambridge · September 19, 2017, 7:41pm

Thanks! OK that seems to work

Sanity_M_LOO = LOO(Sanity_M, pointwise = TRUE)
Sanity_Baseline_M_LOO = LOO(Sanity_Baseline_M, pointwise = TRUE)

But then it falls at the final hurdle:

loo::compare(Sanity_M_LOO, Sanity_Baseline_M_LOO)
Error in loo::compare(Sanity_M_LOO, Sanity_Baseline_M_LOO) :
All inputs should have class 'loo'.

Benambridge · September 19, 2017, 7:42pm

Um maybe? I think the problem here is that I don’t really understand the relationship between rstan, rstanarm and rethinking. Ben, don’t suppose you could shed any light?

bgoodri · September 19, 2017, 7:48pm

You want to use rethinking::compare here.
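For anyone following along who wonders what elpd_diff and its SE actually summarize, here is a rough base R sketch computed from pointwise elpd values (one per observation, same observations for both models). The function name and toy data are hypothetical. The sign convention (second model minus first) is an assumption here, since conventions differ between packages and versions, so check the documentation of whichever compare function you call.

```r
# Sketch: elpd difference and its standard error from pointwise elpd values.
# Hypothetical helper; not the rethinking or loo API.
elpd_diff_se <- function(pointwise_a, pointwise_b) {
  stopifnot(length(pointwise_a) == length(pointwise_b))
  d <- pointwise_b - pointwise_a  # per-observation differences
  n <- length(d)
  c(elpd_diff = sum(d),           # total difference in elpd
    se = sqrt(n) * sd(d))         # SE of the sum of n paired differences
}

# Toy data: model b is only slightly better per observation than model a.
set.seed(1)
a <- rnorm(200, mean = -1.2, sd = 0.5)
b <- a + rnorm(200, mean = 0.01, sd = 0.2)
elpd_diff_se(a, b)
```

Because the SE is built from the paired per-observation differences, two models whose pointwise predictions are nearly identical can give a small elpd_diff even when a coefficient in one of them is clearly non-zero.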
