Interpreting elpd_diff - loo package

Hi everyone, I have - hopefully - a straightforward question: When comparing the predictive ability of two models with loo, how big does elpd_diff have to be relative to its SE in order to conclude that one model is “better” than the other? A poster in the Psychological Methods Discussion Group on Facebook suggested that a rule of thumb is 1 SE. Does this sound reasonable? Is there a reference I can cite in my paper to justify the use of this (or anything else) as a rule of thumb?

Thanks
Ben

The question is straightforward, but unfortunately the answer is not.

Short answer: A difference of 1 SE is definitely too small. If n is not small, there is no bad observation model misspecification (i.e., you have done model checking), and there are no khats > 0.7, then as a rule of thumb I would say that a difference of 5 SE starts to be on the safe side.

A bit longer answer: Firstly, we need all PSIS khats < 0.7 so that Monte Carlo error is not dominating (in the forthcoming version of the loo package, we'll also provide an estimate of this Monte Carlo error). Secondly, we know that the SE estimate for a single model is optimistic with small n and in case of model misspecification (Grandvalet and Bengio, 2004). Grandvalet and Bengio (2004) show theoretically that the true SE is less than 2 times the estimate. There is no similar result for model comparison, but we could assume it would be similar (we are researching this). The problem is further complicated because the uncertainty in the comparison is not necessarily well described by a normal distribution with some SE, and especially for small n it would be better to take into account skewness and kurtosis, but that's not so easy. We are researching ways to improve the SE estimate and the calibration of loo estimates. While you wait for new research results (and a better reference to cite), I would suggest using 5 x SE, where I picked 5 as 2 x 2.5: 2.5 would correspond to a 99% interval, and 2 is the upper limit of the error given by Grandvalet & Bengio (2004).
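As a rough sketch of what that check could look like in R (loo1 and loo2 are placeholders for loo objects you have already computed for the two models; the Pareto k helpers assume a reasonably recent version of the loo package):

library(loo)

# the SE is only trustworthy if PSIS is reliable: all Pareto k estimates should be < 0.7
max(pareto_k_values(loo1))
max(pareto_k_values(loo2))

# difference in elpd and its standard error for the two models
cmp <- loo::compare(loo1, loo2)
cmp

# rule of thumb above: look for |elpd_diff| of roughly 5 SEs or more
abs(cmp["elpd_diff"]) / cmp["se"]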

Instead of the difference and its SE, you could also compute Bayesian stacking weights (see [1704.02030] Using stacking to average Bayesian predictive distributions, soon available in the loo package); if the weight of a model is 0, it is worse than the models with positive weight.
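For example (a minimal sketch, assuming a loo version in which loo_model_weights() is already available, and the same placeholder loo objects as above):

library(loo)

# stacking weights for the candidate models; a weight near 0 means the model
# does not contribute to the best predictive mixture of the candidates
loo_model_weights(list(model_1 = loo1, model_2 = loo2), method = "stacking")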

Aki


Thanks very much, Aki. This is a shame as, even in my best case scenario, the elpd_diff for the theoretically-preferred model is only around 2.5 SEs; but I will just have to be conservative and conclude that there is no solid evidence for this model!

Thanks again for the reply, and for developing these very useful tools!
Ben

I would also say that, based on loo, in this case there is no solid evidence that your theoretically-preferred model would provide better predictions, but note also that there is no solid evidence that it would provide worse predictions than your alternative.

Overall, loo is not good for detecting very small differences between models (and the same holds for WAIC, etc.). To detect small differences it is possible to add more assumptions about the future data, but then you need to check those assumptions. For more, see e.g. http://dx.doi.org/10.1214/12-SS102 and http://link.springer.com/article/10.1007/s11222-016-9649-y

Aki

Hi Aki/everyone

A quick follow-up on this if I may? I was getting some surprising results with loo (i.e., not getting differences between models when I expected to). So, as a sanity check, I ran essentially the same analysis - with a predictor I “know” to be significant - two different ways (R syntax pasted below):

(a) As the only fixed effect in a standard Bayesian mixed-effects model
(b) Using model comparison with loo. i.e., comparing the model above to a random-effects only model

Method (a) suggested a large and reliable effect of the predictor of interest (“entrenchment”): M = 0.62, SD = 0.04
Method (b) suggested no evidence for this effect: elpd_diff = -2.6, SE = 2.8

I take Aki’s point that loo is not good for detecting small differences between models, but given the results of (a), this looks like a very large difference. Am I doing something wrong?

Syntax follows…
Thanks
Ben

Method (a) - Estimate direct from model

# rethinking's glimmer() translates an lme4-style formula into a map2stan formula list
library(rethinking)
Sanity=glimmer(Un~ (1 + Entrenchment |PID) + (1|Verb) + Entrenchment, data=BOTH, family=gaussian, prefix=c("b_","v_"), default_prior="dnorm(0,1)", iter=10000, adapt_delta = 0.99)
Sanity_M=map2stan(Sanity$f, data=Sanity$d)
precis(Sanity_M)

Method (b) - Model comparison

Sanity_Baseline=glimmer(Un~ (1 + Entrenchment|PID) + (1|Verb), data=BOTH, family=gaussian, prefix=c("b_","v_"), default_prior="dnorm(0,1)", iter=10000, adapt_delta = 0.99)
Sanity_Baseline_M=map2stan(Sanity_Baseline$f, data=Sanity_Baseline$d)
Sanity_LOO = loo(WAIC(Sanity_M, pointwise=TRUE, loglik=TRUE))
Sanity_Baseline_LOO = loo(WAIC(Sanity_Baseline_M, pointwise=TRUE, loglik=TRUE))
loo::compare(Sanity_LOO, Sanity_Baseline_LOO)

I don’t understand what WAIC is doing here.

Hi Aki - thanks for getting back to me so quickly! I must admit I don't really understand exactly what this code is doing (I think I based it on something I read here - https://github.com/rmcelreath/rethinking/issues/33), but I assume it's supposed to be getting the pointwise values that loo compares? If it's not getting the right values, that would explain why I'm getting surprising results out of loo.

What should it be instead? Taking it out gives an error message:

Sanity_LOO = loo(Sanity_M, pointwise=TRUE, loglik=TRUE)
Error in UseMethod("loo") :
no applicable method for 'loo' applied to an object of class "map2stan"

As does trying to run compare on the models directly:

loo::compare(Sanity_M, Sanity_Baseline_M)
Error in loo::compare(Sanity_M, Sanity_Baseline_M) :
All inputs should have class 'loo'.

Thanks
Ben

loo(Sanity_M@stanfit)

Sorry - like this? I’m still getting an error message:

Sanity_LOO = loo(Sanity_M@stanfit)
Error in UseMethod("loo") :
no applicable method for 'loo' applied to an object of class "stanfit"

I also don't know what map2stan is. Clearly this is something other than rstan or rstanarm.

Maybe loo::loo(Sanity_M@stanfit).

Oh wait, that is wrong. I think you can just do LOO(Sanity_M, pointwise = TRUE).

It's the rethinking package, which - for the benefit of beginners like me - translates lme4 syntax into rstan syntax. The drawback is that the models that come out don't seem quite like normal rstan/rstanarm objects, so I don't know how to get them into the right format for loo.

Thanks, Ben, but this says

Sanity_LOO = loo(Sanity_M, pointwise = TRUE)
Error in UseMethod("loo") :
no applicable method for 'loo' applied to an object of class "map2stan"

With all-caps LOO so that it calls the version in rethinking

Doesn’t rstanarm accept lme4 syntax?
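For reference, a hedged sketch of what that rstanarm route could look like (stan_glmer() takes lme4-style formulas and its fits work directly with loo(); the variable names are copied from the models above):

library(rstanarm)
library(loo)

# full model and random-effects-only baseline, written with lme4-style formulas
fit_full <- stan_glmer(Un ~ Entrenchment + (1 + Entrenchment | PID) + (1 | Verb),
                       data = BOTH, family = gaussian(), adapt_delta = 0.99)
fit_base <- stan_glmer(Un ~ 1 + (1 + Entrenchment | PID) + (1 | Verb),
                       data = BOTH, family = gaussian(), adapt_delta = 0.99)

# loo() works on the fitted objects directly, no extraction step needed
loo::compare(loo(fit_full), loo(fit_base))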

Sanity_LOO = loo::loo(Sanity_M@stanfit)
Error in UseMethod("loo") :
no applicable method for 'loo' applied to an object of class "stanfit"

Thanks! OK that seems to work

Sanity_M_LOO = LOO(Sanity_M, pointwise = TRUE)
Sanity_Baseline_M_LOO = LOO(Sanity_Baseline_M, pointwise = TRUE)

But then it falls at the final hurdle:

loo::compare(Sanity_M_LOO, Sanity_Baseline_M_LOO)
Error in loo::compare(Sanity_M_LOO, Sanity_Baseline_M_LOO) :
All inputs should have class 'loo'.

Um, maybe? I think the problem here is that I don't really understand the relationship between rstan, rstanarm and rethinking. Ben, I don't suppose you could shed any light?

You want to use rethinking::compare here.
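I.e. something like this (a minimal sketch; in the rethinking versions I have seen, compare() takes the fitted map2stan models themselves and ranks them by WAIC by default - check ?compare in your installed version for LOO support):

# compare the fitted models directly, rather than the LOO() results
rethinking::compare(Sanity_M, Sanity_Baseline_M)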