Interpreting elpd_diff - loo package

rstanarm, rethinking, and brms for that matter all call the sampling function in the rstan package. The rstanarm package comes with compiled models, while rethinking and brms generate Stan code at runtime that has to be compiled. There are also differences in the default priors used in the three packages.

Sorry, still no luck!

Sanity_M_LOO = LOO(Sanity_M, pointwise = TRUE)
Constructing posterior predictions
[ 1000 / 1000 ]
Sanity_Baseline_M_LOO = LOO(Sanity_Baseline_M, pointwise = TRUE)
Constructing posterior predictions
[ 1000 / 1000 ]
rethinking::compare(Sanity_M_LOO, Sanity_Baseline_M_LOO)
Error in nobs.default(X[[i]], ...) : no 'nobs' method is available
No WAIC method for object of class 'numeric'. Returning AIC instead.
Error in UseMethod("logLik") :
no applicable method for 'logLik' applied to an object of class "c('double', 'numeric')"

Would I be better off trying to convert the model to a format the loo package can understand, rather than using the loo/compare function of rethinking? (that’s what I was trying to do initially - but it seems incorrectly!)

I checked this, and it’s possible that you were computing the correct thing, but I would not recommend using that code publicly because it’s super confusing to have loo(WAIC(…)).
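If you do want to call the loo package directly on a rethinking fit, a minimal sketch would look like the following (this assumes the compiled Stan model defines a log_lik vector and that the map2stan object exposes the underlying stanfit in its @stanfit slot):

```r
library(loo)

# extract the (draws x observations) pointwise log-likelihood matrix
ll <- extract_log_lik(Sanity_M@stanfit, parameter_name = "log_lik")

# PSIS-LOO computed directly by the loo package, with Pareto k diagnostics
loo(ll)
```

That avoids the confusing loo(WAIC(…)) nesting entirely.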

I think Un~ (1 + Entrenchment |PID) + (1|Verb) + Entrenchment should work with the stan_glmer function in rstanarm, too.


  1. Un~ (1 + Entrenchment |PID) + (1|Verb) + Entrenchment
  2. Un~ (1 + Entrenchment |PID) + (1|Verb)

is difficult. Even with a direct estimate, depending on the priors, both Entrenchment |PID and Entrenchment can explain the same things unless Entrenchment |PID is constrained to have zero mean (which I guess is not happening). With predictive performance estimates, it's even more likely that Entrenchment |PID can predict almost as well as Entrenchment |PID + Entrenchment, and a big difference could be seen only if each group had very few observations and there were many groups. Things get more complicated if Entrenchment correlates with other covariates.

rstanarm::stan_lmer(Un~ (1 + Entrenchment |PID) + (1|Verb) + Entrenchment, data=BOTH) should be fine, followed by loo and compare_models.
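As a sketch of that pipeline (compare_models was rstanarm's comparison function at the time; in later versions it was superseded by loo_compare):

```r
library(rstanarm)

fit_a <- stan_lmer(Un ~ (1 + Entrenchment | PID) + (1 | Verb), data = BOTH)
fit_b <- stan_lmer(Un ~ (1 + Entrenchment | PID) + (1 | Verb) + Entrenchment, data = BOTH)

loo_a <- loo(fit_a)
loo_b <- loo(fit_b)
compare_models(loo_a, loo_b)  # reports elpd_diff and its standard error
```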

OK thanks! So what this means conceptually is that, when the leave one out method is trying to predict the value of the held-out datapoint, it can do a very good job just by considering the by-participant variance in the effect of entrenchment; so much so that adding the fixed effect of entrenchment itself doesn’t really help?

I get confused again by “fixed” and “random”. See the thread Looic and elpd_diff (rstanarm model) for reasons not to use those terms and for what would be better terminology.

If you have several observations from a participant(?) indexed by PID, then after removing just one observation, it's still easy to predict it based on the other observations from that participant. If you want to examine the effect of the population mean, then you would need to leave out all observations for one participant at a time. Then you could see the difference between the population mean being zero (model Un~ (1 + Entrenchment |PID) + (1|Verb)) and non-zero (model Un~ (1 + Entrenchment |PID) + (1|Verb) + Entrenchment).

To implement leave-one-participant-out cross-validation you need to do some coding.
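A rough sketch of what that coding looks like for an rstanarm fit (here fit is a hypothetical stanreg object; update() refits on the reduced data, and log_lik() with newdata gives pointwise log predictive densities for the held-out rows):

```r
library(matrixStats)  # for colLogSumExps

pids <- unique(BOTH$PID)
elpd_by_pid <- numeric(length(pids))

for (i in seq_along(pids)) {
  holdout <- BOTH$PID == pids[i]
  fit_i <- update(fit, data = BOTH[!holdout, ])    # refit without participant i
  ll <- log_lik(fit_i, newdata = BOTH[holdout, ])  # draws x n_holdout matrix
  # log of the mean over posterior draws, summed over held-out observations
  elpd_by_pid[i] <- sum(colLogSumExps(ll) - log(nrow(ll)))
}

sum(elpd_by_pid)  # leave-one-participant-out elpd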

@jonah for rstanarm it would be great to have option for kfold to make the data division based on given variable, like in this case PID, to make leave-one-group-out easy.

Aha that makes so much sense! Thanks!!

I’m late to this party. Maybe better to start over with how to call WAIC and LOO and compare() from within rethinking.

map2stan objects have methods for WAIC and LOO. These methods extract pointwise log-likelihoods and then compute WAIC and PSIS-LOO respectively. The pointwise=TRUE argument returns pointwise values rather than their sum.

The rethinking::compare function takes model fits (map2stan or stanfit objects), not pointwise lists of log likelihoods. If you want it to use LOO, then use the func=LOO argument.
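So, if I read the earlier posts right, the call would look something like this (passing the fitted models themselves, not the LOO(…) results):

```r
# func = LOO switches the comparison criterion from the default WAIC
rethinking::compare(Sanity_M, Sanity_Baseline_M, func = LOO)
```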

WAIC and LOO work with stanfit objects as well, provided the model defines a log_lik vector. Example in ?WAIC.

Note that the LOO support in rethinking is neither advertised nor entirely tested. It just calls the loo package, though, so there should be no surprises.

Thanks very much Richard - I'm starting to think that the comparison was probably in fact working correctly (though I will double check by doing everything within rethinking) - the lack of a difference is probably due to Aki's point about a participant's own performance being the best predictor of their held-out datapoint.

Aki, presumably this is also true at the level of (1|Verb) - participants are providing ratings for 160 different verbs. I.e., presumably the best predictor of a rating for that verb is all the other ratings for that verb (as indexed by 1|Verb), rather than the predictors I'm interested in? If this is the case, then presumably I need to hold out both (a) all the other ratings for that participant and (b) all the other ratings for that verb, for each held-out datapoint?

Thanks everyone!


As we discussed in the blog, it would be nice if PSIS-LOO were advertised, as it has the Pareto diagnostic. Every time someone has a problem with WAIC, we ask them to report the Pareto diagnostics from the loo function, so that we have a better idea of where the problem is. (I also investigated similar diagnostics for WAIC, but they did not work well.)

Hi everyone

Thanks again for all your help! One final PS if I may…

Is there any way to do model comparison with Bayesian models that does NOT use leave-one-out cross-validation? I have read elsewhere that global measures like AIC, BIC, DIC and WAIC are not suitable, but is there any other alternative?

Aki’s solution makes sense in principle; but in practice, I think it’s going to be well beyond my coding abilities. Also, given that my goal in doing model comparison is just to find out whether a particular predictor improves the model, the predictive ability of the rival models (as measured by loo) is perhaps not exactly what I most need to know?

Thanks again!

It is important to recognize that all of these measures are approximations, at various levels of accuracy, to a score that compares some predictive distribution to a true data generating process. For example, AIC approximates how “close” a point predictive distribution is to the true data generating process, BIC approximates the prior predictive, and DIC and WAIC approximate the posterior predictive. Cross-validation in its various forms can also be applied to quantifying all three of these predictive distributions.

Consequently your only two choices are which distribution you will use to make predictions and how you will estimate “closeness”; otherwise it's all the same.

See Section 3 for more detail, although it is admittedly a bit high level. Unfortunately I don't think we have a more cursory treatment of these matters yet.


You write yourself “just to find out whether a particular predictor improves the model”. Improves compared to what? If you want to know about improvement you need to define the baseline. You could use just one model and look at the marginal posterior of the effect if you knew that the effect was independent, but it seems that in your case you can't assume that.

We have a kfold code example in

And this example also looks very useful

You would just need to form the kfold indices based on the groups of observations you need to leave out. I'll add an issue for the loo package for adding such an argument to the kfold function.
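The division itself is only a few lines of base R. Here is a sketch, with a made-up data frame standing in for BOTH:

```r
set.seed(1)
# hypothetical stand-in for BOTH: 25 participants, 4 ratings each
BOTH <- data.frame(PID = rep(paste0("p", 1:25), each = 4), Un = rnorm(100))

K <- 5
pids <- unique(BOTH$PID)
# assign each participant to exactly one fold
fold_of_pid <- setNames(sample(rep_len(1:K, length(pids))), pids)
fold_idx <- unname(fold_of_pid[as.character(BOTH$PID)])  # one fold id per row

# every observation from a given participant lands in the same fold
stopifnot(all(tapply(fold_idx, BOTH$PID,
                     function(x) length(unique(x))) == 1))
```

These fold_idx values are what you would feed to a kfold routine so that a whole participant is held out at once.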


Hi, sorry for being unclear again. What I mean by “improves the model” is: is model (b) “better” than baseline model (a)?

a=(Un~ (1 + Entrenchment|PID) + (1|Verb), data=BOTH)
b=(Un~ (1 + Entrenchment|PID) + (1|Verb) + Entrenchment, data=BOTH)

So far I’ve been trying to answer this using loo, but given the complications we’ve discussed, I’m wondering whether there is any other possible way, such as just using AIC, WAIC, DIC etc?


Thanks Michael. I'm afraid that paper is too high level for me, but let me try to summarize what I've inferred from what you and others have said and, if you have a minute, perhaps you could let me know whether or not this is correct.

What I’m trying to do is compare model a (a baseline model) and model b:

a=(Un~ (1 + Entrenchment|PID) + (1|Verb), data=BOTH)
b=(Un~ (1 + Entrenchment|PID) + (1|Verb) + Entrenchment, data=BOTH)

Any method of model comparison I might choose (AIC, BIC, DIC, WAIC) is an approximation, but loo cross-validation is a better approximation. Is that correct?

If so, I guess what I'm asking is how bad it would be to use, say, DIC to compare the models instead of loo cross-validation. (I have like 50 model comparisons to run, so even if I can get loo working in the way Aki suggests, it's going to be weeks of computer time.) I.e., if DIC suggests that there IS a difference between the two models, is there a high probability that I'm making a Type I error?


AIC, DIC, and WAIC also consider a single pointwise prediction, and thus will not help with the groupwise prediction (which can be approximated with leave-one-group-out) that you need here.

AIC, DIC, WAIC, and LOO are methods to estimate predictive performance in the pointwise case. LOO is the best of these. BIC is something else.

But if you need something other than estimated predictive performance in the pointwise case, then you should use something else. I don't know any approach other than k-fold cross-validation to estimate the predictive performance in this hierarchical setting you have (under the M-open assumption).

DIC is always worse than LOO, and if LOO is not applicable to your problem then DIC is definitely out of the question.

I repeat that this comparison is conceptually very easy with k-fold CV; your problem now is just that you need to write a few lines of code to divide the data according to the PID group structure.

In the future, we’ll add support to rstanarm so that you can do the following very easily

fit1<-stan_glmer(Un~ (1 + Entrenchment|PID) + (1|Verb), data=BOTH)
fit2<-stan_glmer(Un~ (1 + Entrenchment|PID) + (1|Verb) + Entrenchment, data=BOTH)

(kfold1 <- kfold(fit1, groupby = PID)) #not yet implemented
(kfold2 <- kfold(fit2, groupby = PID)) #not yet implemented
compare(kfold1, kfold2)

I'm sorry, but we don't have the resources to add these any quicker.