Interpreting elpd_diff - loo package

rstanarm, rethinking, and brms all call the sampling function in the rstan package. The rstanarm package comes with precompiled models, while rethinking and brms generate Stan code at runtime that has to be compiled. There are also differences in the default priors used by the three packages.

Sorry, still no luck!

Sanity_M_LOO = LOO(Sanity_M, pointwise = TRUE)
Constructing posterior predictions
[ 1000 / 1000 ]
Sanity_Baseline_M_LOO = LOO(Sanity_Baseline_M, pointwise = TRUE)
Constructing posterior predictions
[ 1000 / 1000 ]
rethinking::compare(Sanity_M_LOO, Sanity_Baseline_M_LOO)
Error in nobs.default(X[[i]], ...) : no 'nobs' method is available
No WAIC method for object of class 'numeric'. Returning AIC instead.
Error in UseMethod("logLik") :
  no applicable method for 'logLik' applied to an object of class "c('double', 'numeric')"

Would I be better off trying to convert the model to a format the loo package can understand, rather than using the loo/compare functions of rethinking? (That's what I was trying to do initially, but it seems I was doing it incorrectly!)

I checked this, and it’s possible that you were computing the correct thing, but I would not recommend using that code publicly because it’s super confusing to have loo(WAIC(…)).

I think Un ~ (1 + Entrenchment | PID) + (1 | Verb) + Entrenchment should work with the stan_glmer function in rstanarm, too.

Comparing

  1. Un~ (1 + Entrenchment |PID) + (1|Verb) + Entrenchment
    and
  2. Un~ (1 + Entrenchment |PID) + (1|Verb)

is difficult. Even with a direct estimate, depending on the priors, both Entrenchment | PID and Entrenchment can explain the same things unless Entrenchment | PID is constrained to have zero mean (which I guess is not happening here). With predictive performance estimates, it's even more likely that Entrenchment | PID alone can predict almost as well as Entrenchment | PID + Entrenchment, and a big difference could be seen only if each group had very few observations and there were many groups. Things get more complicated if Entrenchment correlates with other covariates.

rstanarm::stan_lmer(Un~ (1 + Entrenchment |PID) + (1|Verb) + Entrenchment, data=BOTH) should be fine, followed by loo and compare_models.
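
For example, a minimal sketch of that workflow (assuming the BOTH data frame contains the Un, Entrenchment, PID, and Verb columns used above, and that a Gaussian model is appropriate for Un):

library(rstanarm)

# fit the two models with stan_lmer
fit_base <- stan_lmer(Un ~ (1 + Entrenchment | PID) + (1 | Verb), data = BOTH)
fit_full <- stan_lmer(Un ~ (1 + Entrenchment | PID) + (1 | Verb) + Entrenchment, data = BOTH)

# PSIS-LOO for each model, then compare the expected log predictive densities
loo_base <- loo(fit_base)
loo_full <- loo(fit_full)
compare_models(loo_base, loo_full)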

OK thanks! So what this means conceptually is that, when the leave-one-out method is trying to predict the value of the held-out datapoint, it can do a very good job just by considering the by-participant variance in the effect of entrenchment; so much so that adding the fixed effect of entrenchment itself doesn't really help?

I again get confused by fixed and random. See Looic and elpd_diff (rstanarm model) - #2 by avehtari for reasons not to use those terms and for what would be better terminology.

If you have several observations from a participant (indexed by PID), then after removing just one observation it's still easy to predict that observation based on the other observations from the same participant. If you want to examine the effect of the population mean, then you would need to leave out all observations for one participant at a time. Then you could see the difference between the population mean being zero (model Un ~ (1 + Entrenchment | PID) + (1 | Verb)) and non-zero (model Un ~ (1 + Entrenchment | PID) + (1 | Verb) + Entrenchment).

To implement leave-one-participant-out cross-validation you need to do some coding.
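
A rough sketch of what that coding could look like with rstanarm (untested; it assumes the formula and BOTH data above, that BOTH$PID identifies participants, and that log_lik() with newdata can handle the held-out participant as a new group level):

library(rstanarm)
library(matrixStats)  # for colLogSumExps()

form <- Un ~ (1 + Entrenchment | PID) + (1 | Verb) + Entrenchment
pids <- unique(BOTH$PID)
elpd_i <- rep(NA_real_, nrow(BOTH))

for (p in pids) {
  holdout <- BOTH$PID == p
  # refit without this participant (slow: one refit per participant)
  fit_p <- stan_lmer(form, data = BOTH[!holdout, ], refresh = 0)
  # log predictive density of each held-out observation,
  # averaging the likelihood over the posterior draws
  ll <- log_lik(fit_p, newdata = BOTH[holdout, ])  # draws x held-out points
  elpd_i[holdout] <- colLogSumExps(ll) - log(nrow(ll))
}
sum(elpd_i)  # leave-one-participant-out elpd for this model

Repeating the same loop with the baseline formula and comparing the two sums (and the standard error of the pointwise differences) then gives the comparison discussed above.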

@jonah, for rstanarm it would be great to have an option for kfold to make the data division based on a given variable, like PID in this case, to make leave-one-group-out easy.

Aha that makes so much sense! Thanks!!

I’m late to this party. Maybe better to start over with how to call WAIC and LOO and compare() from within rethinking.

map2stan objects have methods for WAIC and LOO. These methods extract pointwise log likelihoods and then compute WAIC and PSIS-LOO respectively. The pointwise=TRUE argument returns pointwise rather than summed.

The rethinking::compare function takes model fits (map2stan or stanfit objects), not pointwise lists of log likelihoods. If you want it to use LOO, then use the func=LOO argument.
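
In other words, a minimal example of the intended usage (reusing the Sanity_M and Sanity_Baseline_M fits from earlier, passed as model fits rather than as pointwise LOO output):

library(rethinking)

# compare the fits themselves, asking compare() to use PSIS-LOO instead of WAIC
compare(Sanity_M, Sanity_Baseline_M, func = LOO)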

WAIC and LOO work with stanfit as well, provided the model defines a log_lik vector. Example in ?WAIC.
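
And for a plain stanfit, the pointwise log likelihoods can also be handed to the loo package directly; a small sketch, assuming the generated quantities block defines a vector named log_lik and the fit is stored in a (hypothetical) object called my_stanfit:

library(loo)

# extract the draws x observations matrix of pointwise log likelihoods
ll <- extract_log_lik(my_stanfit, parameter_name = "log_lik")
loo(ll)   # PSIS-LOO
waic(ll)  # WAIC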

Note that the LOO support in rethinking is not advertised nor entirely tested. It just calls the loo package, though, so probably no surprises.

Thanks very much Richard - I'm starting to think that the comparison was probably in fact working correctly (though I will double check by doing everything within rethinking) - the lack of a difference is probably due to Aki's point about a participant's own performance being the best predictor of their held-out datapoint.

Aki, presumably this is also true at the level of (1|Verb) - participants are providing ratings for 160 different verbs. I.e., presumably the best predictor of a rating for that verb is all the other ratings for that verb (as indexed by 1|Verb), rather than the predictors I'm interested in? If this is the case, then presumably I need to hold out both (a) all the other ratings for that participant and (b) all the other ratings for that verb, for each held-out datapoint?

Thanks everyone!
Ben

Yes.

As we discussed in the blog, it would be nice if the PSIS-LOO support were advertised, as it has the Pareto diagnostic. Every time someone has a problem with WAIC, we ask them to report the Pareto diagnostics from the loo function, so that we have a better idea of where the problem is. (I also investigated similar diagnostics for WAIC, but they were not working well.)
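
For example, with a loo object such as the (hypothetical) loo_full from the sketch earlier in the thread, the Pareto k diagnostics can be inspected with:

print(loo_full)  # the printed summary reports how many Pareto k estimates are high
plot(loo_full)   # plots the pointwise Pareto k values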

Hi everyone

Thanks again for all your help! One final PS if I may…

Is there any way to do model comparison with Bayesian models that does NOT use leave-one-out validation? I have read elsewhere that global measures like AIC, BIC, DIC and WAIC are not suitable, but is there any other alternative?

Aki’s solution makes sense in principle; but in practice, I think it’s going to be well beyond my coding abilities. Also, given that my goal in doing model comparison is just to find out whether a particular predictor improves the model, the predictive ability of the rival models (as measured by loo) is perhaps not exactly what I most need to know?

Thanks again!
Ben

It is important to recognize that all of these measures are approximations to a score that compares some predictive distribution to a true data generating process, with various levels of approximation. For example, AIC approximates how “close” a point predictive distribution is to the true data generating process, BIC approximates the prior predictive, and DIC and WAIC approximate the posterior predictive. Cross-validation of various forms can also be applied to quantify all three of these predictive distributions.

Consequently, your only two choices are which distribution you will use to make predictions and how you will estimate “closeness”; otherwise it's all the same.

See https://arxiv.org/abs/1506.02273, Section 3, for more detail, although admittedly a bit high level. Unfortunately I don’t think we have a more cursory treatment of these matters yet.

You write yourself “just to find out whether a particular predictor improves the model”. Improves compared to what? If you want to know about improvement, you need to define the baseline. You could use just one model and look at the marginal posterior of the effect if you knew that the effect was independent of the other terms, but it seems that in your case you can't assume that.
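
(For reference, looking at that marginal posterior with rstanarm would just be something like the following, using the hypothetical fit_full from the earlier sketch; but, as said, this is only a reasonable summary if the effect is roughly independent of the other terms.)

# marginal posterior summary of the population-level Entrenchment effect
posterior_interval(fit_full, pars = "Entrenchment", prob = 0.95)
summary(fit_full, pars = "Entrenchment")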

We have a kfold code example in [1507.04544] Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC.

And this example also looks very useful:
http://htmlpreview.github.io/?https://github.com/stan-dev/stancon_talks/blob/master/2017/Contributed-Talks/07_nicenboim/kfold.html

You would just need to form the kfold indices based on the groups of observations you need to leave out. I'll add an issue for the loo package to add an additional argument to the kfold function.
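
Forming the indices should just be something along these lines (a sketch, assuming BOTH$PID identifies participants):

# one fold per participant gives leave-one-participant-out
fold <- as.integer(factor(BOTH$PID))
K <- length(unique(fold))

# or, for e.g. K = 10 folds, assign whole participants (not single rows) to folds
# K <- 10
# pid_fold <- sample(rep(1:K, length.out = nlevels(factor(BOTH$PID))))
# fold <- pid_fold[as.integer(factor(BOTH$PID))]

These fold indices can then be plugged into the k-fold code in the examples linked above.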

Aki

Hi, sorry for being unclear again. What I mean by “improves the model” is “is model (b) ‘better’ than baseline model (a)?”, where:

a=(Un~ (1 + Entrenchment|PID) + (1|Verb), data=BOTH)
b=(Un~ (1 + Entrenchment|PID) + (1|Verb) + Entrenchment, data=BOTH)

So far I’ve been trying to answer this using loo, but given the complications we’ve discussed, I’m wondering whether there is any other possible way, such as just using AIC, WAIC, DIC etc?

Thanks!
Ben

Thanks Michael. I'm afraid that paper is too high level for me, but let me try and summarize what I've inferred from what you and others have said, and - if you have a minute - perhaps you could let me know whether or not this is correct.

What I’m trying to do is compare model a (a baseline model) and model b:

a=(Un~ (1 + Entrenchment|PID) + (1|Verb), data=BOTH)
b=(Un~ (1 + Entrenchment|PID) + (1|Verb) + Entrenchment, data=BOTH)

Any method of model comparison I might choose (AIC, BIC, DIC, WAIC) is an approximation. But LOO cross-validation is a better approximation. Is that correct?

If so, I guess what I'm asking is how bad would it be to use - say - DIC to compare the models instead of LOO cross-validation. (I have something like 50 model comparisons to run, so even if I can get loo working in the way Aki suggests, it's going to take weeks of computer time.) I.e., if DIC suggests that there IS a difference between the two models, is there a high probability that I'm making a Type I error?

Thanks
Ben

AIC, DIC, and WAIC also consider single pointwise predictions, and thus will not help with the groupwise prediction (which can be approximated with leave-one-group-out) that you need here.

AIC, DIC, WAIC, and LOO are methods to estimate predictive performance in the pointwise case. LOO is the best of these. BIC is something else.

But if you need something other than estimated predictive performance in the pointwise case, then you should use something else. I don't know of any approach other than k-fold cross-validation to estimate the predictive performance in the hierarchical setting you have (under the M-open assumption).

DIC is always worse than LOO, and if LOO is not applicable to your problem then DIC is definitely out of the question.

I repeat that this comparison is conceptually very easy with k-fold CV; your problem now is just that you need to write a few lines of code to divide the data according to the PID group structure.

In the future, we'll add support to rstanarm so that you can do the following very easily:

fit1 <- stan_glmer(Un ~ (1 + Entrenchment | PID) + (1 | Verb), data = BOTH)
fit2 <- stan_glmer(Un ~ (1 + Entrenchment | PID) + (1 | Verb) + Entrenchment, data = BOTH)

(kfold1 <- kfold(fit1, groupby = PID))  # not yet implemented
(kfold2 <- kfold(fit2, groupby = PID))  # not yet implemented
compare(kfold1, kfold2)

I'm sorry, but we don't have the resources to add these any quicker.