LOOIC and elpd_diff (rstanarm model)


Hi everyone,

I fitted a stan_glmer model to my data and wanted to use loo to compare models with and without a fixed effect of interest. This is my full model:

post_rep_full <- stan_glmer(Resp13 ~ (Distance * Context ) + 
                     (Distance * Context || Subject), 
                   data = myData, family = binomial(),
                   iter=4000, chains=6, 
                   prior_intercept = normal(0,7.5),
                   prior = normal(0,2.5), QR = TRUE) # I removed the int term; the model does it automatically.

and then I dropped my strongest fixed effect (this fixed effect had a coefficient of 0.56, 95% HDI [0.44, 0.68]), keeping the random-effects structure untouched:

post_rep_no_dist <- stan_glmer(Resp13 ~ (Context + Distance:Context ) + 
                                   (Distance * Context || Subject), 
                                 data = myData, family = binomial(),
                                 iter=4000, chains=6, 
                                 prior_intercept = normal(0,7.5),
                                 prior = normal(0,2.5), QR = TRUE) 

Then I calculated loo for each model:

loo_full <- loo(post_rep_full)
loo_no_distance <- loo(post_rep_no_dist)

and comparing the models, I got

elpd_diff        se 
     -0.4       1.1 
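
For reference, a minimal sketch of the comparison call that produces output of this form (assuming the two loo objects above; older versions of the loo package use `compare()`, newer ones `loo_compare()`):

```r
library(loo)
# older loo API, as used later in this thread:
compare(loo_full, loo_no_distance)
# in loo >= 2.0 the equivalent is:
# loo_compare(loo_full, loo_no_distance)
```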

suggesting (if I understand correctly) that we are indifferent between the two models, with an unreliable preference for the full model. Why isn’t the full model preferred in this case? I got a similar result when I dropped another fixed effect whose coefficient was around zero.

This is what happens when I drop each fixed effect in turn and compare all the models. All the values look very similar, and I don’t know why…

                        looic    se_looic  elpd_loo  se_elpd_loo  p_loo  se_p_loo
post_rep_full           20498.0  117.7     -10249.0  58.8         71.7   0.8
post_rep_no_dist        20498.7  117.8     -10249.3  58.9         73.1   0.8
post_rep_no_context     20498.9  117.6     -10249.4  58.8         72.2   0.8
post_rep_no_interaction 20499.3  117.7     -10249.7  58.8         71.8   0.8


We avoid using the common names fixed effects and random effects, which are not only misleading but also defined differently across the various fields in which these models are applied. We instead favor the more accurate (albeit more verbose) intercepts/coefficients that are common across groups and intercepts/coefficients that vary by group. See more at

I’m not fluent in the lmer formula syntax, but if I read it correctly, in your second model you still have group-specific distance and context terms, so the model still has those available. Now it seems that the groups are so different that the common term is not useful, and that the groups have so much data that the group-specific terms are very accurate. p_loo being around 72 also supports that hypothesis. See http://www.stat.columbia.edu/~gelman/research/published/final_sub.pdf for related discussion.



thanks Aki,

  1. Re the fixed/mixed terminology: I’m convinced.
  2. After reading Wang & Gelman (2014) I understand your suggestion, and it describes my data well. I indeed have 24 groups (subjects) with ~750 observations per group.
     I am now writing up my results. To report my conclusions about group-common coefficients, I understand that in my case ELPD isn’t a sensitive enough measure, right? Should I just report the median and 95% HDI of the relevant posteriors (of my chosen model), without comparing models by dropping single group-common coefficients (as people sometimes do in lme4, reporting LRT statistics)?


Or it’s sensitive enough to say that you can get practically the same predictions with just the group-specific terms.

You could check how much of the variation the group-common terms can explain.
If using elpd, you could check how much difference there is between the no-covariate model and the common-only model, and how much difference there is between the common-only model and the group model. Instead of elpd, you could compute, e.g., MSE, which is easier to interpret; from the MSEs you can infer the percentage of variation explained by each model.
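
As a rough sketch of that last suggestion (hypothetical object names: `mse_null`, `mse_common`, and `mse_group` are assumed to hold the MSEs of the no-covariate, common-only, and group models):

```r
# Proportion of variation explained relative to the no-covariate model,
# analogous to an R^2 computed from mean squared errors
prop_explained_common <- 1 - mse_common / mse_null
prop_explained_group  <- 1 - mse_group  / mse_null
```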



Hi Aki, thanks again. I calculated the MSE for the three models you suggested, and indeed the model relying only on group-specific parameters minimized the MSE, as we expected.

[1] 0.2155973
[1] 0.2496214
[1] 0.193406

Yet my key question isn’t whether I can predict the data equally well using only group-specific parameters. The core of my project is to show that context predicts my binary data. How would you show that?
Is it legitimate to compare the full model to a model without both the group-specific context coefficients and the group-common context coefficient, using ELPD? As far as I know, the common practice in lme4 is to keep the “random structure” (group-specific coefficients and intercepts) unchanged when comparing models, dropping only a single group-common coefficient at a time.


Are these LOO-MSEs?
(Although it probably doesn’t matter, since you have so much data.)

Could you tell me more about your project? It’s still a bit abstract to me.

Certainly it’s legitimate if that’s what you want, but as I said, I don’t yet understand the project, so I don’t know what you really want. Maybe you can tell me more?



No. I just manually calculated the MSE from the residuals of each model:

 mse <- function(model) mean(residuals(model)^2)  # mean squared residual

Thanks for asking!
The main question of the project is - is the perception of ambiguous objects affected by the context in which they appear, when this context is rendered subliminal via psychophysical techniques (i.e., we “mask” it, so the eyes receive the visual input but still subjects are unaware of it).
Subjects are classifying a series of six ambiguous objects which can be perceived either as the letter B, or as the number 13 (binary response).
Independent variables: The ambiguous objects differ in what I call distance (is it physically more B or more 13?; a continuous predictor with 6 levels). The objects are also embedded in a context (letters-“ABC” or numbers-“121314”; binary predictor), and then, there is the two-way interaction between them.
And i’ll mention that I have ~750 trials per subject, (i.e., dozens of reps of each distance X context combination).

This illustrates my stimulus (except that I present it with only letters or only numbers on each trial, not both as in this picture).


So what is most important is to know whether context reliably modulates performance or not. I have the group-common coefficients to show this:

Is there another way to show whether context modulates performance using model comparison?

Thanks a lot for your help, I really appreciate it.


Thanks for the description, now it’s clearer. I’m in a hurry, and since I’ll be away for the next three days, a quick response:

To test this, remove context from both the common and the group-specific parts, and check whether the elpd and the classification accuracy change (I mentioned RMSE earlier, but now realized you have a binary response). This demo https://github.com/avehtari/BDA_R_demos/blob/master/demos_rstan/diabetes.Rmd (sorry it’s not rendering well on GitHub, but you can clone the repo and run it in RStudio) also shows how to compute LOO classification accuracy and LOO AUC.
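
A condensed sketch of the relevant steps from that demo, adapted to the model above (assumes loo >= 2.0 and that `Resp13` is coded 0/1; the object names are from this thread, and the exact calls may differ slightly in your setup):

```r
library(loo)
library(rstanarm)

ll <- log_lik(post_rep_full)                    # pointwise log-likelihood matrix
psis_obj <- psis(-ll)                           # PSIS smoothing of importance ratios
pred <- posterior_linpred(post_rep_full, transform = TRUE)  # posterior P(Resp13 = 1)
ploo <- E_loo(pred, psis_obj, type = "mean", log_ratios = -ll)$value

# LOO classification accuracy
acc_loo <- mean(round(ploo) == myData$Resp13)
```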

In addition, I recommend testing a nonlinear function for distance. stan_gamm4 might be able to do this (with splines); if not, brms can do it (with splines or GPs).
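
A hedged sketch of what the brms spline variant might look like (the formula details are an assumption, not tested on this data; `k` is kept small because distance has only 6 levels):

```r
library(brms)
# Nonlinear effect of Distance, allowed to differ by Context (smooth term)
fit_spline <- brm(
  Resp13 ~ s(Distance, by = Context, k = 5) + Context +
    (Distance * Context || Subject),
  data = myData, family = bernoulli(),
  iter = 4000, chains = 6
)
```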



Hi Aki! thanks.

I did that, and found your demo very helpful. I found a difference in ELPD:

> compare(loo_full, loo_no_context)
elpd_diff        se 
    -33.9       8.2 

However, using your demo, I did not find a difference in LOO balanced classification accuracy (both were exactly 0.31) or in LOO AUC.

Do you have any idea why that is?

Many thanks, again!


[quote="Dan_Biderman, post:9, topic:1450"]
I found a difference in ELPD:
[/quote]

I realized that you might also want to look at the predictive performance when you leave out a whole group at a time. Unfortunately you can’t do this easily with rstanarm. It’s on our to-do list to allow a specific data division in the kfold function, but meanwhile you would need to use rstan and write your own cross-validation wrapper. There are some examples (maybe @jonah remembers where).

If a change to the model doesn’t affect the ordering of the probabilities (in this case I guess the probabilities are determined mostly by distance), then the classification accuracy and the AUC do not change. The change can still make the probability estimates more accurate, which can change the ELPD.
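
A toy illustration of that point (made-up numbers): two sets of predicted probabilities with the same ordering give the same classifications, but different log scores (the ELPD analogue):

```r
y  <- c(0, 0, 1, 1)
p1 <- c(0.2, 0.4, 0.6, 0.8)
p2 <- c(0.3, 0.45, 0.55, 0.7)  # same ordering, but less sharp

log_score <- function(p, y) sum(y * log(p) + (1 - y) * log(1 - p))
log_score(p1, y)  # higher (better calibrated here)
log_score(p2, y)  # lower, even though...
mean((p1 > 0.5) == (y == 1)) == mean((p2 > 0.5) == (y == 1))  # ...same accuracy
```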



Hi Aki, thanks.

Indeed, the order of the probabilities is determined mostly by distance, and your explanation sounds convincing. I think that for the time being I’ll stay with ELPD, and for my parameter of interest, as you suggested, I’ll drop it both from the subject-specific and the subject-common levels and compare with the full model.