WAIC, AICc and DIC differences in a Bernoulli GLMM in brms vs glmmTMB


#1
  • Operating System: WIndows 10
  • brms Version: 2.4.0

Hi all,

I’m trying to determine which variables explain the variation in a binary outcome (divorce yes/no) and I have 4 random intercepts that I need to take into account.
The data looks like that:

‘data.frame’: 321 obs. of 8 variables:
ProbDiv : num 1 0 1 0 1 1 0 0 0 0
stdAgeMale: num -0.534 -0.638 -0.534 -0.534 -0.326 …
stdQlty : num -0.317 -0.317 -0.317 -0.317 -0.317 …
stdQltySq :Class ‘AsIs’ num [1:321] 0.101 0.101 0.101 0.101 0.101 …
Year : num 2003 2006 1992 2007 2005 …
Site : Factor w/ 52 levels “Q11”,“Q12”,“Q13”,…: 23 23 23 23 23 23 23 23 23 23 …
BandM : Factor w/ 78 levels “118600236”,“118601116”,…: 59 75 14 75 59 14 14 14 59 14 …
BandF : Factor w/ 81 levels “118600238”,“118601133”,…: 72 57 11 57 57 39 39 39 72 11 …

IC_comp_data.txt (30.6 KB)

I initially used the glmmTMB package but got some convergence problems with some of the models, so I tried to switch to brms to benefit from the flexibility of Bayesian methods. However, when I compare the results from both methods (which are similar on another, simpler, dataset), the information criteria I used (AICc for glmmTMB and WAIC for brms) give me completely opposite results in this case. How is that possible?

Here I compare two models (see below for exact code of the models) at the opposite side of the IC table (one with site quality as a predictor, and one with male age):
For WAIC:

brms::WAIC(mod_brm_Qlty, mod_brm_Age)
WAIC SE
mod_brm_Qlty 128.24 17.52
mod_brm_Age 151.24 20.60
mod_brm_Qlty - mod_brm_Age -23.00 6.32

The model with quality has a smaller WAIC. The Rhat and effective sample size look good, so I think it’s mixing well. And I think I’m using fairly classic and uninformative priors …

For AICc:

MuMIn::AICc(mod_TMB_Qlty, mod_TMB_Age)
df AICc
mod_TMB_Qlty 7 180.6264
mod_TMB_Age 6 168.6190

Here the model with Age seem to fit the data much better. Importantly, I ran the same models with R2jags and the DIC were similar to AICc. Any idea what could be the issue?

Here is the code for the different models
mod_brm_Qlty <- brm(ProbDiv ~ 0 + intercept + stdQlty + stdQltySq + (1|Year) + (1|Site) + (1|BandM) + (1|BandF), data = brms_data, family = bernoulli(), prior = c(set_prior("normal(0,5)", class = "b"),set_prior("student_t(3,0,5)", class = "sd")), control = list(adapt_delta = 0.99), chains = 4, iter = 10000, warmup = 1000, thin = 5, cores = getOption("mc.cores", 4), save_all_pars = TRUE, seed = 190386)

mod_brm_Age <- brm(ProbDiv ~ 0 + intercept + stdAgeMale + (1|Year) + (1|Site) + (1|BandM) + (1|BandF), data = brms_data, family = bernoulli(), prior = c(set_prior("normal(0,5)", class = "b"),set_prior("student_t(3,0,5)", class = "sd")), control = list(adapt_delta = 0.99), chains = 4, iter = 10000, warmup = 1000, thin = 5, cores = getOption("mc.cores", 4), save_all_pars = TRUE, seed = 190386)

mod_TMB_Qlty <- glmmTMB(ProbDiv ~ stdQlty + stdQltySq + (1|Year) + (1|Site) + (1|BandM) + (1|BandF), family = "binomial", control = glmmTMBControl(optCtrl = list(iter.max = 1000, eval.max = 1000), profile = TRUE), data = brms_data)

mod_TMB_Age <- glmmTMB(ProbDiv ~ stdAgeMale + (1|Year) + (1|Site) + (1|BandM) + (1|BandF), family = "binomial", data = brms_data)

There seem to be quite big differences in the estimates between the two methods too.
Models with nest quality:
brms::fixef(mod_brm_Qlty)

Estimate Est.Error Q2.5 Q97.5
intercept -5.708118 1.856913 -10.103616 -2.9456789
stdQlty -3.859118 1.998710 -8.241969 -0.4253715
stdQltySq -2.708594 2.157818 -7.349208 1.1732692

glmmTMB::fixef(mod_TMB_Qlty)

Conditional model:
(Intercept) stdQlty stdQltySq
-3.0912 -1.8143 -0.8036

Models with age:
brms::fixef(mod_brm_Age)

Estimate Est.Error Q2.5 Q97.5
intercept -4.968025 1.315884 -8.135548 -3.098368
stdAgeMale -3.620256 1.280105 -6.432338 -1.432056

glmmTMB::fixef(mod_TMB_Age)

Conditional model:
(Intercept) stdAgeMale
-3.525 -2.906

Thanks for your help!


#2

AIC, DIC, and WAIC are all approximations of leave-one-out cross-validation under differently restrictive assumptions (AIC is the most restrictive). AIC and DIC are not recommended in particular not with Bayesian models. So it is not necessarily surprising that the results differ. An even petter approximation than WAIC is PSIS-LOO-CV via function loo. See also

Gelman, A., Hwang, J., & Vehtari, A. (2014). Understanding predictive information criteria for Bayesian models. Statistics and computing, 24 (6), 997-1016.

Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing , 27 (5), 1413-1432.


#3

Thanks for your reply.

The AIC were calculated using frequentist models. However, although I would expect some differences in the ranking of models between frequentist models using AIC and Bayesian models using WAIC, my issue here is that the ranking was almost completely opposite between my AIC and WAIC table my set of models (i.e. the ‘best’ model in one case was among the worse ones in the other case).
This is all the more surprising that AIC and WAIC were congruent for two other analyses that I conducted using the same methodology on smaller datasets (with less complex random effect structure). I am pretty sure the WAIC results are wrong in the case I presented, because exploratory analyses and plotting of ‘marginal_effects’ between the two models reveals that the age model seems to fit the data better than the nest quality model.
What aspect of my brms model could cause the WAIC to be way off in this case? Wrong priors?

I have seen and read a bit about the PSIS-LOO-CV option but I put on the side because of its computing time (I’m comparing quite a few models) and also because I wasn’t sure how to compute a ‘model averaging’ approach with it. (The few trials I did also gave me warning messages about the pareto k being too high for some points)


#4

AIC is really bad for estimating the model complexity for hierarchical models (which you call mixed models), and more complex hierarchical model you have (or more complex random effect structure as you write), more AIC will be off. So my guess is that AIC is the one which is wrong here, but of course can’t be sure, as I didn’t run the models and didn’t check the details of the models. It would help, if you post the number of observations, the total number of parameters, p_loo and Pareto k diagnostics.

If you need model average use Bayesian stacking. We have also code for that in loo package.

If you get these warnings it means also that WAIC and AIC cannot be trusted. Maybe we should compute PSIS-LOO and Pareto k diagnostic also when people call WAIC, and give the same warning, so that people don’t think that it’s safe to use WAIC just because it lacks a good diagnostic? Could you post the output from loo, p_loo and Pareto k diagnostics, and the number of observations and the total number of parameters and I can tell what we can infer from those.


#5

Good to know. Is that the case in both frequentist and Bayesian frameworks? Also, if I’m comparing models with the same random effects that differ only in terms of the fixed effects, is it really an issue if AIC is off, as I’m interested in their relative performance?

Thanks for the tip. I used the loo_model_weights function on the two models I described in my first post and it’s giving:

Method: stacking
weight
mod_brm_Age 0.335
mod_brm_Qlty 0.665

with some warnings about Pareto k diagnostic values again (not surprisingly). I compared the AICc weights for the same model ran with glmmTMB and the model with Age as a 0.994 weight!!

Ok I see. Yes, maybe it would be useful to add those warnings, as people (and me first) tend to think that if there are no warnings it’s safe to use!
My dataset has 321 observations, each model has 4 group-level effects and either 2 or 3 population-level effects.
Some output of the `loo’ function for each model are below, hope that’s what you needed.

loo(mod_brm_Age)
Computed from 7200 by 321 log-likelihood matrix
         Estimate   SE
elpd_loo    -81.9 11.2
p_loo        41.8  6.5
looic       163.8 22.4
Pareto k diagnostic values:
                         Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     175   54.5%   1157      
 (0.5, 0.7]   (ok)       110   34.3%   138       
   (0.7, 1]   (bad)       36   11.2%   67        
   (1, Inf)   (very bad)   0    0.0%   <NA>
loo(mod_brm_Qlty)

Computed from 7200 by 321 log-likelihood matrix

        Estimate   SE
elpd_loo    -80.6 11.4
p_loo        53.1  8.4
looic       161.3 22.8
------
Monte Carlo SE of elpd_loo is NA.

Pareto k diagnostic values:
                        Count Pct.    Min. n_eff
(-Inf, 0.5]   (good)     113   35.2%   2932      
(0.5, 0.7]   (ok)       155   48.3%   325       
  (0.7, 1]   (bad)       47   14.6%   22        
  (1, Inf)   (very bad)   6    1.9%   7 

#6

Hi @tmerkling,

The additional information makes this much more clear.

From your earlier post I see Site, BandM and BandF have 52, 78 and 81 levels, and there are years at least between 1992 and 2007, and including population level effects would make more than 230 parameters. That is a lot compared to 321 observations with Bernoulli model.

If I read brms function call correctly, you are using quite wide prior for the parameters which means that your prior is saying that with 95% probability the predictive probabilities are smaller than 0.01 or larger than 0.99, so you are assuming very accurate predictions.

This wide prior, small data compare to the number of parameters make the model very flexible and many observation are highly influential. This means also that you are far away from the asymptotic regime for AIC (or even AICc).

The prior is shrinking a bit which is reflected by p_loo being much smaller than the total number of parameters.

High Pareto k values and Bernoulli model hints that some of the groups have only a few observations, and then those observations are highly influential, that is the posterior with them or without is very different and then WAIC and PSIS-LOO fail.

You should use narrower prior or allow the prior scale to be unknown so that the prior scale is learned from the groups having more observations and then benefits the groups with less observations. I don’t remember how to do that in brms, but I know you can do it and maybe @paul.buerkner can post the link to the relevant instructions.


#7

Thanks, Aki for the detailed response. I am not sure what you mean with your last paragraph, though, if not multilevel modeling itself?

To me, it seems as if you most definitely need stronger priors right away in particular to the standard deviations of the varying effects. You will find detailed instructions about that via ?set_prior.


#8

I assume set_prior("normal(0,5)", class = "b") sets normal(0,5) prior on all coefficents, while I suggest using normal(0,sigma) where sigma is unknown, so there is no need to guess a value for it.


#9

Ah, I see. You mean a hierarchical prior on the main regression coefficients. Yes, this is possible, although more advanced to realize in brms (see https://github.com/paul-buerkner/brms/issues/459).

In order to use this prior, please make sure that all your predictors are on the same scale (e.g., standardized).


#10

Thanks @avehtari and @paul.buerkner for your answers!

Hmm I (naively) thought that the number of parameters is the number of regression parameters and standard deviations of the group-level effects. I didn’t know that the number of levels of each group-level effects counted as parameters … Indeed 230 parameters is a lot for 321 observations! So if I understand correctly, model inference will be better if there are less levels (as long as there is enough to estimate a standard deviation) per grouping factor (with more observations per level) than the reverse?

Does that apply to the population-effect prior or to the prior for the SDs of group-level effects?
And so if I understand correctly, my prior (or both priors?) was too uninformative and the model doesn’t have enough data to accurately estimate the different parameters?

Thanks @paul.buerkner, my predictors are already standardized so I’ll give it a try! Alternatively, would using a normal prior on the main regression coefficients with a smaller standard deviation be another solution? How could I then diagnostic whether the prior is informative enough or not?


#11

Since you only seem to have few predictors, the hierarchicial priors is probably too much. A narrower normal prior will likely do. Still, you need to set more informative priors on the varying effects’ SDs.


#12

Model inference is easier with less levels, but model is likely to be better with more levels and good hierarchical priors because then you are borrowing information between groups (kind of partial pooling).

It should be more important for group-level effects as there were more parameters in there.

You are on the edge, and some of the parameters are weakly identified and Bernoulli distribution makes it more difficult. It might be that better hierarchical model would solve the problem.


#13

OK, thanks, I’ll give it a try with more informative priors on the SDs. Should I keep a half student-t prior or would a half-Cauchy maybe be better?

What characteristic of the Bernoulli distribution makes it more difficult? Would it be as difficult with any other non-Gaussian distribution?


#14

I would go for half-student-t or even half-normal. Half-cauchy will likely have too fat tails.


#15

Related to that I said that you are on the edge, the likelihood has the mode at the edge. You can plot the posterior given weak prior and observation y=0 or y=1 and you can see that the posterior changes a lot. You would like to borrow more information across groups.


#16

Thank you @paul.buerkner and @avehtari for your advice!