Contradictory results for model comparison when using ELPD-LOO versus Bayes Factors

Hello all,

I have been making my first foray into using brms for mixed-effects models and model comparison, and I have run into a theoretical problem that I can't quite wrap my head around. I was initially using bayes_factor to compare different models, but after repeated convergence warnings (even after substantially increasing the number of posterior samples), someone recommended I use ELPD instead. Sure enough, loo_compare gives no warnings, but, unexpectedly, it contradicts the Bayes factor results. Of course, the Bayes factor comparison did produce warnings, so it might simply be wrong. I am not convinced of that, though: the Bayes factor values were consistently enormous, and they could be reproduced with a BIC-based Bayes factor approximation on a non-Bayesian lmer fit. Moreover, not all of the comparisons produced warnings, and even the ones without warnings contradicted the ELPD results.
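
For reference, the BIC-based approximation I mean is along these lines (a rough sketch only; the lmer formulas here are simplified placeholders rather than my actual specification):

library(lme4)

# Placeholder lmer fits with a simplified random-effects structure (illustration only)
fit_cue      <- lmer(log(RT) ~ cue * matching + (1 | participant) + (1 | item), data = data)
fit_combined <- lmer(log(RT) ~ cue * norming * matching + (1 | participant) + (1 | item), data = data)

# Schwarz/BIC approximation to the Bayes factor in favour of the combined model:
# BF ~= exp((BIC_simpler - BIC_larger) / 2)
exp((BIC(fit_cue) - BIC(fit_combined)) / 2)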

Thus, another explanation may be that the Bayes factor comparison and the ELPD comparison simply give different answers. I understand that the two measure different things, so this is theoretically possible, but I do not fully understand what the implications would be for my results. Under what circumstances would one expect the two measures to give opposite results? Intuitively, it feels like this may be related to overfitting: a model that allows for more overfitting might perform better on the Bayes factor comparison, but not necessarily on the ELPD comparison. Am I thinking in the right direction? Does anyone have a simple example of a scenario in which this may occur?

Since this is more of a theoretical question, I figured that the exact technical details may not be crucial. If you have a purely theoretical answer, I’m more than interested in hearing it, and you wouldn’t need to bother with the specifics of my model. However, I’ll try to mention some details that may be relevant below. If you need more information to properly answer this question, do let me know!

In short, I am analysing data from a behavioural experiment in which there are different cue types (cue). For each stimulus (item) within the different cue types, I have obtained descriptive scores through a norming experiment (norming). I am trying to answer the question of whether these norming scores can explain a difference in reaction times (RT) that we found, or whether the cue type is required (and possibly even sufficient) to explain this effect.

The relevant parts of the model specifications:

cue_formula <- brmsformula(
    RT ~ cue * matching + (...) + (cue + matching | participant) + (cue + matching | item),
    family=gaussian(link='log')
)

norming_formula <- brmsformula(
    RT ~ norming * matching + (...) + (norming + matching | participant) + (norming + matching | item),
    family=gaussian(link='log')
)

combined_formula <- brmsformula(
    RT ~ cue * norming * matching + (...) + (cue + norming + matching | participant) + (cue + norming + matching | item),
    family=gaussian(link='log')
)

Priors (weakly informative, with the RT prior based on values from an earlier study):

priors <- c(
  prior(normal(6.5, 0.5), class="Intercept"),
  prior(normal(0, 1), class="b"),
  prior(cauchy(0, 5), class="sd")
)

Models run with:

norming_model <- brm(norming_formula, data=data, prior=priors, warmup=5000, iter=105000, chains=10, cores=10, save_all_pars=TRUE)
cue_model <- brm(cue_formula, data=data, prior=priors, warmup=5000, iter=105000, chains=10, cores=10, save_all_pars=TRUE)
combined_model <- brm(combined_formula, data=data, prior=priors, warmup=5000, iter=105000, chains=10, cores=10, save_all_pars=TRUE)
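
(Note: I used save_all_pars=TRUE above; if I read the docs correctly, newer brms versions deprecate it in favour of the equivalent save_pars = save_pars(all = TRUE), which is likewise required for the bridge-sampling-based Bayes factors, e.g.:)

combined_model <- brm(combined_formula, data=data, prior=priors, warmup=5000, iter=105000, chains=10, cores=10, save_pars=save_pars(all=TRUE))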

Bayes factor comparisons (the fitted models above were stored under different variable names here):

bayes_factor(mc_combined, mc_cue_only)
bayes_factor(mc_norming_only, mc_combined)

Example results:

Estimated Bayes factor in favor of mc_combined over mc_cue_only: 10369895906583968793856311296.00000
Estimated Bayes factor in favor of mc_norming_only over mc_combined: 519246339874.26733

Warning message (which I get on some, but not all, of the models):

Warning message:
logml could not be estimated within maxiter, rerunning with adjusted starting value.
Estimate might be more variable than usual.
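
A possible way to gauge this variability (I have not fully verified this) would be to call the bridge sampler directly with repeated estimation and a larger maxiter; repetitions and maxiter are arguments of bridgesampling::bridge_sampler, which I am assuming brms passes through:

# Rough sketch: re-estimate the log marginal likelihoods several times to see how
# variable they are; assumes brms forwards repetitions/maxiter to bridgesampling
bs_combined <- bridge_sampler(mc_combined, repetitions = 10, maxiter = 5000)
bs_cue      <- bridge_sampler(mc_cue_only, repetitions = 10, maxiter = 5000)
bs_combined
bs_cue
bridgesampling::bf(bs_combined, bs_cue)  # Bayes factor(s) from the stored logml estimates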

ELPD-LOO comparisons:

mc_cue_only <- add_criterion(mc_cue_only, "loo", ndraws=5000, cores=12)
mc_norming_only <- add_criterion(mc_norming_only, "loo", ndraws=5000, cores=12)
mc_combined <- add_criterion(mc_combined, "loo", ndraws=5000, cores=12)

loo_compare(mc_cue_only, mc_norming_only, mc_combined)

Example output:

                elpd_diff se_diff
mc_combined        0.0       0.0
mc_cue_only       -7.2      21.0
mc_norming_only -112.5      24.7
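
In case it is useful, the full loo output with p_loo and the Pareto-k diagnostics can be printed from the stored criteria (my understanding is that add_criterion stores them under $criteria):

loo_combined <- mc_combined$criteria$loo   # criterion stored by add_criterion above
print(loo_combined)                        # elpd_loo, p_loo, looic, Pareto-k summary
loo::pareto_k_table(loo_combined)          # counts of observations per Pareto-k interval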

Setup information (although not everything was run on the same system, and so versions may also differ slightly):

  • Operating System: Windows 10 (19043.1466)
  • R version: 4.1.1
  • brms Version: 2.16.1

@avehtari


BFs can be overconfident in case of model misspecification; see arXiv:2003.04026, "When are Bayesian model probabilities overconfident?"

Using the chain rule, BFs can be presented as a predictive criterion that assesses the performance of predicting the first observation given just the prior, the second observation given the first, and so on (see, e.g., "A survey of Bayesian predictive methods for model assessment, selection and comparison"). BFs thus favor models that are less flexible and have a narrower prior predictive distribution, and this part can weigh more than slightly worse predictive behavior after seeing all the data. LOO estimates the predictive performance conditional on almost all of the data, and can therefore favor more flexible models that are able to learn more given enough data. If you tell me the number of observations and groups, the number of parameters in each model, and show the loo package output with elpd_loo and p_loo (and Pareto-k warnings, if any), I can tell you what I can infer from those.
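
To make that chain-rule decomposition explicit (standard notation, with $y_{1:n}$ the full data and $M_k$ a model):

$$
p(y_{1:n} \mid M_k) = \prod_{i=1}^{n} p(y_i \mid y_{1:i-1}, M_k),
\qquad
\mathrm{BF}_{12} = \frac{p(y_{1:n} \mid M_1)}{p(y_{1:n} \mid M_2)},
$$

so the marginal likelihoods entering the BF are products of one-step-ahead predictive densities whose first factors condition on the prior alone, whereas elpd_loo sums terms $\log p(y_i \mid y_{-i}, M_k)$ that each condition on the other $n-1$ observations.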
