Bayes factor estimation seems a long way off

I’m having some issues with computing Bayes factors for two competing models. Both models seem to fit fine and I can compute information criteria and extract parameter estimates. However, when I use bridge sampling to calculate Bayes factors, one model appears strongly preferred (estimated Bayes factor is extraordinarily, unrealistically high). This seems inaccurate, as the two models are very similar in WAIC and the extra parameter in the more complex model has very little effect.

Because I’m using bridge sampling, I increased the number of posterior samples substantially (to around 18000). The bridge sampler isn’t reaching its iteration limit either (it took around 64 iterations for the more complex model).

I’m just wondering what might be causing this mismatch. I’ve included the more complex model below; all priors are relatively weak, regularising priors. The simpler model is the same, only without the bPos parameter and the bPos * Pos term in the likelihood.

Any help is much appreciated!

parameters {
  // Group-level hyperparameters
  real mu_bCons;
  real mu_bSymm;
  real mu_bAcc;
  real mu_bPos;
  real<lower=0> sigma_bCons;  // sd of group-level distribution
  real<lower=0> sigma_bSymm;  // sd of group-level distribution
  real<lower=0> sigma_bAcc;   // sd of group-level distribution
  real<lower=0> sigma_bPos;   // sd of group-level distribution
  // Subject-level parameters (raw)
  vector[N] bCons;  // Constant coeff
  vector[N] bSymm;  // Asymm coeff
  vector[N] bAcc;   // Accuracy coeff
  vector[N] bPos;   // Positivity coeff
  vector<lower=0>[N] sigma_ID;  // Vector of std of subject-level updates
}

model {

  // Group-level priors
  mu_bCons ~ normal(0, 1);
  mu_bSymm ~ normal(0, 1);
  mu_bAcc ~ normal(0, 1);
  mu_bPos ~ normal(0, 1);
  sigma_bCons ~ gamma(1, 0.5);
  sigma_bSymm ~ gamma(1, 0.5);
  sigma_bAcc ~ gamma(1, 0.5);
  sigma_bPos ~ gamma(1, 0.5);
  // Subject_level priors
  bCons ~ normal(mu_bCons, sigma_bCons);
  bSymm ~ normal(mu_bSymm, sigma_bSymm);
  bAcc ~ normal(mu_bAcc, sigma_bAcc);
  bPos ~ normal(mu_bPos, sigma_bPos);

  // Vectorised likelihood
  UpdateS ~ normal((bCons[S] + bAcc[S] .* Acc + bPos[S] .* Pos) .* (1 + (bSymm[S] .* Dir)), sigma_ID[S]);
}

  1. The bridge sampler's behaviour indicates that the terms of one of the sums in the computation have non-finite mean and variance, so the estimate may have arbitrarily large error.
  2. Even if the computation were exact, when the models are similar but not well specified, the Bayes factor can be overconfident, and the preferred model can flip as more data are added; see [2003.04026] When are Bayesian model probabilities overconfident?
  3. Even if one of the models could match the data-generating distribution for some parameter values, the Bayes factor can be prior sensitive; especially with wide priors and finite data, it can strongly favor the simpler model.
  4. Bayes factor and WAIC measure different things, so they can give different results. (Instead of WAIC it would be better to use PSIS-LOO, as it is more accurate and has better diagnostics; see e.g. Cross-validation FAQ • loo.)
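
Point 3 can be made concrete with a toy calculation that is unrelated to the model in this thread. The following is a minimal sketch, assuming a single normal mean with known standard deviation, a simpler model that fixes mu = 0, and a wider model with prior mu ~ normal(0, tau). The function name `log_bf01` and all numbers are made up for illustration; the data summary is held fixed and only the prior width tau changes.

```python
import math


def normal_logpdf(x, mean, sd):
    """Log density of normal(mean, sd) evaluated at x."""
    z = (x - mean) / sd
    return -0.5 * z * z - math.log(sd) - 0.5 * math.log(2.0 * math.pi)


def log_bf01(ybar, n, sigma, tau):
    """Log Bayes factor for M0 (mu = 0) over M1 (mu ~ normal(0, tau)).

    ybar is the mean of n observations with known sd sigma, so
    ybar ~ normal(mu, sigma / sqrt(n)). Integrating mu out under M1
    gives ybar ~ normal(0, sqrt(sigma^2 / n + tau^2)).
    """
    se = sigma / math.sqrt(n)
    log_m0 = normal_logpdf(ybar, 0.0, se)
    log_m1 = normal_logpdf(ybar, 0.0, math.sqrt(se * se + tau * tau))
    return log_m0 - log_m1


# Hypothetical data summary: a small effect, two standard errors from zero.
ybar, n, sigma = 0.2, 100, 1.0
for tau in (1.0, 10.0, 100.0):
    bf01 = math.exp(log_bf01(ybar, n, sigma, tau))
    print(f"tau = {tau:6.1f}  BF01 = {bf01:.2f}")
```

In this toy setting the Bayes factor in favour of the simpler model keeps growing as tau increases, even though the data are unchanged and the posterior for mu under the wider model barely moves once the prior is much wider than the likelihood.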

Thanks @avehtari for your insight into this!

Based on all of this, it seems as though this might not be the best approach for what I’m trying to achieve (provide evidence for/against the inclusion of the additional parameter). It seems as though PSIS-LOO could achieve a similar goal, instead.

As an aside – my understanding is that for k-fold LOGO cross-validation, the elpd for each group can be taken as a measure of best-fit for the excluded group. Is my understanding flawed, here?

If the additional parameter is (almost) independent of the other parameters in the posterior, then you can look directly at its posterior. If the additional parameter is correlated with other parameters in the posterior, the posterior can be difficult to interpret, and it is then useful to look at how including the parameter changes the quantity of interest, although correlations can again make that harder to examine. Looking at the change in predictive performance is a generic approach, but it is not necessarily the most efficient one, nor the one that best corresponds to your quantity of interest.
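
A rough way to check the "(almost) independent" condition is to compute the correlation between the posterior draws of the additional parameter and the draws of the other parameters, and then summarise its marginal posterior. The following is a minimal sketch under stated assumptions: `draws_bPos` and `draws_bAcc` are hypothetical stand-ins for extracted draws of mu_bPos and mu_bAcc, filled here with synthetic random numbers rather than output from the model above.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation between two equally long lists of draws."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def central_interval(draws, prob=0.9):
    """Central posterior interval using linearly interpolated quantiles."""
    s = sorted(draws)

    def quantile(q):
        pos = q * (len(s) - 1)
        lo = math.floor(pos)
        hi = min(lo + 1, len(s) - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])

    tail = (1.0 - prob) / 2.0
    return quantile(tail), quantile(1.0 - tail)


# Synthetic, independent draws standing in for real posterior draws.
random.seed(1)
draws_bPos = [random.gauss(0.05, 0.1) for _ in range(4000)]
draws_bAcc = [random.gauss(0.5, 0.1) for _ in range(4000)]

print("posterior correlation:", pearson(draws_bPos, draws_bAcc))
print("90% interval for mu_bPos:", central_interval(draws_bPos, 0.9))
```

A correlation near zero only rules out linear dependence, so a scatter plot of the draws is still worth a look before reading the marginal interval on its own.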

It is a measure of the predictive performance for that group when that group's data were not used to form the posterior.
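
To make that quantity concrete, here is a minimal sketch of how the elpd for one held-out group could be computed from posterior draws. It assumes a hypothetical matrix `log_lik[s][i]` holding the log-likelihood of held-out observation i at posterior draw s, where the draws come from a fit that excluded that group. Both function names are made up for illustration: the pointwise version sums per-observation log predictive densities, while the joint version treats the whole group as one block; which one is appropriate depends on the prediction task.

```python
import math


def log_mean_exp(values):
    """Numerically stable log(mean(exp(values)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def group_elpd_pointwise(log_lik):
    """Sum over held-out observations i of log mean_s p(y_i | theta_s)."""
    n_obs = len(log_lik[0])
    return sum(
        log_mean_exp([draw[i] for draw in log_lik]) for i in range(n_obs)
    )


def group_elpd_joint(log_lik):
    """Log mean_s prod_i p(y_i | theta_s): the group predicted as one block."""
    return log_mean_exp([sum(draw) for draw in log_lik])


# Hypothetical log-likelihoods: 2 posterior draws x 2 held-out observations.
log_lik = [
    [math.log(0.5), math.log(0.5)],
    [math.log(0.25), math.log(1.0)],
]
print("pointwise elpd:", group_elpd_pointwise(log_lik))
print("joint elpd:", group_elpd_joint(log_lik))
```

Higher values mean the held-out group was predicted better; comparing the two models group by group on this scale shows where the extra parameter helps or hurts prediction.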
