Running bayes_factor on measurement error models is producing weird results

I have two models that I want to compare using bayes_factor. The first model predicts a continuous variable (y) using two categorical predictors (x, z) and their interaction. This is data from a psycholinguistic study, so there are random slopes and intercepts for subjects and items.

y ~ x * z +
    (1 + x * z | subject) +
    (1 + x * z | item)

The second model adds a continuous predictor, f, which is an estimate of a frequency-related measure. It is measured with a known uncertainty, sd.f.

y ~ x * z + me(f, sd.f) +
    (1 + x * z + me(f, sd.f) | subject) +
    (1 + x * z + me(f, sd.f) | item)

I fit these models to a dataset with ~2500 observations of y, using 10k iterations for model 1 and 20k iterations for model 2. I made sure to set save_pars = save_pars(all = TRUE) so that I could run bayes_factor.
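
For reference, the fitting calls looked roughly like this (a sketch only: dat is a placeholder name for my data frame, and I'm leaving out priors and other control arguments):

library(brms)

fit1 <- brm(
  y ~ x * z + (1 + x * z | subject) + (1 + x * z | item),
  data = dat, iter = 10000,
  save_pars = save_pars(all = TRUE)
)

fit2 <- brm(
  y ~ x * z + me(f, sd.f) +
    (1 + x * z + me(f, sd.f) | subject) +
    (1 + x * z + me(f, sd.f) | item),
  data = dat, iter = 20000,
  save_pars = save_pars(all = TRUE)
)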

When I run bayes_factor(fit1, fit2), it takes a long time, and says that it was unable to estimate the logml within maxiter.

In other threads where people report the same warning message, the suggested solution is to just run more iterations. I tried 40k iterations for model 2, and while the warning went away, I've been getting bizarre, huge estimates of the Bayes factor that don't make sense to me (ranging from Inf to very large values in the ~1B range), and I've noticed that the bridge sampler for the model with f runs up to 1000 iterations (the default maxiter for bridge_sampler).
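
In case it matters, the comparison itself is just the default call. I haven't verified this, but I believe extra arguments like maxiter get forwarded to bridgesampling::bridge_sampler, so raising it directly might be another option (sketch):

bayes_factor(fit1, fit2)   # default: maxiter = 1000 per model

# untested assumption: maxiter should be passed through to bridge_sampler
bs1 <- bridge_sampler(fit1, maxiter = 5000)
bs2 <- bridge_sampler(fit2, maxiter = 5000)
bridgesampling::bf(bs1, bs2)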

I also tried this simpler model using 10k iterations for comparison:

y ~ x * z + f +
    (1 + x * z + f | subject) +
    (1 + x * z + f | item)

In this case, bayes_factor(fit1, fit3) produced something reasonable and didn't give any warning. But this model isn't great, since it doesn't take into account the known measurement error of f.

Is this just again a case of needing many more samples for the measurement error model (100k?), or is there something special about the measurement error model that would require a different solution? Do I need to set save_pars = save_pars(all = TRUE, latent = TRUE)?

I haven't been able to find anyone specifically discussing this in the context of measurement error models, so I wanted to ask whether running more iterations would even help. I'm in the process of trying different things, but the models take a long time to fit, so if someone already knows the answer and can spare me some dead ends, I would be incredibly grateful!

Some additional experimentation suggests that it might be a combination of setting latent = TRUE and setting iter higher.

In particular, I was also trying a model just using f (with measurement error):

y ~ me(f, sd.f) +
    (1 + me(f, sd.f) | subject) +
    (1 + me(f, sd.f) | item)

I fit this with 10k samples/chain and save_pars = save_pars(all = TRUE, latent = TRUE), and I did not get a warning when running bayes_factor/bridge_sampler. When I fit it without latent = TRUE in save_pars, I did get the warning. So it looks like latent = TRUE indeed helps for measurement error models.
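
For anyone following along, the fit that avoided the warning was along these lines (sketch, same placeholder dat as before):

fit_me <- brm(
  y ~ me(f, sd.f) +
    (1 + me(f, sd.f) | subject) +
    (1 + me(f, sd.f) | item),
  data = dat, iter = 10000,
  save_pars = save_pars(all = TRUE, latent = TRUE)
)
bridge_sampler(fit_me)   # no maxiter warning with latent = TRUE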

However, the more complex model that contains x, z, and f (with measurement error) still produces the warning at 20k samples/chain. So my guess is that while setting latent = TRUE helped the simpler model, the more complex model will also need more samples on top of that. This is a bit frustrating: the more complex model already takes a long time to fit, takes up ~1.1 GB of disk space at 20k samples/chain, and I'm likely to run out of RAM even with 64 GB. If there's a better solution it'd definitely be helpful, but that might just be the way things are.

Further testing has shown that the more complex model still produces the warning with iter = 30000 in the call to brm. I also managed to fit it with iter = 40000, but ran out of RAM during the call to bridge_sampler (I have 64 GB). It looks like I'll need to adjust my workflow and run this on a cluster with more RAM if more samples are needed.

Indeed, running for 40,000 iterations worked for the more complex model with latent = TRUE. Luckily, I can run this on a cluster where there's enough RAM for it. Hopefully this info ends up helping anyone else running into similar issues.

I think this calculation becomes very difficult due to all the random effects in the model, and I would worry that you will get a very different answer every time you run bayes_factor(). To compute the Bayes factor, we need to integrate the likelihood across all model parameters, and I believe that all the random effects are being treated as additional parameters here. There are at least 4 * (n_subjects + n_items) random effects in the first model that you mention, and possibly many more if x and z have many categories.
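
For a sense of scale, assuming (hypothetically) two-level factors and something like 50 subjects and 100 items, the count would be:

# 4 coefficients per grouping level: intercept, x, z, and x:z
4 * (50 + 100)   # = 600 random-effect parameters to integrate over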

To get a better approximation of the Bayes factor, it might be possible to make use of the likelihood marginalized over the random effects. This likelihood has an analytic form for Gaussian models. It will be a high-dimensional multivariate normal, so it may still be difficult to work with, though there might be some ANOVA-like simplifications if subjects are crossed with items. In general, it is a difficult problem, and I am not sure there is a simple way to get a good BF approximation here.
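
To make the idea concrete, here is a minimal sketch of that marginalized Gaussian likelihood, assuming hypothetical design matrices X (fixed effects) and Z (random effects), a random-effects covariance Sigma_u, and residual SD sigma; none of these names come from brms:

library(mvtnorm)

# Marginalizing over u ~ N(0, Sigma_u) in y = X beta + Z u + e gives
# y ~ N(X beta, Z Sigma_u Z' + sigma^2 I), which can be evaluated directly:
marginal_loglik <- function(y, X, Z, beta, Sigma_u, sigma) {
  V <- Z %*% Sigma_u %*% t(Z) + diag(sigma^2, length(y))
  dmvnorm(y, mean = as.vector(X %*% beta), sigma = V, log = TRUE)
}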

I see; that makes sense. Do you have any thoughts on the best way to compare different models in this case, then? We were originally using models without measurement error (since I hadn’t been aware this was an option), and LOO cross-validation by comparing the elpd_diffs; but a reviewer recommended we use measurement error models, and compare them using Bayes factors instead.

One possibility would be to focus on the largest model, because (I think) the other models are special cases of the largest model. Then the posterior distribution of the largest model would seem to provide information about the plausibility of those special cases.

The loo metrics have a similar issue here: the comparison depends on whether or not the random effects are treated as parameters, and it is possible to reach different conclusions depending on how they are handled.
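
As a rough illustration of the two setups (hypothetical calls; the grouped version is one way to get closer to marginalizing over subjects):

loo(fit1, fit2)                          # pointwise ELPD, conditional on the fitted random effects
kfold(fit1, group = "subject", K = 10)   # leave-whole-subjects-out cross-validation instead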

I think this is one of those situations that often arise in applications, where the time and effort required to get an accurate computation may outweigh the need for the computation. If you went with bayes_factor(), you might at least run it a few times to see how much the value varies, though I realize that might take a while.

The idea is to see whether adding a predictor is helping the model sufficiently to justify its inclusion, so we’re comparing smaller models to larger ones. But running bayes_factor() a few times to get a sense of the stability should be possible, since we can run this on a cluster and even for the larger models it doesn’t take all that long.

Part of my concern is whether the failure to estimate the logml within maxiter would make the distribution of estimates from repeated runs itself unreliable. If we see a fairly stable distribution across several repetitions (I'm thinking 100, though that might be ambitious even with the cluster), even with the warning messages, would that be mostly okay? Or would the estimates be so unreliable that even a stable-looking distribution couldn't be trusted?

I am not sure about the maxiter issue; you might contact the bridgesampling developers about that (brms is calling the bridgesampling package here).

Regarding the model comparisons again, if your predictor is not helping, then the large model would include some fixed effects estimates near 0 and perhaps some wider posterior intervals. That doesn’t seem so bad, and would be a reason to just focus on the larger model.
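
In practice that check could be as simple as something like the following (a sketch, assuming fit2 is the full model containing me(f, sd.f)):

summary(fit2)                        # is the coefficient for me(f, sd.f) near 0?
mcmc_plot(fit2, type = "intervals")  # posterior intervals for a visual check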

I haven’t heard back from the bridgesampling devs yet, but I was able to run some tests on my own, and at least with repetitions = 100, a model for which it was able to estimate the logml within maxiter had a different distribution of logmls than a model for which it couldn’t.

I ran one model with 20k samples/chain (low enough that logml could not be determined within maxiter), and one model with 40k samples/chain (high enough that it could) for 100 repetitions. Here are the results of a simple t.test on the logmls:

        Welch Two Sample t-test

data:  logml20k and logml40k
t = -98.946, df = 152.32, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -31.03350 -29.81846
sample estimates:
mean of x mean of y
-23514.54 -23484.11

The 20k model has a significantly lower logml than the 40k model. There's also more variability in the 20k model compared to the 40k model (as the warning message cautions):

> quantile(logml20k)
       0%       25%       50%       75%      100%
-23520.55 -23516.37 -23514.78 -23513.09 -23504.69

> quantile(logml40k)
       0%       25%       50%       75%      100%
-23487.10 -23485.22 -23484.16 -23483.04 -23480.13
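
For reference, the logml distributions above came from something like this (sketch; fit20k and fit40k are placeholder names for the two fits):

bs20k <- bridge_sampler(fit20k, repetitions = 100)
bs40k <- bridge_sampler(fit40k, repetitions = 100)

logml20k <- bs20k$logml   # vector of 100 log marginal likelihood estimates
logml40k <- bs40k$logml

t.test(logml20k, logml40k)
quantile(logml20k)
quantile(logml40k)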

Thankfully, I was able to get the larger model to work without the warning using 100k samples/chain, so even though that takes a while, it seems best to stick with that for better estimates.