Advice on Bayes factor analysis (logml could not be estimated within maxiter)

I’m running into issues doing a Bayes factor analysis on a dataset I’ve gathered. The model I’m fitting is a pretty simple logistic regression:

correct ~ voice_n + 
    (1 + voice_n | subject) + 
    (1 | item:voice_n)

correct is a binary variable representing response accuracy. voice_n is a deviation-coded 2-level factor with values -0.5 and +0.5. subject is a subject identifier (N = 40), and item is an item identifier (N = 640). Each subject responds to each item once, so the dataset has 40 × 640 = 25,600 rows. Half of the items have voice_n == -0.5, and the other half have voice_n == +0.5.

We’d like to quantify evidence for the effect of voice_n on accuracy using a Bayes factor analysis. In some cases, we expect there to be a large effect, so the Bayes factor analysis isn’t super crucial as there are alternative ways of doing hypothesis testing in those cases. But in other cases we expect no effect of voice_n, and so in those cases in particular we’d like to use Bayes factors to quantify evidence for the null hypothesis (to my knowledge, there isn’t a good alternative way to do this besides Bayes factors).

We’re fitting the model using 4 parallel chains distributed across 4 cores, and 8 threads per chain, with save_pars = save_pars(all = TRUE) and control = list(adapt_delta = 0.99). The priors are N(0, 10) for the intercept, LKJ(2) for the correlation matrix, N(0, 10) for voice_n, and N(0, 10) for the sds.
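For concreteness, the setup above looks roughly like this in brms (which I infer from save_pars(); the data frame name d is a placeholder):

```r
library(brms)

# Priors as described: N(0, 10) on the intercept, slope, and sds; LKJ(2) on correlations
priors <- c(
  prior(normal(0, 10), class = "Intercept"),
  prior(normal(0, 10), class = "b", coef = "voice_n"),
  prior(normal(0, 10), class = "sd"),
  prior(lkj(2), class = "cor")
)

fit <- brm(
  correct ~ voice_n + (1 + voice_n | subject) + (1 | item:voice_n),
  data = d,                          # placeholder name for the dataset
  family = bernoulli(),
  prior = priors,
  chains = 4, cores = 4,
  threads = threading(8),            # within-chain threading needs cmdstanr
  backend = "cmdstanr",
  save_pars = save_pars(all = TRUE), # required for bridge sampling later
  control = list(adapt_delta = 0.99)
)
```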

The problem I’m running into is that running bridge_sampler on the fitted model for the (eventual) Bayes factor analysis produces the warning message “logml could not be estimated within maxiter. Estimate may be more variable than usual.” From looking around here and elsewhere online, it seems this is due to not having enough posterior samples.
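The offending step is roughly the following (fit stands for the fitted model object with all parameters saved; fit_null would be a model fit the same way but without the voice_n fixed effect):

```r
library(bridgesampling)

# This call produces the "logml could not be estimated within maxiter" warning
logml_full <- bridge_sampler(fit)

# The eventual comparison would be against the null model:
# logml_null <- bridge_sampler(fit_null)
# bf(logml_full, logml_null)
```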

However, the most recent model I tried draws 2,000 warmup samples and a whopping 1,000,000 post-warmup samples per chain. The parameter with the smallest ESS is the intercept, at ~130,000, with all others considerably larger. All Rhats are 1.00. This model takes about 3 days to fit but, unfortunately, still produces the “logml could not be estimated within maxiter” warning.

From what I gather, this warning isn’t innocuous: it means the results of a Bayes factor analysis cannot necessarily be trusted. My thought was that the number of random intercepts for items (640 of them) might simply be too large, so perhaps the right way to deal with this would be to simplify the model by removing those random intercepts.

However, I found that, unfortunately, Oberauer (2022) has shown that removing random slopes can compromise Bayes factor analyses, making them more susceptible to false positives. Though his paper is specifically about random slopes, it leads me to believe that removing random intercepts would be similarly harmful.

So, now I’m a bit stuck: we’ve reached the limits of our computational resources with a model that takes 3 days to draw 1,000,000 posterior samples per chain, and there still aren’t enough samples to support the bridge sampling algorithm properly. But simplifying the model by removing any random effects also doesn’t appear to be a good option. Does anyone have any ideas about how to proceed in this situation? I’m open to alternatives to a Bayes factor analysis, provided they can actually quantify evidence for a null effect.

I’m not a fan of BFs for model selection, but this is exactly the reason we wrote [2508.14487] Bridge Sampling Diagnostics: so that people can avoid wasting time and compute. It is likely that you could have stopped sampling much earlier.

It is possible that you would get the same warning even if you kept sampling for 3 million years.

In our paper we show how to compute different diagnostics that can help you decide how many iterations are needed for a given accuracy, or whether even a million iterations might not be enough. Of these diagnostics, the MCSE diagnostic has been included in the latest CRAN release. It may provide useful information even when you get the “logml could not be estimated within maxiter” warning. The other diagnostics are available in Giorgio’s github repo, and we’re working on further improving them and stabilizing the bridgesampling iterations (which is why they are not yet in the bridgesampling CRAN release).
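As a starting point with the current CRAN release (function names in the development version may differ), repeated bridge sampling runs plus error_measures() give a first look at how variable the logml estimate actually is; fit here stands for a model fit with all parameters saved:

```r
library(bridgesampling)

# Run bridge sampling several times to see the spread of logml estimates
b <- bridge_sampler(fit, repetitions = 10)

error_measures(b)  # summarizes the variability across the repeated estimates
summary(b)
```

If the spread is already large relative to the logml difference you expect between models, more posterior draws are unlikely to rescue the comparison.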

In general, my recommendation is to start with Stan’s default number of iterations and chains, check the MCSE, and estimate how many more iterations might be needed for the desired accuracy in model comparison. If that number is in the millions, just give up on BFs.


You may be interested in

This lets you run the (PolyChord, slice-sampling variant of the) nested sampling algorithm on Stan models. The algorithm returns evidence estimates as well as posterior samples. If you share your whole model, I can take a look.

Kind regards,

Andrew