Reviewing papers using bespoke models - demand evidence of calibration?


This is, I think, at the very edge of topics that are appropriate for this board, but it seems relevant enough. I thought I’d ask, as people here will probably have better-thought-out views on it than I do.

A group in our department recently discussed a paper whose authors had implemented a bespoke Bayesian model (not in Stan in this case, but the question would be just as relevant here). The methods section provided no evidence that the model was sampling from the correct posterior (i.e. that it was implemented correctly), and nothing in the paper indicated that this had even been checked. If you were asked to review a paper like this, where the model was coded from scratch by the authors rather than through a standard front end (which will presumably have been more thoroughly tested), do you think you would be justified in demanding that the authors check calibration, using something like simulation-based calibration?

For the record, my instinct is yes. I think I would ask for it. I can’t see that the results can be trusted without it, given the risk of coding errors that are easy to miss.


In general, yes, I think calibration should be required of any bespoke method implemented for an analysis, but at the same time I would be surprised if anyone responded positively. Simulation-based calibration aims to be as general as possible, but it can still be subtle to implement (see for example the MCMC and small deviation sections of the SBC paper). I’ve also had numerous circular arguments with practitioners in applied fields about what Bayesian computation is actually computing, and why quantifiable errors on posterior expectation value estimators are critical to understanding what the posterior is actually trying to communicate.

Consequently it might be more productive if you requested their code and implemented SBC yourself, treating the analysis as a demonstration for the rest of the community and something to reference in future reviews.
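
For anyone who wants to see the shape of such a check, here is a minimal sketch of the SBC loop in Python. It is not the paper's model or code: the conjugate normal model, the function names (`sbc_ranks`, `exact_posterior`, `buggy_posterior`, `chi_square_uniformity`) and the deliberately introduced bug are all assumptions made up for illustration. In a real review the authors' sampler would take the place of `exact_posterior`, and because these toy samplers return independent draws, the sketch skips the thinning that correlated MCMC output would need before ranks are computed.

```python
import numpy as np


def sbc_ranks(posterior_sampler, n_sims=1000, n_draws=99, n_obs=10,
              sigma=1.0, seed=0):
    """Rank of each prior draw among posterior draws fitted to its own data.

    Toy model (an assumption for this sketch):
        theta ~ Normal(0, 1),  y_i ~ Normal(theta, sigma).
    """
    rng = np.random.default_rng(seed)
    ranks = np.empty(n_sims, dtype=int)
    for s in range(n_sims):
        theta = rng.normal(0.0, 1.0)              # draw from the prior
        y = rng.normal(theta, sigma, size=n_obs)  # simulate data given theta
        draws = posterior_sampler(y, sigma, n_draws, rng)
        ranks[s] = np.sum(draws < theta)          # rank lies in 0..n_draws
    return ranks


def exact_posterior(y, sigma, n_draws, rng):
    """Stand-in for a correct sampler: the exact conjugate posterior."""
    prec = 1.0 + len(y) / sigma**2
    mean = (y.sum() / sigma**2) / prec
    return rng.normal(mean, np.sqrt(1.0 / prec), size=n_draws)


def buggy_posterior(y, sigma, n_draws, rng):
    """Same as above, with a deliberate bug: variance passed as the sd."""
    prec = 1.0 + len(y) / sigma**2
    mean = (y.sum() / sigma**2) / prec
    return rng.normal(mean, 1.0 / prec, size=n_draws)


def chi_square_uniformity(ranks, n_draws=99, n_bins=10):
    """Pearson chi-square statistic of binned ranks against uniformity."""
    assert (n_draws + 1) % n_bins == 0, "bins must divide the rank range"
    width = (n_draws + 1) // n_bins
    counts = np.bincount(ranks // width, minlength=n_bins)
    expected = len(ranks) / n_bins
    return float(np.sum((counts - expected) ** 2 / expected))


# Under a correct sampler the statistic is roughly chi-square distributed
# with n_bins - 1 degrees of freedom; a much larger value signals trouble.
print("correct sampler:", chi_square_uniformity(sbc_ranks(exact_posterior)))
print("buggy sampler:  ", chi_square_uniformity(sbc_ranks(buggy_posterior)))
```

A histogram of the ranks is usually more informative than a single statistic. The bug above makes the posterior too narrow, so the ranks pile up at both ends rather than spreading evenly.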