Due to the currently slow recruitment of participants, we have extended our study until mid-April. If you come across this, please consider participating or forwarding our study invitation to anyone who might fit the candidate pool. It would be greatly appreciated.
This is a big ask in terms of time, and I'm guessing people would want to know a bit more about the task. For example, are you assuming this will be done in a language like R or Python, or in an inference scheme like BUGS, Stan, or PyMC, or are participants free to use whatever they want?
Also, I'm curious what you actually mean by "inference debugging". Do you mean when the model is misspecified or miscoded and gives you strange results, or when you have the model you want coded correctly and sampling or variational inference doesn't work?
Hi @Bob_Carpenter, pleasure to be here and thank you for your interest :)
Yes, you are absolutely correct, it is a big ask, and I am happy to provide a few more details:
For this study we will use PyMC (sorry, not Stan, but the results should translate), and we provide a concise introduction to PyMC at the beginning.
You are free to use the internet or other resources at your disposal for the entire duration of the study.
The tasks all have the same form: you get a specified model and inference algorithm, and your task is to find out whether something is wrong and, if so, fix it. There could be problems with the specification of the model, problems at the inference algorithm level (e.g. bad hyperparameters), or both.
Since this is a very open task design, the time to solve a task varies from participant to participant. All tasks were pre-evaluated in another study to be solvable within 20 minutes. Some participants might need more time, but at a certain point we urge them to move on to the next task.
It might be that there are some helpful tools beyond classical inference analysis frameworks at your disposal.
I hope this answers your questions and I am happy to go into more detail if you want to know anything else.
Thanks for the clarification. If you're assuming a specific tool, you are going to need to adjust for how well users know that tool. I've been looking at PyMC for years and still find it confusing because I've never actually had to get real work done with it. And I find ArviZ even more confusing than PyMC, though I find Python much less confusing than R. So for me, you wouldn't be measuring how well I can reason about posteriors; you'd be measuring how well I could work through the frustration of learning an unfamiliar API during a timed trial. I'd be opening ChatGPT, asking it to translate things I know from Stan into PyMC, and then working by trial and error.
I assume for bullet three that there's also data: you can't validate a model plus inference algorithm unless you run something like simulation-based calibration, which would take more than 2 hours of compute for non-trivial models. On the other hand, you might be able to find a problem by simulating one data set it couldn't fit.
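The single-data-set check can be sketched with plain NumPy; the normal model here is hypothetical, and the point is only the workflow of drawing a parameter from the prior, simulating data, and checking whether the fit recovers it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw a "true" parameter from the prior of an assumed model y ~ Normal(mu, 1),
# with prior mu ~ Normal(0, 1).
mu_true = rng.normal(0.0, 1.0)

# Simulate one data set from that parameter.
y_sim = rng.normal(mu_true, 1.0, size=100)

# Fit the model to y_sim with the inference setup under test. If the
# posterior for mu does not cover mu_true, something in the model or the
# inference pipeline is likely wrong. The sample mean stands in here for
# whatever estimate the fitted posterior produces.
posterior_mean_stand_in = y_sim.mean()
```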
It's been some time, but since you expressed interest I figured I'd share the results of our study here; it has just been accepted for publication. Maybe someone else finds them interesting as well.
We often saw practitioners struggle with debugging Bayesian inference problems, or even with figuring out which analyses to perform and how to interpret them. Another problem we came across often is the time needed to run inference on complex models, especially if they include heavy computations, as is common in domains like physics or materials science. Practitioners were often unsure about their results for hours, or sometimes even days, creating very long debugging cycles. To combat these issues we developed a debugger for Bayesian inference systems that calculates common posterior statistics to generate helpful warnings and provides inference analysis plots online (updated regularly as new samples are generated).
To evaluate the effectiveness of our idea, we simulated longish (1-5 minute) inference tasks on models that are simple enough to be comprehended within the study time. The study shows that participants resolve more issues, and resolve them faster, when they use our tool, and most participants found the online visualizations and warnings useful and said they helped them discover issues faster. I hope these results are interesting to the community and help us all build better systems for practitioners.
[Disclaimer]: The tool should be seen as research software (e.g. some features are not fully fleshed out, and there are no stability guarantees, even though we saw no crashes during the study) and has thus far only been tested on a bespoke PPL and on PyMC with HMC and MH sampling. Nevertheless, the general ideas behind this tool translate to other languages like Stan and to other inference algorithms like variational inference.
It doesnāt have the same set of warnings, but it is indeed incredibly useful to see some information about the posterior before sampling is completed.
Thank you very much for sharing MCMC Monitor; I was not aware of it before. From an initial quick look it seems incredibly useful and fleshed out, yet underappreciated, judging by its GitHub stars and the fact that I had not come across it before. Given our results and everything I have observed so far, this approach is clearly beneficial to end users. Do you have any ideas why it didn't see wider adoption?
Running it requires some tools, like Node, that not everyone would have. This could be alleviated by compiling it into a standalone executable with something like Bun, I believe.
It doesnāt have the user-friendly warnings like you have developed, so some expertise is needed to get the real value. Some of your ideas could definitely find a home there in the future!
On that topic, I thought I might also mention that nutpie shows basic information like the step size, the number of divergences, and the number of gradient evaluations per draw in its progress bar during sampling, and it is possible to get the current trace while sampling is still running:
This will return right away and continue sampling in the background. You can then get the current trace with
```python
trace = sampler.inspect()
```
This is not quite as detailed as MCMC Monitor (which is really cool, by the way!), but it's at least already what I usually need when I work on a model.
Every time you tell me about a new nutpie feature, my motivation to maintain cmdstanpy gets damaged just a little bit…
I do wonder if that would be a reasonable way to get the monitoring in this study for Stan models. It is implemented for PyMC models, so maybe nutpie would be easier to adapt to?
Unfortunately I know too little of Stan's internals to really judge that, but I believe it shouldn't be too difficult. Our tool can be adapted quite quickly (at least partially).
For the weakest support, with basic monitoring and basic warnings, you only need to send HTTP requests to the tool: an initial request that contains the model file, the algorithm used, the number of burn-in samples, and the number of samples. After that you can either send one request after each sample or send them batched every X samples (or every X seconds; it doesn't matter to the tool). For PyMC this was really easy because the PyMC backends have a `record` method that is called on every sampler iteration, but I am sure something similar could be done for Stan. For Stan specifically, given that Stan programs are fully compiled C++ if I am not mistaken, adding some HTTP requests to the sampler when it's compiled with a flag like `live_debugging` should be straightforward?
The nutpie feature seems neat too, but I believe it would be easier to implement with a subscriber-like/callback interface.
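A subscriber/callback interface of the kind meant here could look like this minimal sketch (a hypothetical API, not nutpie's actual interface):

```python
class DrawSubscriber:
    """Interface a monitoring tool would implement."""

    def on_draw(self, chain, draw, stats):
        # Called once per sampler iteration with the new draw and
        # sampler statistics (step size, divergence flag, ...).
        raise NotImplementedError

class Sampler:
    """Sketch of a sampler that notifies subscribers about each draw."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, subscriber):
        self._subscribers.append(subscriber)

    def _notify(self, chain, draw, stats):
        # The sampler would call this inside its main loop.
        for subscriber in self._subscribers:
            subscriber.on_draw(chain, draw, stats)
```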
But I also want to add that, given MCMC Monitor already exists for Stan and already seems more focused on people using it than on evaluating an idea (like our tool), maybe adding the warnings and sampler stats there would be an easier road overall.
> Every time you tell me about a new nutpie feature, my motivation to maintain cmdstanpy gets damaged just a little bit…
Please don't stop, I really need it to run the benchmarks that show how much faster nutpie is.
Regarding nutpie integration with something more detailed like MCMC Monitor or this new debugger, I wonder how much work it would be to include a small web server in nutpie that serves a little website with this info during sampling. That would mean a bunch of new dependencies in nutpie, although some networking ones are already there due to remote Zarr support for S3…
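As a rough sketch of how small such a server could be, here is a stdlib-only version that serves the latest sampler stats as JSON from a background thread; the stats fields are placeholders, and a real integration would update them from inside the sampler loop:

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder stats the sampler would update as it runs.
latest_stats = {"draws": 0, "divergences": 0, "step_size": None}

class StatsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the current stats snapshot as JSON on every GET.
        body = json.dumps(latest_stats).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the sampler's console output clean

def serve_in_background(port=8765):
    # Port 0 asks the OS for any free port.
    server = HTTPServer(("127.0.0.1", port), StatsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A browser pointed at the returned server's address would then show the live stats, which is essentially the "little website" idea in miniature.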