Participation Needed: Probabilistic Programming Study

Hello everyone,

We are currently conducting a study on Bayesian inference debugging and would like to invite you to participate.

Your participation would be greatly appreciated and could help further knowledge in the area of debugging inference and analyzing inference results.

The study is conducted fully online, and to take part you only need:

  • To have worked with probabilistic programming languages or Bayesian inference in some form before (at any level)
  • A stable internet connection
  • Zoom
  • A PC/laptop or similar device
  • A GitHub account

The study will take at most 2 hours of your time (90 minutes or less is likely), and you will be recorded during most of it.

It involves solving 3 inference debugging tasks and 2 questionnaires.

If you are interested and have time between February 21st and March 6th, please reach out to nathanael.nussbaumer@tuwien.ac.at.

Thank you in advance and looking forward to your participation!


Due to slow recruitment of participants, we have extended our study until mid-April. If you come across this, please consider participating or forwarding our study invitation to anyone who might fit the candidate pool. It would be greatly appreciated.

Thank you very much.

Hi @nattube and welcome to the Stan Forums.

This is a big ask in terms of time and I’m guessing people would want to know a bit more about the task. For example, are you assuming this is going to be done in some language like R or Python or in some inference scheme like BUGS or Stan or PyMC or are participants free to use whatever they want?

Also, I’m curious what you actually mean by "inference debugging". Do you mean when the model is misspecified or miscoded and gives you strange results, or when you have the model you want coded correctly and sampling or variational inference doesn’t work?

Hi @Bob_Carpenter, pleasure to be here and thank you for your interest :)

Yes you are absolutely correct, it is a big ask and I am happy to provide a couple more details:

  • For this study we will use PyMC (sorry, not Stan, but the results should translate), and we provide a concise introduction to PyMC at the beginning.

  • You are free to use the internet or any other resources at your disposal for the whole duration of the study.

  • The tasks all have the same form: you get a specified model and inference algorithm, and your task is to find out if something is wrong, and if so → fix it. There could be problems with the specification of the model, problems at the inference algorithm level (e.g. bad hyperparameters), or both.

  • Since this is a very open task design, the time to solve a task differs from participant to participant. All tasks were pre-evaluated in another study to be solvable within 20 minutes. Some participants might need more time, but at a certain point we urge them to move on to the next task.

  • There might also be some helpful tools at your disposal beyond classical inference analysis frameworks.
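As an aside, to give a feel for the kind of check a classical inference analysis framework performs, here is a minimal split-R-hat computation in plain NumPy. This is my own illustration of the standard diagnostic, not the study's tooling:

```python
import numpy as np

def split_rhat(chains: np.ndarray) -> float:
    """Split-R-hat convergence diagnostic for one parameter.

    chains has shape (n_chains, n_draws). Values near 1.0 suggest
    the chains agree; values well above ~1.01 are a warning sign.
    """
    n_chains, n_draws = chains.shape
    half = n_draws // 2
    # Split each chain in half so within-chain drift also shows up.
    splits = chains[:, :2 * half].reshape(2 * n_chains, half)
    n = half
    chain_means = splits.mean(axis=1)
    b = n * chain_means.var(ddof=1)         # between-chain variance
    w = splits.var(axis=1, ddof=1).mean()   # mean within-chain variance
    var_plus = (n - 1) / n * w + b / n
    return float(np.sqrt(var_plus / w))

rng = np.random.default_rng(0)
good = rng.normal(size=(4, 1000))           # four well-mixed chains
stuck = good + np.arange(4)[:, None] * 3.0  # chains in different regions
print(split_rhat(good))   # close to 1.0
print(split_rhat(stuck))  # far above 1.0: the chains disagree
```

Frameworks like ArviZ compute this (in a more refined, rank-normalized form) for every parameter; our warnings build on that kind of statistic.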

I hope this answers your questions and I am happy to go into more detail if you want to know anything else.

Thanks for the clarification. If you’re assuming a specific tool, you are going to need to adjust for how well users know that tool. I’ve been looking at PyMC for years and still find it confusing because I’ve never actually had to get real work done with it. And I find ArviZ even more confusing than PyMC. Though I find Python much less confusing than R. So for me, you wouldn’t be measuring how well I can reason about posteriors, you’d be measuring how well I could work through the frustration of learning to use an unfamiliar API during a timed trial. I’d be opening ChatGPT and asking it to translate things I know from Stan into PyMC and then working by trial and error.

I assume for bullet three that there’s also data—you can’t validate model plus inference algorithm unless you run something like simulation-based calibration, which would take more than 2 hours of compute for non-trivial models. On the other hand, you might be able to find a problem by simulating one data set it couldn’t fit.
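The cheap single-dataset version of that check can be sketched like this (my illustration, using a conjugate normal model so the posterior is closed form and no sampler is needed):

```python
import numpy as np

# One-dataset sanity check: simulate data from known parameters and
# verify the posterior concentrates around the truth. A conjugate
# normal-normal model keeps everything in closed form.
rng = np.random.default_rng(42)

mu_true = 1.5
sigma = 2.0                     # known observation noise
prior_mu, prior_sd = 0.0, 10.0
y = rng.normal(mu_true, sigma, size=200)

# Closed-form posterior for mu (normal prior + normal likelihood).
prec = 1 / prior_sd**2 + len(y) / sigma**2
post_mean = (prior_mu / prior_sd**2 + y.sum() / sigma**2) / prec
post_sd = float(np.sqrt(1 / prec))

# If mu_true falls far outside the posterior, something in the
# model or the inference code is likely miscoded.
print(f"posterior: {post_mean:.2f} +/- {post_sd:.2f}, truth: {mu_true}")
```

Full simulation-based calibration repeats this over many prior draws and checks the rank statistics, which is where the compute cost comes from.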

Yes, that’s important, and it is something we try to capture through our questionnaires.

Honestly, that is a valid strategy :)

Yes, there is data; I just didn’t mention it there because there won’t be issues with the prepared data.

If all of this caught your interest and you want to give it a try, I’d be happy if you have time to participate.

I’m curious about the result, but two hours is a lot of time!

Hi,

It’s been some time, but since you expressed interest I figured I’d share the results of our study here; it has just been accepted for publication. Maybe someone else finds them interesting as well.

[2510.26579] Online and Interactive Bayesian Inference Debugging

TL;DR / What’s it about:

We often saw practitioners struggle with debugging Bayesian inference problems, or even with figuring out which analyses to perform and how to interpret them. Another problem we came across often is the time needed to run inference on complex models, especially ones that include heavy computations, as is common in domains like physics or materials science. Practitioners were often unsure about their results for hours, or sometimes even days, creating very long debugging cycles. To combat these issues we developed a debugger for Bayesian inference systems that calculates common posterior statistics to generate helpful warnings and provides inference analysis plots online (updated regularly as new samples are generated).

To evaluate the effectiveness of our idea, we simulated long-ish (1-5 minute) inference tasks on models simple enough to be comprehended within the study time. The study shows that participants resolve more issues, and resolve them faster, when they use our tool; most participants found the online visualizations and warnings useful and said they helped them discover issues sooner. I hope these results are interesting to the community and help us all build better systems for practitioners.

[Disclaimer]: The tool should be seen as research software (some features are not fully fleshed out and there are no stability guarantees, even though we faced no crashes during the study…) and has so far only been tested on a bespoke PPL and on PyMC with HMC and MH sampling. Nevertheless, the general ideas behind this tool translate to other languages like Stan and to other inference algorithms like variational inference.


A similar tool with Stan support is MCMC Monitor (MCMC Monitor -- Online monitoring of Stan runs).

It doesn’t have the same set of warnings, but it is indeed incredibly useful to see some information about the posterior before sampling is completed.

Thanks for sharing!


Thank you very much for sharing MCMC Monitor; I was not aware of it before. From a quick initial look it seems incredibly useful and fleshed out, yet underappreciated, given its GitHub stars and the fact that I had not come across it before. Given our results and everything I have observed so far, this approach is clearly beneficial to end users. Do you have any ideas why it didn’t see wider adoption?

I think there are a few reasons:

  • Running it requires some tools, like Node, that not everyone has installed. This could be alleviated by compiling it into a standalone executable with something like bun, I believe.
  • It doesn’t have the user-friendly warnings like you have developed, so some expertise is needed to get the real value. Some of your ideas could definitely find a home there in the future!
  • Generally speaking, promoting new tools is hard

Generally speaking, promoting new tools is hard

On that topic, I thought I might also mention that nutpie shows basic information like the step size, number of divergences and number of gradient evaluations per draw during sampling in its progress bar, and it is possible to get the current trace during sampling:

import nutpie

compiled = nutpie.compile_stan_model(code=stan_code).with_data(**data)
sampler = nutpie.sample(compiled, blocking=False)

This will return right away, and continue sampling in the background. You can then get the current trace with

trace = sampler.inspect()

This is not quite as detailed as MCMC Monitor (which is really cool by the way!), but it’s at least already what I usually need when I work on a model.


Every time you tell me about a new nutpie feature, my motivation to maintain cmdstanpy gets damaged just a little bit…

I do wonder if that would be a reasonable way to get the monitoring from this study working for Stan models — it is implemented for PyMC models, so maybe nutpie would be easier to adapt to?

Unfortunately I know too little of Stan’s internals to really judge that, but I believe it shouldn’t be too difficult. Our tool can be adapted quite quickly (at least partially).

For the weakest level of support, with basic monitoring and basic warnings, you only need to send HTTP requests to the tool: an initial request that contains the model file, the algorithm used, the number of burn-in samples, and the number of samples; after that you can either send one request after each sample or send them batched every X samples (or every X seconds, it doesn’t matter to the tool). For PyMC this was really easy because the PyMC backends have a `record` method that is called every sampler iteration. I am sure something similar could be done for Stan. For Stan specifically, given that Stan programs are fully compiled C++ if I am not mistaken, adding some HTTP requests to the sampler when it’s compiled with a flag like `live_debugging` should be straightforward?
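To make that concrete, a rough sketch of what the sender side could look like. The endpoint paths and field names here are hypothetical stand-ins, not the tool’s actual API:

```python
import json
from urllib import request

# Hypothetical debugger address; the real tool's endpoints and
# JSON schema may differ.
DEBUGGER_URL = "http://localhost:8080"

def init_payload(model_source: str, algorithm: str,
                 n_burnin: int, n_samples: int) -> bytes:
    """Initial request: model file, algorithm, and sample counts."""
    return json.dumps({
        "model": model_source,
        "algorithm": algorithm,
        "burnin": n_burnin,
        "samples": n_samples,
    }).encode()

def batch_payload(draws: list[dict]) -> bytes:
    """Batched follow-up: the draws accumulated since the last send."""
    return json.dumps({"draws": draws}).encode()

def post(path: str, body: bytes) -> None:
    """Fire-and-forget POST from inside the sampler."""
    req = request.Request(DEBUGGER_URL + path, data=body,
                          headers={"Content-Type": "application/json"})
    request.urlopen(req)

# Usage inside a sampler loop would look roughly like:
#   post("/init", init_payload(code, "HMC", 500, 2000))
#   buffer = []
#   for draw in sampler:
#       buffer.append(draw)
#       if len(buffer) >= 50:          # batch every 50 draws
#           post("/draws", batch_payload(buffer))
#           buffer.clear()
```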

The nutpie feature seems neat too, but I believe it would be easier to integrate via a subscriber-like/callback interface.
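By a subscriber-like/callback interface I mean something along these lines (a hand-rolled illustration, not nutpie’s or PyMC’s actual API):

```python
from typing import Callable

# Minimal subscriber/callback pattern: the sampler notifies every
# registered listener after each draw, so a debugger can hook in
# without the sampler knowing anything about HTTP or plotting.
DrawCallback = Callable[[int, dict], None]

class CallbackSampler:
    def __init__(self) -> None:
        self._subscribers: list[DrawCallback] = []

    def subscribe(self, callback: DrawCallback) -> None:
        self._subscribers.append(callback)

    def run(self, n_draws: int) -> None:
        for i in range(n_draws):
            draw = {"mu": 0.1 * i}  # stand-in for a real MCMC draw
            for callback in self._subscribers:
                callback(i, draw)

seen = []
sampler = CallbackSampler()
sampler.subscribe(lambda i, draw: seen.append(draw["mu"]))
sampler.run(3)
print(seen)  # [0.0, 0.1, 0.2]
```

The debugger would then register one callback that buffers draws and forwards them, which is essentially what PyMC’s per-iteration `record` hook gave us for free.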

To get the model graph and support for the funnel warnings we currently rely on GitHub - lasapp/lasapp: Language-Agnostic Static Analysis of Probabilistic Programming: Replication package . I believe it has no Stan support, but we really only need an interface our debugger can call that provides (a) a model graph and (b) the RVs whose scale parameter is influenced by other RVs.

And lastly, for warnings tailored to Stan, we need to add Stan to the languages in the tool itself, which shouldn’t be too complicated (see the PyMC language definition: InferlogHolmes-Appendix/InferLogHolmes/extension/webview-src/ppl-debugger-webview/src/PPL/pymc.ts at main · ipa-lab/InferlogHolmes-Appendix · GitHub ). Otherwise it might be confusing when a warning suggests to "change target_accept to a higher value" while Stan calls this adapt_delta. Also, the code change suggestions wouldn’t be great without it 😅

But I also want to add that, given MCMC Monitor already exists for Stan and already seems more focused on real users than on evaluating an idea (like our tool), maybe adding the warnings and sampler stats there would be an easier road overall.

Now that’s clever!

Every time you tell me about a new nutpie feature, my motivation to maintain cmdstanpy gets damaged just a little bit…

Please don’t stop, I really need it to run the benchmarks that show how much faster nutpie is.

Regarding nutpie integration with something more detailed like MCMC Monitor or this new debugger, I wonder how much work it would be to just include a little webserver in nutpie that serves a small website with this info during sampling. That would mean a bunch of new dependencies in nutpie, although some networking ones are already there due to remote zarr support for S3…