Participation Needed: Probabilistic Programming Study

Hello everyone,

We are currently conducting a study on Bayesian inference debugging and would like to invite you to participate.

Your participation would be greatly appreciated and could help further knowledge about debugging inference and analyzing inference results.

The study is conducted fully online, and to take part you only need:

  • To have worked with probabilistic programming languages or Bayesian inference in some form before (at any level)
  • A stable internet connection
  • Zoom
  • A PC/laptop or similar device
  • A GitHub account

The study will take at most 2 hours of your time (90 minutes or less is likely), and you will be recorded during most of it.

It involves solving three inference-debugging tasks and filling out two questionnaires.

If you are interested and have time between February 21st and March 6th, please reach out to nathanael.nussbaumer@tuwien.ac.at.

Thank you in advance; we look forward to your participation!


Due to slow recruitment of participants so far, we have extended the duration of our study until mid-April. If you come across this, please consider participating or forwarding our study invitation to anyone who might fit the candidate pool. It would be greatly appreciated.

Thank you very much.

Hi @nattube and welcome to the Stan Forums.

This is a big ask in terms of time and I’m guessing people would want to know a bit more about the task. For example, are you assuming this is going to be done in some language like R or Python or in some inference scheme like BUGS or Stan or PyMC or are participants free to use whatever they want?

Also, I’m curious what you actually mean by “inference debugging”. Do you mean when the model is misspecified or miscoded and gives you strange results, or when you have the model you want coded correctly and sampling or variational inference doesn’t work?

Hi @Bob_Carpenter, pleasure to be here and thank you for your interest :)

Yes, you are absolutely correct, it is a big ask, and I am happy to provide a few more details:

  • For this study we will use PyMC (sorry, not Stan, but the results should translate), and we provide a concise introduction to PyMC at the beginning.

  • You are free to use the internet or other resources at your disposal during the whole study.

  • The tasks all have the same form: you get a specified model and inference algorithm, and your task is to find out whether something is wrong, and if so, to fix it. There could be problems with the specification of the model, problems at the inference-algorithm level (e.g., bad hyperparameters), or both (see the sketch after this list).

  • Since this is a very open task design, the time to solve a task varies from participant to participant. All tasks were pre-evaluated in another study to be solvable within 20 minutes. Some participants might need more time, but at a certain point we urge them to continue with the next task.

  • There might be some helpful tools beyond classical inference-analysis frameworks at your disposal.
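
To make the format concrete, here is a hypothetical sketch of what such a task could look like in PyMC (this is not one of the actual study tasks; the model, data, and sampler settings are invented purely for illustration):

```python
import numpy as np
import pymc as pm

# Toy data, invented for this illustration.
rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=100)

with pm.Model():
    # Possible model-level problem: a prior this narrow pins mu near 0
    # even though the data are centered around 3.
    mu = pm.Normal("mu", mu=0.0, sigma=0.1)
    sigma = pm.HalfNormal("sigma", sigma=1.0)
    pm.Normal("y", mu=mu, sigma=sigma, observed=data)

    # Possible inference-level problem: very short tuning is exactly the
    # kind of sampler hyperparameter a participant would need to scrutinize.
    idata = pm.sample(draws=500, tune=50)
```

A participant would then inspect the results (divergences, R-hat, posterior vs. prior, etc.) and decide whether the model, the sampler settings, or both need fixing.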

I hope this answers your questions and I am happy to go into more detail if you want to know anything else.

Thanks for the clarification. If you’re assuming a specific tool, you are going to need to adjust for how well users know that tool. I’ve been looking at PyMC for years and still find it confusing because I’ve never actually had to get real work done with it. And I find ArviZ even more confusing than PyMC. Though I find Python much less confusing than R. So for me, you wouldn’t be measuring how well I can reason about posteriors, you’d be measuring how well I could work through the frustration of learning to use an unfamiliar API during a timed trial. I’d be opening ChatGPT and asking it to translate things I know from Stan into PyMC and then working by trial and error.

I assume for bullet three that there’s also data—you can’t validate model plus inference algorithm unless you run something like simulation-based calibration, which would take more than 2 hours of compute for non-trivial models. On the other hand, you might be able to find a problem by simulating one data set it couldn’t fit.
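
For a single-replication version of that check, a sketch along these lines (model, numbers, and seeds all made up for illustration) would do:

```python
import numpy as np
import pymc as pm
import arviz as az

# Ground truth, chosen by hand for this illustration.
rng = np.random.default_rng(42)
true_mu, true_sigma = 2.0, 1.5
y_sim = rng.normal(true_mu, true_sigma, size=200)

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)
    sigma = pm.HalfNormal("sigma", sigma=5.0)
    pm.Normal("y", mu=mu, sigma=sigma, observed=y_sim)
    idata = pm.sample(draws=1000, tune=1000, random_seed=42)

# Check whether the posterior intervals cover true_mu and true_sigma.
print(az.summary(idata, var_names=["mu", "sigma"]))
```

If the posterior intervals repeatedly miss the known true values across such replications, either the model or the inference is broken.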

Yes, that’s important, and it is something we try to capture through our questionnaires.

Honestly, that is a valid strategy :)

Yes, there is data; I just didn’t mention it there because there won’t be issues with the prepared data itself.

If all of this has caught your interest and you want to give it a try, I’d be happy if you have time to participate.

I’m curious about the result, but two hours is a lot of time!