Due to slow recruitment of participants, we have extended the duration of our study until mid-April. If you come across this, please consider participating or forwarding our study invitation to anyone who might fit the candidate pool. It would be greatly appreciated.
This is a big ask in terms of time and I’m guessing people would want to know a bit more about the task. For example, are you assuming this is going to be done in some language like R or Python or in some inference scheme like BUGS or Stan or PyMC or are participants free to use whatever they want?
Also, I’m curious what you actually mean by “inference debugging”. Do you mean when the model is misspecified or miscoded and gives you strange results, or when you have the model you want coded correctly and sampling or variational inference doesn’t work?
Hi @Bob_Carpenter, pleasure to be here and thank you for your interest :)
Yes, you are absolutely correct, it is a big ask, and I am happy to provide a few more details:
For this study we will use PyMC (sorry, not Stan, but the results should translate), and we provide a concise introduction to PyMC at the beginning.
You are free to use the internet or any other resources at your disposal throughout the study.
The tasks all have the same form: you are given a specified model and inference algorithm, and your task is to find out whether something is wrong and, if so, to fix it. There could be problems with the specification of the model, problems at the inference-algorithm level (e.g. bad hyperparameters), or both (see the sketch below).
Since this is a very open task design, the time needed to solve a task varies from participant to participant. All tasks have been pre-evaluated in another study to be solvable within 20 minutes. Some participants might need more time, but at a certain point we ask them to move on to the next task.
There may also be some helpful tools at your disposal beyond the classical inference-analysis frameworks.
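To give a rough idea of the format, here is a minimal illustrative sketch (not one of the actual study tasks, just something I made up for this post) of a PyMC model that contains both kinds of problems: a suspiciously tight prior and a poorly configured sampler.

```python
import numpy as np
import pymc as pm

# Illustrative sketch only -- not one of the actual study tasks.
# It contains a model-specification issue (an overly tight prior on the
# noise scale) and an inference-level issue (almost no tuning and a low
# target acceptance rate).
rng = np.random.default_rng(1)
x = rng.normal(size=100)
y = 2.0 * x + rng.normal(scale=3.0, size=100)

with pm.Model():
    beta = pm.Normal("beta", mu=0.0, sigma=10.0)
    sigma = pm.HalfNormal("sigma", sigma=0.1)  # far too tight for data with sd ~ 3
    pm.Normal("y_obs", mu=beta * x, sigma=sigma, observed=y)

    # Very little tuning leaves the sampler poorly adapted.
    idata = pm.sample(draws=500, tune=10, target_accept=0.5)
```

A participant would be expected to notice the resulting warnings and diagnostics (e.g. divergences, low effective sample size) and then fix the prior and/or the sampler settings.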
I hope this answers your questions and I am happy to go into more detail if you want to know anything else.
Thanks for the clarification. If you’re assuming a specific tool, you are going to need to adjust for how well users know that tool. I’ve been looking at PyMC for years and still find it confusing because I’ve never actually had to get real work done with it. And I find ArviZ even more confusing than PyMC. Though I find Python much less confusing than R. So for me, you wouldn’t be measuring how well I can reason about posteriors, you’d be measuring how well I could work through the frustration of learning to use an unfamiliar API during a timed trial. I’d be opening ChatGPT and asking it to translate things I know from Stan into PyMC and then working by trial and error.
I assume for bullet three that there’s also data—you can’t validate model plus inference algorithm unless you run something like simulation-based calibration, which would take more than 2 hours of compute for non-trivial models. On the other hand, you might be able to find a problem by simulating one data set it couldn’t fit.
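For concreteness, here's roughly the kind of single-data-set check I have in mind, sketched in PyMC (which, per the above, I'd be translating from Stan by trial and error):

```python
import numpy as np
import pymc as pm
import arviz as az

# Sketch of a single-data-set recovery check: simulate data from known
# parameter values, fit the model, and see whether the posterior covers
# the truth. Far cheaper than full simulation-based calibration, but it
# can still catch a model or sampler that is clearly broken.
rng = np.random.default_rng(0)
true_mu, true_sigma = 1.5, 2.0
y_sim = rng.normal(true_mu, true_sigma, size=200)

with pm.Model():
    mu = pm.Normal("mu", 0.0, 10.0)
    sigma = pm.HalfNormal("sigma", 5.0)
    pm.Normal("y", mu, sigma, observed=y_sim)
    idata = pm.sample(draws=1000, tune=1000)

print(az.summary(idata, var_names=["mu", "sigma"]))
# Check whether the posterior intervals cover true_mu and true_sigma
# and whether r_hat / ESS look reasonable.
```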