AutoStan: Can we automate (parts of) Bayesian workflow with coding agents?

Bayesian modeling with Stan is beautiful, principled, and powerful — but it requires a bit of babysitting and knowledge of the MCMC process. Divergences, R-hat, reparameterizations, … This gap between “just run lm()” and “write a well-specified Stan model” keeps some practitioners away.

I recently stumbled upon Karpathy’s autoresearch (GitHub: karpathy/autoresearch, “AI agents running research on single-GPU nanochat training automatically”), in which a coding agent like Claude Code autonomously optimizes the training process of an LLM overnight. As a pet project, I tried something similar for Bayesian modeling: I gave Claude Code a dataset and a short natural-language description. The agent iterates on the Stan model file, guided by only two feedback signals: NLPD (negative log predictive density, i.e. the log score) on held-out data and the Stan log file (divergences, R-hat, ESS). No domain knowledge is baked in and there is no custom framework; it figures out things like reparameterization on its own and iteratively writes better Stan code. That’s a bit crazy.
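For readers who have not computed NLPD by hand, here is a minimal sketch of the held-out score, assuming you already have pointwise log-likelihood values for each posterior draw. The function names, the `log_lik[s][i]` layout, and the choice to average over observations (rather than sum) are my assumptions for illustration, not necessarily what the AutoStan code does.

```python
import math

def log_mean_exp(values):
    """Numerically stable log(mean(exp(values)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def nlpd(log_lik):
    """Mean negative log predictive density on held-out data.

    log_lik[s][i] is log p(y_i | theta_s) for posterior draw s and
    held-out observation i (hypothetical layout).
    """
    n_obs = len(log_lik[0])
    total = 0.0
    for i in range(n_obs):
        # Average the predictive density over draws, on the log scale.
        total += log_mean_exp([draw[i] for draw in log_lik])
    return -total / n_obs

# Toy input: 2 posterior draws, 2 held-out points.
toy = [[math.log(0.5), math.log(0.2)],
       [math.log(0.3), math.log(0.4)]]
print(nlpd(toy))
```

Lower is better: a model that puts more predictive density on the held-out points gets a smaller NLPD.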

On a regression dataset with outliers it progressed from naive linear regression to a contamination mixture model, matching TabPFN while remaining fully interpretable.
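To make “contamination mixture” concrete, here is a rough sketch of the per-observation log-likelihood in its textbook form: most points come from a normal with scale `sigma`, and a small fraction `pi_out` come from a much wider normal. The parameter names and the inflated-scale form are assumptions for illustration; the model the agent actually ended up with is in the repo and may differ.

```python
import math

def normal_logpdf(y, mu, sigma):
    z = (y - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def contaminated_logpdf(y, mu, sigma, pi_out, inflate):
    """log[(1 - pi_out) * N(y | mu, sigma) + pi_out * N(y | mu, inflate * sigma)]."""
    a = math.log1p(-pi_out) + normal_logpdf(y, mu, sigma)        # inlier component
    b = math.log(pi_out) + normal_logpdf(y, mu, inflate * sigma)  # outlier component
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))       # log-sum-exp

# An observation 8 sd from the mean is far less surprising under the mixture.
print(normal_logpdf(8.0, 0.0, 1.0))
print(contaminated_logpdf(8.0, 0.0, 1.0, pi_out=0.1, inflate=10.0))
```

In Stan the same two-component density can be written with `log_mix`, so the outlier handling stays readable in the model block.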

This is really a rough write-up of an experiment more than a polished paper. I’m genuinely unsure what to make of it and curious what you guys think of it. Have you run similar experiments with agents like Claude Code writing Stan code?

ArXiv: http://arxiv.org/abs/2603.27766
GitHub: https://github.com/tidit-ch/autostan

Considering how many case studies, git repos and discussions about these exist, and how well recent Claude versions perform in other tasks, I’m not surprised.

I have used Claude for writing and editing Stan code, but not in a fully automated fashion. If I did, I would prefer to know the metrics for intermediate models, too. It is also likely that I would provide more context and certain components that I’d prefer to be tested anyway. In the completely automatic process there are the problems of overfitting and of losing explainability when the components are not well informed (the usual mixture model mode), which you also listed and which were already obvious in Automatic Statistician years ago (it seems the web page for Automatic Statistician is not working anymore). I think using AI agents for speeding up model exploration can be useful, but you would still need to understand what they produce, as they do make silly things, too.

Thanks for sharing your experiments.

Thanks Aki!

Considering how many case studies, git repos and discussions about these exist, and how well recent Claude versions perform in other tasks, I’m not surprised.

Yes, there is so much Bayes in repos and other case studies that coming up with a novel dataset for agents is genuinely hard. I tried to anonymize a football dataset, but Claude figured it out in the first iteration and went straight to a Poisson attack/defense structure. When I asked, it said something along the lines of: “18 entities, integers from 0 to 7: this must be soccer.”
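For anyone not familiar with the Poisson attack/defense structure mentioned here, this is roughly the textbook form: each team has an attack and a defense strength, and goals are Poisson with a log-linear rate. The sketch below is only that generic form with hypothetical names (`attack`, `defense`, `home_adv`, `intercept`); it is not the Stan model the agent wrote.

```python
import math

def poisson_logpmf(k, rate):
    return k * math.log(rate) - rate - math.lgamma(k + 1)

def match_loglik(home_goals, away_goals, home, away,
                 attack, defense, home_adv, intercept):
    """Log-likelihood of one match under independent Poisson goal counts."""
    # Scoring rate rises with own attack and falls with opponent defense.
    log_rate_home = intercept + home_adv + attack[home] - defense[away]
    log_rate_away = intercept + attack[away] - defense[home]
    return (poisson_logpmf(home_goals, math.exp(log_rate_home))
            + poisson_logpmf(away_goals, math.exp(log_rate_away)))

# Two equal teams, no home advantage: both rates equal exp(0) = 1.
attack = [0.0, 0.0]
defense = [0.0, 0.0]
print(match_loglik(0, 0, 0, 1, attack, defense, home_adv=0.0, intercept=0.0))
```

With 18 teams and scores between 0 and 7, this structure is the obvious first guess, which is presumably why the anonymization did not help.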

I have used Claude for writing and editing Stan code, but not in a fully automated fashion. If I did, I would prefer to know the metrics for intermediate models, too.

Regarding intermediate metrics, they are logged. The full NLPD trajectory per iteration is shown in Figure 1 of the paper and the complete iteration history is available in the repo: see results and models for the Stan files.

It is also likely that I would provide more context and certain components that I’d prefer to be tested anyway.

Yes, that is definitely something worth exploring further.

In the completely automatic process there are the problems of overfitting and of losing explainability when the components are not well informed (the usual mixture model mode), which you also listed and which were already obvious in Automatic Statistician years ago (it seems the web page for Automatic Statistician is not working anymore).

Overfitting might to some degree be handled by the held-out NLPD, but explainability is a harder problem.

I think using AI agents for speeding up model exploration can be useful, but still you would need to understand what they produce as they do make silly things, too.

Definitely — but the nice thing with Stan is that you can at least understand what the agent is doing.

Thanks for sharing these results. On one hand, it’s unsurprising that the agents reliably produced strong-performing models in the tasks you laid out, given that they all fall into fairly standard modeling categories. On the other hand, it is still impressive that they can produce what takes a decent bit of domain expertise in Bayesian modeling to pull off.

It would be interesting to try to find real-world datasets where the right model involves a tricky latent structure that requires a good deal of reasoning to get something sensible.

Beyond the fully automated modeling process you show here, I’ve had some good luck using agents to assist in some more tedious aspects of the typical Bayesian workflow like getting my real data to fit my Stan input, helping restructure Stan code for performance, handling the aspects of coding up new model structure, and so on. I’ve even had some luck asking for open ended posterior predictive checks to identify areas where my current model falls short.

Even stopping short of fully automated, there are certainly parts of the workflow that I feel can be effectively automated with these tools today.

I’ve used coding models on industrial, statistical or ML workflows for a while, but in agentic form only since last autumn.

The latest ones have been with Nutpie samplers, Stan models, R and Python. My current workflow is that the agent does data conversions, models, diagnostics, and plotting quite autonomously, after a somewhat extensive initial prompt, a plan in an md file, and some back and forth. I occasionally comment on details like non-convergence and give tuning tips such as the low-rank mass matrix adaptation available in Nutpie. Then I look at the plots and diagnostics and think for an hour or two, and continue to the next iteration.

I consider myself an experienced Stan programmer, but the latest Opuses beat me on technical details like vectorizing or handling likelihoods semi-manually. I still have better intuition on geometry and the big picture, and I need to decide on, or at least accept, the main architectural characteristics and some details of parametrization. The agent is sometimes able to dig into convergence problems: it waits for models to run, looks at divergences and step sizes, runs experiments on fixing things, etc.

Obviously, all this is a huge change to what my work used to be. Now it’s mostly about the big picture, strategic decisions on how to model, and a lot more fast iteration. My statistics or ML models are more complex, as is the contextual surface both in inputs and outputs.

I do maybe half sampling, half optimization (often with Laplace on top) for prediction. I actually give the agent some choice on what it wants to use for implementation; it has often converged on Stan, maybe due to my implicit preferences, but I’ve been open to JAX etc., and the coding models are perfectly capable of writing most or all likelihoods manually, so the implementation layer is really quite fluid now.
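In case “optimization with Laplace on top” is unfamiliar: find the posterior mode, then approximate the posterior by a Gaussian whose variance comes from the curvature at the mode. Below is a toy one-dimensional sketch using Newton steps and finite differences. The binomial example (30 successes in 100 trials, flat prior on the logit scale) and all function names are made up for illustration; real workflows would use the optimizer and Laplace tools of their framework.

```python
import math

def laplace_1d(log_density, start, step=1e-4, iters=25):
    """Return (mode, sd) of a Gaussian approximation to a 1-D log density."""
    def grad(t):
        return (log_density(t + step) - log_density(t - step)) / (2.0 * step)

    def hess(t):
        return (log_density(t + step) - 2.0 * log_density(t)
                + log_density(t - step)) / step ** 2

    theta = start
    for _ in range(iters):
        theta -= grad(theta) / hess(theta)  # Newton step toward the mode
    # Gaussian variance is the negative inverse curvature at the mode.
    return theta, math.sqrt(-1.0 / hess(theta))

def log_post(theta, k=30, n=100):
    # Unnormalized log posterior of a logit-scale success probability.
    return k * theta - n * math.log1p(math.exp(theta))

mode, sd = laplace_1d(log_post, start=0.0)
print(mode, sd)
```

For this toy case the exact answers are known in closed form (mode at logit(0.3), curvature n·p·(1−p) = 21), which makes it easy to check that the numerical version behaves.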

(Edit: so… for me it’s not exactly automating the workflow, more like making it 5-20x faster, and handling levels of complexity not feasible otherwise.)

(Edit2: David Shor, somewhat known as a data scientist working for the US Democrats, said he personally spent 20k USD on Claude Code tokens in April. The amount sounds excessive (I use maybe 200 EUR per month), but it is indicative of how the future will be.)