AutoStan: Can we automate (parts of) Bayesian workflow with coding agents?

Bayesian modeling with Stan is beautiful, principled, and powerful, but it requires a bit of babysitting and knowledge of the MCMC process: divergences, R-hat, reparameterizations, and so on. This gap between “just run lm()” and “write a well-specified Stan model” keeps some practitioners away.

I recently stumbled across Karpathy’s autoresearch (https://github.com/karpathy/autoresearch), in which a coding agent like Claude Code is used to autonomously optimize the training process of an LLM overnight. As a pet project, I tried something similar for Bayesian modeling: I gave Claude Code a dataset and a short natural-language description. The agent iterates on the Stan model file, guided by only two feedback signals: NLPD (i.e., the log score) on held-out data and the Stan log file (divergences, R-hat, ESS). No domain knowledge baked in, no custom framework; it figures out things like reparameterization on its own and iteratively writes better Stan code. That’s a bit crazy.
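To make the loop concrete, here is a rough sketch in Stan of the two hooks it relies on (my illustration, not code from the AutoStan repo): a non-centered parameterization of a hierarchical effect, the kind of fix the agent tends to discover once the log file reports divergences, and a generated quantities block that emits per-point held-out log-likelihoods, which the harness can reduce to NLPD = −(1/N_test) Σ_n log((1/S) Σ_s exp(log_lik_test[n, s])) over S posterior draws.

```stan
data {
  int<lower=1> N;                  // training points
  int<lower=1> N_test;             // held-out points
  int<lower=1> J;                  // groups
  array[N] int<lower=1, upper=J> group;
  array[N_test] int<lower=1, upper=J> group_test;
  vector[N] y;
  vector[N_test] y_test;
}
parameters {
  real mu;
  real<lower=0> tau;
  vector[J] alpha_raw;             // non-centered auxiliary variable
  real<lower=0> sigma;
}
transformed parameters {
  // non-centered: sample alpha_raw on the unit scale, then shift/scale
  vector[J] alpha = mu + tau * alpha_raw;
}
model {
  mu ~ normal(0, 5);
  tau ~ normal(0, 1);
  alpha_raw ~ std_normal();
  sigma ~ normal(0, 1);
  y ~ normal(alpha[group], sigma);
}
generated quantities {
  // held-out log-likelihoods: the harness reduces these to NLPD
  vector[N_test] log_lik_test;
  for (n in 1:N_test)
    log_lik_test[n] = normal_lpdf(y_test[n] | alpha[group_test[n]], sigma);
}
```

The centered version (alpha ~ normal(mu, tau)) would be the naive starting point; rewriting it as above is exactly the sort of edit that turns a divergence-ridden log file into a clean one.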

On a regression dataset with outliers, the agent progressed from a naive linear regression to a contamination mixture model, matching TabPFN while remaining fully interpretable.
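For readers who haven’t seen one: a contamination mixture treats each observation as coming from the regression line with probability 1 − lambda, or from a variance-inflated outlier component with probability lambda. A minimal sketch (my illustration, not the model the agent actually wrote):

```stan
data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
  real<lower=sigma> sigma_out;     // inflated scale; bound keeps components identified
  real<lower=0, upper=1> lambda;   // contamination fraction
}
model {
  alpha ~ normal(0, 5);
  beta ~ normal(0, 5);
  sigma ~ normal(0, 1);
  sigma_out ~ normal(0, 10);
  lambda ~ beta(1, 9);             // outliers assumed rare a priori
  for (n in 1:N) {
    real mu_n = alpha + beta * x[n];
    target += log_mix(lambda,
                      normal_lpdf(y[n] | mu_n, sigma_out),  // outlier component
                      normal_lpdf(y[n] | mu_n, sigma));     // regression line
  }
}
```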

This is really more a rough write-up of an experiment than a polished paper. I’m genuinely unsure what to make of it and curious what you all think. Have you run similar experiments with agents like Claude Code writing Stan code?

arXiv: http://arxiv.org/abs/2603.27766
GitHub: https://github.com/tidit-ch/autostan


Considering how many case studies, git repos, and discussions about these exist, and how well recent Claude versions perform on other tasks, I’m not surprised.

I have used Claude for writing and editing Stan code, but not in a fully automated fashion. If I did, I would prefer to see the metrics for the intermediate models, too. I would also likely provide more context and certain components that I’d want tested anyway. A completely automatic process has the problems of overfitting and of losing explainability when the components are not well informed (the usual mixture-model failure mode), which you also listed and which were obvious in the Automatic Statistician years ago (it seems the Automatic Statistician web page is no longer working). I think using AI agents to speed up model exploration can be useful, but you would still need to understand what they produce, as they do silly things, too.

Thanks for sharing your experiments!


Thanks Aki!

> Considering how many case studies, git repos, and discussions about these exist, and how well recent Claude versions perform on other tasks, I’m not surprised.

Yes, there is so much Bayes in repos and case studies that coming up with a novel dataset for agents is genuinely hard. I tried anonymizing a football dataset, but Claude figured it out in the first iteration and went straight to a Poisson attack/defense structure. When I asked, it said something along the lines of: *“18 entities, integers from 0 to 7: this must be soccer.”*
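For context, the attack/defense structure it jumped to is the standard hierarchical football model, roughly like this (my sketch here, not the agent’s actual output):

```stan
data {
  int<lower=1> G;                                // games
  int<lower=2> T;                                // teams
  array[G] int<lower=1, upper=T> home, away;
  array[G] int<lower=0> goals_home, goals_away;
}
parameters {
  real home_adv;
  vector[T] att_raw;
  vector[T] def_raw;
  real<lower=0> sigma_att;
  real<lower=0> sigma_def;
}
transformed parameters {
  // sum-to-zero centering for identifiability
  vector[T] att = att_raw - mean(att_raw);
  vector[T] def = def_raw - mean(def_raw);
}
model {
  home_adv ~ normal(0, 1);
  sigma_att ~ normal(0, 1);
  sigma_def ~ normal(0, 1);
  att_raw ~ normal(0, sigma_att);
  def_raw ~ normal(0, sigma_def);
  goals_home ~ poisson_log(home_adv + att[home] - def[away]);
  goals_away ~ poisson_log(att[away] - def[home]);
}
```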

> I have used Claude for writing and editing Stan code, but not in a fully automated fashion. If I did, I would prefer to see the metrics for the intermediate models, too.

Regarding intermediate metrics: they are logged. The full NLPD trajectory per iteration is shown in Figure 1 of the paper, and the complete iteration history is available in the repo; see `results` and `models` for the Stan files.

> I would also likely provide more context and certain components that I’d want tested anyway.

Yes, that is definitely something worth exploring further.

> A completely automatic process has the problems of overfitting and of losing explainability when the components are not well informed (the usual mixture-model failure mode), which you also listed and which were obvious in the Automatic Statistician years ago (it seems the Automatic Statistician web page is no longer working).

Overfitting might be handled to some degree by the held-out NLPD, but explainability is a harder problem.

> I think using AI agents to speed up model exploration can be useful, but you would still need to understand what they produce, as they do silly things, too.

Definitely, but the nice thing about Stan is that you can at least read the code and understand what the agent is doing.

Thanks for sharing these results. On one hand, it’s unsurprising that the agents reliably produced strong-performing models on the tasks you laid out, given that these are fairly standard modeling tasks; on the other, it’s still impressive that they can pull off something that takes a decent bit of domain expertise in Bayesian modeling.

It would be interesting to find real-world datasets where the right modeling approach involves a tricky latent structure that requires a good deal of reasoning to get something sensible.

Beyond the fully automated modeling process you show here, I’ve had some good luck using agents to assist with the more tedious aspects of the typical Bayesian workflow: getting my real data into the shape my Stan program expects, restructuring Stan code for performance, coding up new model structures, and so on. I’ve even had some luck asking for open-ended posterior predictive checks to identify areas where my current model falls short.
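The posterior predictive hook itself is cheap to ask for; the pattern is just replicating the data in generated quantities and comparing y_rep to y. A minimal sketch (mine, on a plain regression, not code from any of the repos mentioned here):

```stan
data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;
}
model {
  alpha ~ normal(0, 5);
  beta ~ normal(0, 5);
  sigma ~ normal(0, 1);
  y ~ normal(alpha + beta * x, sigma);
}
generated quantities {
  // posterior predictive replicates: compare y_rep to y (means, tails,
  // residual patterns) to see where the model falls short
  array[N] real y_rep = normal_rng(alpha + beta * x, sigma);
}
```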

Even stopping short of full automation, there are certainly parts of the workflow that I feel can be effectively automated with these tools today.
