Bayesian modeling with Stan is beautiful, principled, and powerful — but it requires a bit of babysitting and knowledge of the MCMC process. Divergences, R-hat, reparameterizations, … This gap between “just run lm()” and “write a well-specified Stan model” keeps some practitioners away.
I recently stumbled across Karpathy's [autoresearch](https://github.com/karpathy/autoresearch), in which coding agents like Claude Code autonomously run research on single-GPU nanochat training overnight. As a pet project, I tried something similar for Bayesian modeling: I gave Claude Code a dataset and a short natural-language description. The agent iterates on the Stan model file, guided by only two feedback signals: the NLPD (negative log predictive density, i.e. the negative of the log score) on held-out data, and the Stan log output (divergences, R-hat, ESS). There is no domain knowledge baked in and no custom framework; it figures out things like reparameterization on its own and iteratively writes better Stan code. That's a bit crazy.
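To make that concrete for readers who haven't fought the funnel: the classic fix of this kind is the non-centered parameterization of a hierarchical model. Here is a minimal sketch (my illustration, not the agent's actual output; the data layout and variable names are made up):

```stan
data {
  int<lower=1> N;
  int<lower=1> J;
  array[N] int<lower=1, upper=J> group;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> tau;
  vector[J] theta_raw;        // standardized group effects
  real<lower=0> sigma;
}
transformed parameters {
  // non-centered: sample on the unit scale, then shift and rescale
  vector[J] theta = mu + tau * theta_raw;
}
model {
  mu ~ normal(0, 5);
  tau ~ normal(0, 2);         // half-normal via the <lower=0> constraint
  theta_raw ~ std_normal();   // implies theta ~ normal(mu, tau)
  sigma ~ normal(0, 2);
  y ~ normal(theta[group], sigma);
}
```

Sampling `theta_raw` on the unit scale and rescaling in `transformed parameters` removes the correlation between `tau` and the group effects that causes divergences in the centered form.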
On a regression dataset with outliers, the agent progressed from a naive linear regression to a contamination mixture model, matching TabPFN's predictive performance while remaining fully interpretable.
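For context, a contamination mixture means each observation is generated either by the regression line or, with small probability, by a much wider outlier component. A minimal sketch of such a model (my illustration under standard assumptions, not the exact model the agent produced):

```stan
data {
  int<lower=1> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;
  real beta;
  real<lower=0> sigma;             // noise scale of the "clean" component
  real<lower=0> sigma_out;         // extra scale of the contamination component
  real<lower=0, upper=1> pi_out;   // contamination probability
}
model {
  alpha ~ normal(0, 10);
  beta ~ normal(0, 10);
  sigma ~ normal(0, 5);
  sigma_out ~ normal(0, 50);
  pi_out ~ beta(1, 9);             // outliers assumed to be rare

  for (n in 1:N) {
    real mu = alpha + beta * x[n];
    // mix the clean and contaminated likelihoods for each point
    target += log_mix(pi_out,
                      normal_lpdf(y[n] | mu, sigma + sigma_out),
                      normal_lpdf(y[n] | mu, sigma));
  }
}
```

Note that the discrete outlier indicator has to be marginalized out (here via `log_mix`), since Stan has no discrete parameters; that is exactly the kind of Stan-specific idiom the agent needs to discover on its own.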
This is more a rough write-up of an experiment than a polished paper. I'm genuinely unsure what to make of it and curious what you think. Have you run similar experiments with agents like Claude Code writing Stan code?
arXiv: http://arxiv.org/abs/2603.27766
GitHub: https://github.com/tidit-ch/autostan