Over the last few years I’ve been using Stan to implement Bayesian inference. However, I’ve never properly learnt Stan’s syntax or how it really works behind the scenes, so I often get stuck on some Stan ‘subtlety’ (which really means a gap in my understanding).
Could someone please point me to the right material for properly learning Stan once and for all? I’d also appreciate suggestions on material for understanding how Stan actually works, so that I can at least get an idea of the primary reasons for divergences.
Is it Stan per se (i.e. how Stan constructs sums of log densities, rather than the more graphical paradigm of PyMC3) or is it the probabilistic programming language itself? For the latter, I really do think the user’s guide and reference manual are great. If you’ve already spent a good deal of time there, I’d also recommend the Stan case studies and tutorials, where you may find implementations (including discussion of optimizations and pitfalls) relevant to you and your research. In general, Richard McElreath’s Statistical Rethinking book and video lectures are hard not to recommend as a gentle introduction to Stan and working with NUTS-HMC.
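To clarify what I mean by ‘sums of log densities’, here’s a minimal sketch (a toy normal model I made up, not from any of the material above) showing that Stan’s sampling statements are just shorthand for adding terms to the target log density:

```stan
// Toy model, only to illustrate how the log density is accumulated.
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  // Sampling-statement form:
  mu ~ normal(0, 5);
  sigma ~ normal(0, 5);   // half-normal because of the lower bound
  y ~ normal(mu, sigma);
  // Equivalent (up to constant terms) explicit form:
  // target += normal_lpdf(mu | 0, 5);
  // target += normal_lpdf(sigma | 0, 5);
  // target += normal_lpdf(y | mu, sigma);
}
```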
I’m assuming you’d like recommendations to understand HMC broadly: how pathologies may arise and how they relate to divergent transitions. For that, I highly recommend starting here: Michael Betancourt. 2017. “A Conceptual Introduction to Hamiltonian Monte Carlo.” arXiv:1701.02434.
In the paper Visualization in the Bayesian Workflow, section 4 discusses what divergences indicate. In particular, figure 5 shows a model where the sample contains divergences, but the divergences don’t indicate a problem (as opposed to divergences clustered in the neck of a funnel distribution, which is a very common problem).
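To make the funnel pathology concrete, here’s a minimal sketch of Neal’s funnel (my own toy example; parameter names are just illustrative). In the centered form below, the scale of theta collapses as log_tau becomes small, and that narrow neck is where divergences typically cluster:

```stan
parameters {
  real log_tau;       // controls the width of the funnel
  vector[9] theta;
}
model {
  log_tau ~ normal(0, 3);
  // Centered parameterization: theta's scale shrinks with exp(log_tau / 2),
  // producing the narrow neck that HMC struggles to explore.
  theta ~ normal(0, exp(log_tau / 2));
}
```

The usual fix is a non-centered parameterization, i.e. sampling theta_raw ~ normal(0, 1) and setting theta = exp(log_tau / 2) * theta_raw.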
It’s the programming language itself, so thanks a lot for the pointers and recommendations; I’ll try to read them all. At some point I’d also love to understand Stan per se.
These suggestions look promising, as I’d like to understand HMC a bit better. Much appreciated.
Thanks a lot for your suggestions. I’m looking forward to reading them.
By the way, this effort was driven by the need to better understand how Stan works in order to implement a Bayesian spatio-temporal model for ecological data. I reckon your work on ICARs and Connor’s on CARs would be building blocks for the implementation.
Recently I’ve been trying to explain to brms users how to specify models directly in Stan, and this would be a good starting point for users of other R packages who are comfortable specifying formulas but find Stan’s syntax daunting:
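Separately from that link, here’s a rough sketch (my own toy example, not taken from the linked material) of the kind of translation I mean: a simple formula like y ~ x corresponds to a Stan program along these lines, with placeholder priors rather than brms defaults:

```stan
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;             // intercept
  real beta;              // coefficient on x
  real<lower=0> sigma;    // residual standard deviation
}
model {
  // Placeholder weakly informative priors
  alpha ~ normal(0, 5);
  beta ~ normal(0, 5);
  sigma ~ exponential(1);
  y ~ normal(alpha + beta * x, sigma);
}
```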
@mitzimorris This is a great recommendation; I hadn’t read this one before! It’s much more approachable than the Betancourt paper I linked above… Thanks for sharing! Together with the very helpful “Visualization in the Bayesian Workflow” paper, it makes an awesome starting point.
I’m currently working on writing a clean standalone description of the form of NUTS used in Stan. There’s not a good reference for that. Betancourt’s paper sketches out the main ideas, but there’s nothing like a piece of pseudocode that describes the algorithm in one place (if people know of one, please link in comments!).
The cleanest description of the algorithm I know is the one we’re analyzing in our Gibbs self-tuning papers.
This repo also has a parallel file walnuts.hpp, which is our new algorithm that adapts the step size similarly to how NUTS adapts the number of steps. The paper and implementation for Stan models should be out soon. I’ve been lobbying for releasing it in the style of Adrian Seyboldt’s (@aseyboldt) Nutpie, i.e., following the Nutpie strategy of releasing samplers that work with both Stan and PyMC.
P.S. The author of that nice paper on HMC marginalization cited above, Cole Monnahan (@monnahc), is working on some really cool initialization for HMC using Laplace approximations derived from max marginal likelihood fits from TMB (the fisheries and wildlife version of lme4 that’s embedded in ADMB). We’ll keep you posted.
Oh, and I’d also recommend my own intro to Stan, which has some introductory material on how sampling and MC(MC) methods work in general that I think of as required background for understanding how Bayesian posterior inference with MCMC works:
It doesn’t go into detail about how the Stan language works, though.
I also really like my intro to basic probability theory in the appendix (yes, sigma algebras, but no heavy measure theory). It was drawn out of me by my colleagues in the Center for Computational Mathematics here at Flatiron Institute, all of whom are ridiculously good at math, but don’t do much probability. I personally wanted to learn at least this much probability theory because I couldn’t understand what people meant by “random variable” in the more introductory texts. I based it on my favorite intro to probability theory, which I found in an appendix to a signal processing book from the 1970s (Anderson and Moore, Optimal Filtering).
Thanks a lot for the clarification on the links between the algorithm implemented in Stan and NUTS, as well as for the heads-up on what’s coming. Looking forward to reading the paper.
I’ve skimmed it, and I reckon it would be good practice to read it and reproduce the examples in Python; it has been a while since I’ve used Python for statistics-related tasks.
FWIW, I have an old implementation of NUTS in R which may be helpful if that’s the language you’re most comfortable with. It’s “clean” and has dual averaging. It shouldn’t be used for anything besides toying around to understand the algorithm better, but it’s hopefully more accessible than other implementations. It’s also 9 years old, so hopefully it still runs!