Over the last few years I’ve been using Stan to implement Bayesian inference. However, I’ve never properly learnt Stan’s syntax or how it really works behind the scenes, so I often get stuck on some Stan ‘subtlety’ (which really means a gap in my understanding).
Could someone please point me to the right material for properly learning Stan once and for all? I’d also appreciate suggestions on material for understanding how Stan actually works, so that I can at least get an idea of the primary reasons for divergences.
Is it Stan per se (i.e. how Stan constructs sums of log densities, rather than the more graphical paradigm of PyMC3) or is it the probabilistic programming language itself? For the latter, I really do think the user’s guide and reference manual are great. If you’ve already spent a good deal of time there, I’d also recommend the Stan case studies and tutorials, where you may find implementations (including discussion of optimizations and pitfalls) relevant to you and your research. In general, Richard McElreath’s Statistical Rethinking book and video lectures are hard not to recommend as a gentle introduction to Stan and working with NUTS-HMC.
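To clarify what I mean by ‘sums of log densities’, here’s a minimal sketch (a toy normal model I made up, not from any of the material above) showing that Stan’s sampling statements are just shorthand for adding terms to the target log density:

```stan
// Toy model, only to illustrate how the log density is accumulated.
data {
  int<lower=0> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
model {
  // Sampling-statement form:
  mu ~ normal(0, 5);
  sigma ~ normal(0, 5);   // half-normal because of the lower bound
  y ~ normal(mu, sigma);
  // Equivalent (up to constant terms) explicit form:
  // target += normal_lpdf(mu | 0, 5);
  // target += normal_lpdf(sigma | 0, 5);
  // target += normal_lpdf(y | mu, sigma);
}
```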
I’m assuming you’d like recommendations to understand HMC broadly: how pathologies may arise and how they relate to divergent transitions. For that, I highly recommend starting here: Michael Betancourt. 2017. “A Conceptual Introduction to Hamiltonian Monte Carlo.” arXiv:1701.02434.
In the paper Visualization in the Bayesian Workflow, section 4 discusses what divergences indicate. In particular, figure 5 shows a model where the sample contains divergences, but the divergences don’t indicate a problem (as opposed to divergences clustered in the neck of a funnel distribution, which is a very common problem).
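To make the funnel pathology concrete, here’s a minimal sketch of Neal’s funnel (my own toy example; parameter names are just illustrative). In the centered form below, the scale of theta collapses as log_tau becomes small, and that narrow neck is where divergences typically cluster:

```stan
parameters {
  real log_tau;       // controls the width of the funnel
  vector[9] theta;
}
model {
  log_tau ~ normal(0, 3);
  // Centered parameterization: theta's scale shrinks with exp(log_tau / 2),
  // producing the narrow neck that HMC struggles to explore.
  theta ~ normal(0, exp(log_tau / 2));
}
```

The usual fix is a non-centered parameterization, i.e. sampling theta_raw ~ normal(0, 1) and setting theta = exp(log_tau / 2) * theta_raw.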
It’s the programming language itself, so thanks a lot for the pointers and recommendations; I’ll try to read them all. At some point I’d also love to understand Stan per se.
These suggestions look promising, as I’d like to understand HMC a bit better. Much appreciated.
Thanks a lot for your suggestions. I’m looking forward to reading them.
By the way, this effort was driven by the need to better understand how Stan works in order to implement a Bayesian spatio-temporal model for ecological data. I reckon your work on ICARs and Connor’s on CARs would be building blocks for the implementation.
Recently I’ve been trying to explain to brms users how to specify models directly in Stan, and this would be a good starting point for users of other R packages who are comfortable specifying formulas but find Stan’s syntax daunting:
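Separately from that link, here’s a rough sketch (my own toy example, not taken from the linked material) of the kind of translation I mean: a simple formula like y ~ x corresponds to a Stan program along these lines, with placeholder priors rather than brms defaults:

```stan
data {
  int<lower=0> N;
  vector[N] x;
  vector[N] y;
}
parameters {
  real alpha;             // intercept
  real beta;              // coefficient on x
  real<lower=0> sigma;    // residual standard deviation
}
model {
  // Placeholder weakly informative priors
  alpha ~ normal(0, 5);
  beta ~ normal(0, 5);
  sigma ~ exponential(1);
  y ~ normal(alpha + beta * x, sigma);
}
```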
@mitzimorris This is a great recommendation; I hadn’t read this one before! It’s much more approachable than the Betancourt paper I linked above… Thanks for sharing! Together with the very helpful “Visualization in the Bayesian Workflow” paper, it makes an awesome starting point.
I’m currently working on writing a clean standalone description of the form of NUTS used in Stan. There’s not a good reference for that. Betancourt’s paper sketches out the main ideas, but there’s nothing like a piece of pseudocode that describes the algorithm in one place (if people know of one, please link in comments!).
The cleanest description of the algorithm I know is the one we’re analyzing in our Gibbs self-tuning papers.
This repo also has a parallel file walnuts.hpp, which is our new algorithm that adapts the step size similarly to how NUTS adapts the number of steps. The paper and implementation for Stan models should be out soon. I’ve been lobbying for releasing it in the style of Adrian Seyboldt’s (@aseyboldt) Nutpie, i.e., following the Nutpie strategy of releasing samplers that work with both Stan and PyMC.
P.S. The author of that nice paper on HMC marginalization cited above, Cole Monnahan (@monnahc), is working on some really cool initialization for HMC using Laplace approximations derived from max marginal likelihood fits from TMB (the fisheries and wildlife version of lme4 that’s embedded in ADMB). We’ll keep you posted.
Oh, and I’d also recommend my own intro to Stan, which has some introductory material on how sampling and MC(MC) methods work in general that I think of as required background for understanding how Bayesian posterior inference with MCMC works:
It doesn’t go into detail about how the Stan language works, though.
I also really like my intro to basic probability theory in the appendix (yes, sigma algebras, but no heavy measure theory). It was drawn out of me by my colleagues in the Center for Computational Mathematics here at Flatiron Institute, all of whom are ridiculously good at math, but don’t do much probability. I personally wanted to learn at least this much probability theory because I couldn’t understand what people meant by “random variable” in the more introductory texts. I based it on my favorite intro to probability theory, which I found in an appendix to a signal processing book from the 1970s (Anderson and Moore, Optimal Filtering).
Thanks a lot for the clarification on the links between the algorithm implemented in Stan and NUTS, as well as for the heads-up on what’s coming. Looking forward to reading the paper.
I’ve skimmed it, and I reckon it would be good practice to read it and reproduce the examples in Python; it has been a while since I’ve used Python for statistics-related tasks.
FWIW, I have an old implementation of NUTS in R which may be helpful if that’s the language you’re most comfortable with. It’s “clean” and has dual averaging. It shouldn’t be used for anything besides toying around to understand the algorithm better, but it’s hopefully more accessible than other implementations. It’s also 9 years old, so hopefully it still runs!