In my Bayesian seminar today, we discussed at length how the step size and `adapt_delta` change the way we explore and sample from the posterior distribution. We were looking at the Hoffman & Gelman (2014) paper, but I'm wondering whether there is a more intuitive or accessible explanation of what these hyperparameters do, how they affect our exploration of the posterior, what the consequences are, and what the thinking behind them was.
Does anyone know of a blog post, journal article, or other explanation that describes NUTS in broader, more conceptual terms?
`adapt_delta` just sets the target acceptance rate for the sampler. A higher target acceptance rate means adaptation will settle on a smaller step size. Once warmup is done, the step size is locked in.
How adaptation works has changed across versions, and the notion of a target acceptance rate is now more complicated because we're no longer using the basic NUTS algorithm.
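To make the adaptation idea concrete, here's a toy sketch of the dual-averaging scheme from Hoffman & Gelman (2014, Algorithm 5) that Stan uses during warmup to tune the step size toward the target acceptance rate. The `accept_fn` below is a made-up stand-in (acceptance simply decays as the step size grows), not Stan's actual acceptance statistic, and the constants are just the paper's defaults:

```python
import math

def dual_averaging(target_accept, accept_fn, n_iter=500,
                   mu=math.log(1.0), gamma=0.05, t0=10.0, kappa=0.75):
    """Dual-averaging step-size adaptation (Hoffman & Gelman 2014, Alg. 5).

    accept_fn(step_size) -> acceptance probability at that step size
    (a toy stand-in here for the sampler's acceptance statistic).
    """
    log_eps = mu      # current log step size
    log_eps_bar = 0.0  # running (averaged) log step size
    h_bar = 0.0        # running average of (target - observed) acceptance
    for t in range(1, n_iter + 1):
        accept = accept_fn(math.exp(log_eps))
        # Track how far observed acceptance is from the target.
        h_bar = (1 - 1 / (t + t0)) * h_bar + (target_accept - accept) / (t + t0)
        # Acceptance below target -> h_bar grows -> step size shrinks.
        log_eps = mu - math.sqrt(t) / gamma * h_bar
        # Polyak-style averaging of the log step size.
        eta = t ** (-kappa)
        log_eps_bar = eta * log_eps + (1 - eta) * log_eps_bar
    return math.exp(log_eps_bar)

# Toy acceptance model: larger steps mean more integration error,
# hence lower acceptance.
accept_fn = lambda eps: math.exp(-eps)

eps_80 = dual_averaging(0.80, accept_fn)  # adapted step size, target 0.80
eps_99 = dual_averaging(0.99, accept_fn)  # adapted step size, target 0.99
```

Running this, `eps_99` comes out smaller than `eps_80`, which is the whole point: pushing `adapt_delta` up buys you smaller, more careful steps at the cost of more leapfrog work per iteration.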
The main issue you run into is conditioning, the usual bugbear of any gradient-based algorithm. If the sampler reaches a region of the posterior where the step size is too large, you get divergences. We only use gradient-based (i.e., first-order) approximations of the real posterior curvature, so in highly curved regions we need small step sizes to follow it accurately.
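You can see the conditioning problem with a bare-bones leapfrog integrator on a badly conditioned Gaussian. This is an illustrative sketch, not Stan's implementation: a step size that is fine for the wide direction makes the integrator explode along the narrow one, which is exactly the energy blow-up that gets flagged as a divergence:

```python
import numpy as np

def leapfrog(q, p, grad_U, eps, n_steps):
    """Leapfrog integrator for Hamiltonian dynamics (identity mass matrix)."""
    q, p = q.copy(), p.copy()
    p = p - 0.5 * eps * grad_U(q)           # initial half-step for momentum
    for _ in range(n_steps - 1):
        q = q + eps * p                      # full position step
        p = p - eps * grad_U(q)              # full momentum step
    q = q + eps * p
    p = p - 0.5 * eps * grad_U(q)            # final half-step for momentum
    return q, p

# Badly conditioned 2-D Gaussian: standard deviations 1 and 0.01,
# so the covariance has condition number 1e4.
scales = np.array([1.0, 1e-2])
U = lambda q: 0.5 * np.sum((q / scales) ** 2)        # negative log density
grad_U = lambda q: q / scales ** 2

def energy_error(eps, n_steps=20, seed=0):
    """|H(end) - H(start)| after one leapfrog trajectory."""
    rng = np.random.default_rng(seed)
    q0 = rng.normal(size=2) * scales
    p0 = rng.normal(size=2)
    H0 = U(q0) + 0.5 * p0 @ p0
    q1, p1 = leapfrog(q0, p0, grad_U, eps, n_steps)
    return abs(U(q1) + 0.5 * p1 @ p1 - H0)

big = energy_error(0.1)     # too large for the narrow direction: huge error
small = energy_error(0.005)  # small enough for both directions: tiny error
```

Leapfrog on a 1-D Gaussian with standard deviation σ is only stable for step sizes below about 2σ, so `eps = 0.1` is wildly unstable along the σ = 0.01 axis even though it would be perfectly fine for the σ = 1 axis alone. Raising `adapt_delta` shrinks the adapted step size toward what the narrowest direction demands.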