The role of "max_treedepth" in No-U-Turn?

Hi,

I am working on a model where I saw the treedepth could reach 14, leading to slow fitting. I wanted to understand the role of this parameter better, so I tested a simpler model with different max_treedepth values. It seems to me that too low a max_treedepth leads to inefficient sampling (low ESS).

Could someone explain how max_treedepth works in the algorithm? What is an appropriate value of max_treedepth when using the stan() function in the rstan package? Thank you.

Stan uses the No-U-Turn-Sampler (NUTS) described in [1111.4246] The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo.

NUTS builds a binary tree by taking forward/backward “directional” steps to explore the target posterior distribution, guided towards the highest-probability-density regions by the gradient of the log-posterior.

The max_treedepth parameter caps the depth of the binary tree that NUTS builds, so the number of steps in a trajectory is at most 2^{\text{max_treedepth}}. The default is 10, which means Stan builds a maximum of 2^{10} = 1024 nodes; equivalently, the binary tree has a maximum size (or “height”) of 10.
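To make the arithmetic concrete, here is a tiny Python sketch (plain arithmetic, not Stan internals) of how the step budget grows with max_treedepth:

```python
# Maximum number of leapfrog steps per iteration implied by a given
# max_treedepth. This is plain arithmetic, not Stan's implementation.
def max_leapfrog_steps(max_treedepth):
    return 2 ** max_treedepth

print(max_leapfrog_steps(10))  # 1024  (the default)
print(max_leapfrog_steps(14))  # 16384 (the treedepth seen in the question)
```

So an iteration that reaches treedepth 14 can cost 16 times as many gradient evaluations as one capped at the default depth of 10, which is consistent with the slow fitting described in the question.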

Now, for setting a max_treedepth value, one should heed two sources of advice:

  1. Taming Divergences in Stan Models by @martinmodrak
  2. Divergent transitions - a primer also by @martinmodrak

I have a personal note on my computer, and unfortunately I don’t know what in it is original and what was copied from someone else. So if I am not giving proper credit, let me know and I will fix it.


A divergence arises when the simulated Hamiltonian trajectory departs from the true trajectory as measured by departure of the Hamiltonian value from its initial value.

  1. Check your code. Twice. Divergences are almost as likely the result of a programming error as a truly statistical issue. Do all parameters have a prior? Do your array indices and for loops match?
  2. Create a simulated dataset with known true values of all parameters. It is useful for many things (including checking for coding errors). If the errors disappear on simulated data, your model may be a bad fit for the actual observed data.
  3. Check your priors. If the model is sampling heavily in the very tails of your priors or at the boundaries of parameter constraints, that is a bad sign.
  4. Visualisations: use mcmc_parcoord from the [bayesplot](<https://cran.r-project.org/web/packages/bayesplot/index.html>) package, ShinyStan and pairs from rstan. See the documentation for Stan warnings (contains a few hints), Case study - diagnosing a multilevel model, Gabry et al. 2017 - Visualization in Bayesian workflow.
  5. Make sure your model is identifiable - non-identifiability and/or multimodality (multiple local maxima of the posterior distribution) is a problem. Case study - mixture models, my post on non-identifiable models and how to spot them.
  6. Run Stan with the test_grad option.
  7. Reparametrize your model to make your parameters independent (uncorrelated) and close to N(0, 1) (i.e. change the actual parameters and compute your parameters of interest in the transformed parameters block).
  8. Try a non-centered parametrization - this is a special case of reparametrization that is so frequently useful that it deserves its own bullet. Case study - diagnosing a multilevel model, Betancourt & Girolami 2015.
  9. Move parameters to the data block and set them to their true values (from simulated data). Then return them one by one to the parameters block. Which parameter introduces the problems?
  10. Introduce tight priors centered at the true parameter values. How tight do the priors need to be to let the model fit? Useful for identifying multimodality.
  11. Play a bit more with adapt_delta, stepsize and max_treedepth. Example
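To illustrate bullet 8, here is a minimal, hypothetical Python sketch of the idea behind a non-centered parametrization (in a Stan program this would live in the parameters and transformed parameters blocks): instead of sampling theta directly from N(mu, tau), you sample a standardized variable and shift and scale it deterministically, which removes the direct coupling between theta and the hyperparameters mu and tau.

```python
# Hypothetical sketch of a non-centered parametrization, not Stan code.
# Centered:     theta ~ Normal(mu, tau)       (theta, mu, tau coupled)
# Non-centered: theta_raw ~ Normal(0, 1)
#               theta = mu + tau * theta_raw  (deterministic transform)
def non_centered(theta_raw, mu, tau):
    return mu + tau * theta_raw

# The sampler explores theta_raw on a fixed N(0, 1) scale; theta is
# recovered afterwards, whatever mu and tau currently are.
print(non_centered(0.0, 2.0, 0.5))  # 2.0 (theta_raw = 0 maps to the mean)
print(non_centered(2.0, 2.0, 0.5))  # 3.0 (two raw sd's above the mean)
```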

The dynamic Hamiltonian Monte Carlo sampler used in Stan is not the No-U-Turn sampler. Trajectories are still built from binary trees but just about everything else has changed.

Saturating the maximum tree depth is not related to divergent trajectories. Most of the advice that follows unfortunately doesn’t apply.

Hamiltonian Monte Carlo explores a given probability distribution with deterministic trajectories that are able to span large regions of your model configuration (i.e. parameter) space. In practice we can’t construct these trajectories exactly, but we can approximate them with numerical trajectories that consist of a discrete sequence of points. The accuracy of this approximation is determined by a step size. The smaller the step size the more points in the numerical trajectory but the closer the numerical trajectory is to the exact continuous trajectory. The larger the step size the fewer points in the numerical trajectory but the more inaccurate it will be.
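The step-size trade-off can be seen in a minimal, hypothetical Python sketch (a 1D standard-normal target, whose exact Hamiltonian trajectories are known in closed form; this is an illustration, not Stan's implementation). A leapfrog trajectory with a small step size lands very close to the exact trajectory's endpoint; a large step size gets there in fewer steps but with much larger error:

```python
import math

# Exact Hamiltonian flow for U(q) = q^2/2 (standard normal target):
# Hamilton's equations dq/dt = p, dp/dt = -q give a rotation in phase space.
def exact(q0, p0, t):
    return (q0 * math.cos(t) + p0 * math.sin(t),
            -q0 * math.sin(t) + p0 * math.cos(t))

# Leapfrog (kick-drift-kick) numerical approximation of the same flow.
def leapfrog(q, p, eps, n_steps):
    for _ in range(n_steps):
        p -= 0.5 * eps * q   # half momentum update (grad U(q) = q)
        q += eps * p         # full position update
        p -= 0.5 * eps * q   # half momentum update
    return q, p

q0, p0, T = 1.0, 0.0, 2.0
q_exact, p_exact = exact(q0, p0, T)

def endpoint_error(eps):
    q, p = leapfrog(q0, p0, eps, int(round(T / eps)))
    return math.hypot(q - q_exact, p - p_exact)

# Smaller step size: more steps, but much closer to the exact trajectory.
print(endpoint_error(0.5))   # few steps, visibly larger error
print(endpoint_error(0.01))  # many steps, tiny error
```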

The performance of Hamiltonian Monte Carlo depends on the size of these trajectories. If the exact trajectory is too short then the sampler explores only slowly, and if the trajectory is too long then it can return to where it started and waste time on redundant exploration.
The more complex your target probability distribution the longer the exact trajectories will need to be to explore, and the smaller the step size will need to be so that our numerical approximations of those trajectories are sufficiently accurate. In other words the more complex your target probability distribution the more discrete steps you’ll need in each trajectory.

Stan uses a dynamic Hamiltonian Monte Carlo sampler that automatically determines how long these trajectories should be based on the behavior of your target probability distribution. If your target distribution is ill-defined and stretches all the way to infinity, however, then Stan would actually try to build an infinitely long trajectory and never stop. To avoid this there is, for safety, a maximum trajectory length of 2^{\text{max_treedepth}} steps, which equals 1024 for the default max_treedepth = 10.
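The dynamics of that cap can be sketched in a few lines of hypothetical Python. Stan's actual sampler is considerably more involved (multinomial sampling over the trajectory, U-turn criteria checked on subtrees), so this only shows the doubling structure and how max_treedepth bounds the number of steps:

```python
# Hypothetical sketch of dynamic trajectory doubling with a depth cap.
# Not Stan's implementation: should_stop stands in for the U-turn check.
def grow_trajectory(should_stop, max_treedepth=10):
    depth = 0
    n_steps = 0
    while depth < max_treedepth:
        n_steps += 2 ** depth       # the d-th doubling adds 2^d steps
        if should_stop(n_steps):    # e.g. a U-turn was detected
            break
        depth += 1
    return depth, n_steps

# A well-behaved target "U-turns" after a few doublings:
print(grow_trajectory(lambda n: n >= 7))   # (2, 7)
# A degenerate target never signals a U-turn and saturates the cap:
print(grow_trajectory(lambda n: False))    # (10, 1023)
```

Saturating the cap on many iterations means every one of those iterations paid for roughly 2^{\text{max_treedepth}} gradient evaluations without the trajectory ever turning around.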

If you’re saturating that maximum trajectory size then your target probability distribution is probably highly degenerate, and may even be non-identifiable. For some more discussion on these two terms and strategies for responding to maximum tree depth warnings see for example Identity Crisis.

There are an infinite number of ways that your target probability distribution could be complicated that could lead to this warning and so there’s no immediate fix. Instead you’ll have to investigate your target distribution as discussed in that link to identify what the problem is and then react to that particular problem.


Amazing! No one better than @betanalpha to explain HMC and NUTS. Thanks!


This is very clear and helpful. I still have a question regarding "Hamiltonian Monte Carlo explores a given probability distribution with deterministic trajectories that are able to span large regions of your model configuration (i.e. parameter) space. "

How is this deterministic trajectory achieved numerically? Is it based on gradient descent? These trajectories should be different for different chains; is that only because of the initial conditions? Thank you.

If you’re saturating that maximum trajectory size then your target probability distribution is probably highly degenerate, and may even be non-identifiable. For some more discussion on these two terms and strategies for responding to maximum tree depth warnings see for example Identity Crisis.

Yes, this model is ODE based; it was not straightforward to get it to run appropriately, and informative priors were needed. Now I have added a survival submodel on top of the ODE mixed-effects model. The target probability distribution is even more complex, and indeed might be non-identifiable. However, the interesting thing is that the two parts (the ODE mixed-effects model and the survival model) work fine independently, although the joint model doesn’t run appropriately.
I’ll need to carefully look into each parameter.

Did you mean the value of the target distribution can be infinite? The log target distribution could have very large absolute values, due to exp() terms.

Another question regarding Hamiltonian Monte Carlo explores a given probability distribution with deterministic trajectories that are able to span large regions of your model configuration (i.e. parameter) space.
Sometimes, I use constrained parameters. It happened to me once that after adding more constraints on parameters (because I knew from experience that the parameters should not reach those values), some chains just moved super quickly without sampling anything. Could you please explain the mechanism behind this? Thank you.

One important detail – every iteration of Markov chain Monte Carlo is constructed from a numerical trajectory. In other words, if you run with 1000 iterations then Stan’s dynamic Hamiltonian Monte Carlo will generate 1000 numerical trajectories, keeping only one point from each to be saved in Stan’s output. Each Markov chain will then work through an entire ensemble of different trajectories.

The deterministic, numerical trajectories are generated with a symplectic leapfrog integrator. For more see https://arxiv.org/abs/1701.02434, especially Section 5.1.
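A minimal, hypothetical Python sketch of that integrator (1D standard-normal target, not Stan internals) shows the two properties behind the questions above: the trajectory is fully deterministic given the starting point and momentum (so the only randomness comes from the resampled momentum and initial conditions, not gradient descent), and it is time-reversible, so flipping the momentum retraces the path:

```python
# Symplectic leapfrog for H(q, p) = U(q) + p^2 / 2.
# grad_U(q) = q corresponds to a standard-normal target; this is a
# sketch, not Stan's implementation.
def leapfrog(q, p, eps, n_steps, grad_U=lambda q: q):
    for _ in range(n_steps):
        p -= 0.5 * eps * grad_U(q)  # half momentum update
        q += eps * p                # full position update
        p -= 0.5 * eps * grad_U(q)  # half momentum update
    return q, p

# Deterministic: the same start always yields the same trajectory.
a = leapfrog(1.0, 0.5, 0.1, 50)
b = leapfrog(1.0, 0.5, 0.1, 50)
print(a == b)  # True

# Reversible: flip the momentum and you retrace the path back to
# (approximately) the starting point.
q1, p1 = a
q0, p0 = leapfrog(q1, -p1, 0.1, 50)
print(round(q0, 8), round(-p0, 8))
```

Gradients enter only through grad_U, which steers the momentum; there is no descent toward a minimum, just energy-conserving motion through the posterior.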

This is not uncommon. If individual component models are misbehaving then the joint model will also misbehave, but even if all the individual component models work okay on their own, the joint model can still misbehave. In that case the source of the problem lies in the interaction between the component models.

No, I mean that the posterior density function remains non-zero even as the parameter values become infinitely large, instead of decaying to zero.

Stan handles constraints by transforming the constrained parameter space to an unconstrained space without any boundaries that need to be considered. The Hamiltonian Monte Carlo sampler explores this unconstrained space and the output is then mapped back to the constrained space before being returned to the user. If the posterior distribution concentrates away from the constraint boundaries then there shouldn’t be much of a performance difference between exploring the constrained and unconstrained spaces, but if the posterior concentrates anywhere near a constraint boundary then you can see substantial differences.
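As a concrete sketch of the simplest case, Stan maps a lower-bounded parameter such as sigma > 0 to the unconstrained space with a log transform and accounts for the change of variables with a log-Jacobian term (a minimal Python illustration, not Stan's internal code):

```python
import math

# For sigma > 0, sample an unconstrained y = log(sigma) and map back
# with sigma = exp(y); the log-Jacobian log|d(exp(y))/dy| = y is added
# to the log target density. Sketch only, not Stan internals.
def unconstrain(sigma):
    return math.log(sigma)

def constrain(y):
    sigma = math.exp(y)
    log_jacobian = y
    return sigma, log_jacobian

# A posterior piling up against the boundary sigma = 0 corresponds to
# mass stretching toward -infinity in the unconstrained space, which
# is hard to explore:
print(unconstrain(0.001))  # about -6.9, far out in unconstrained space
```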

Some Markov chains running super quickly without actually moving suggests that the adaptation for those chains fell into a bad configuration, usually small step sizes and very large inverse metric elements. You can investigate this using the get_adaptation_info functions in RStan and PyStan.


When I used the individual-level longitudinal values predicted by the ODEs and integrated them into the survival part (a so-called “two-stage joint model”), the joint model worked well. Only when I tried to estimate the parameters of the two parts together did I get bad performance. Do you have an explanation? Thank you.

There are multiple factors that could be at play. The joint model could be missing some critical structure but flexible enough to contort in weird ways; simpler models are sometimes rigid enough that they yield nice likelihood functions even when they’re misfitting the data. Another possibility is that the survival observations just aren’t enough to inform the latent dynamics, which is a common problem. There are just too many different dynamical evolutions that can yield the exact same survival profile, especially when survival is measured only at a few times.