Levers to pull to increase stepsize so as to avoid saturating the treedepth

So, we have a model that takes a long time to run.

  • The gradient evaluation is about 0.2 seconds (*)
  • We are currently running about 250 iterations each for warmup and sampling
  • The adaptation typically “goes well” in the sense that multiple chains converge on the same step size and inverse metric
  • We don’t encounter divergences
  • Our chains mix well (Rhats and Neff are both good)
  • We have a very large number of parameters (~35,000) (**)

However, we often have runs that saturate the treedepth, which leads to very long runtimes: about 29 hours (0.2 s per gradient evaluation × 1023 leapfrog steps at max treedepth 10 × 500 total iterations / 3600 ≈ 28.4 hours, before any overhead).

Here are some things that we have tried:

  1. Doing a VB run (or a short sampling run) before sampling to rescale the transformed parameters so that they have ~unit variance on the unconstrained scale
  2. Doing a VB run before sampling, taking the (approximately diagonal) posterior covariance matrix, and supplying it as the inv_metric to the sampling call (see the sketch after this list)
  3. Doing (1) then (2)
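
For concreteness, a minimal cmdstanpy sketch of (2) is below. It isn't our exact production code: the file names are placeholders, the CmdStanVB attribute names may differ slightly between cmdstanpy versions, and the ADVI variances only make sense as a diagonal inv_metric to the extent that the parameters are already (approximately) on the unconstrained scale.

```python
import numpy as np
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="model.stan")   # placeholder file name
data = "data.json"                             # placeholder data file

# 1. Cheap approximate fit with ADVI.
vb = model.variational(data=data)

# 2. Per-parameter variances from the ADVI draws (dropping lp__, log_p__, log_g__).
draws = np.asarray(vb.variational_sample)
keep = [i for i, name in enumerate(vb.column_names) if not name.endswith("__")]
inv_metric = draws[:, keep].var(axis=0)

# 3. Full HMC run, seeding adaptation with the diagonal inverse metric.
fit = model.sample(
    data=data,
    chains=4,
    iter_warmup=250,
    iter_sampling=250,
    metric={"inv_metric": inv_metric.tolist()},
)
```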

We have not touched the step_size or adapt_delta arguments.

Despite doing (1) and/or (2), we often end up with small step sizes (~0.005) and the resulting treedepth saturation. Given the lack of divergences and the good mixing (as well as priors and posteriors that match our expectations), I am not concerned about the validity of the model. But the ~29-hour runtime is a bit … tough, especially since this model is both under active development and used in production. Yay.

So, the question is: what can we do to get larger step sizes to avoid saturating the treedepth?

(*) We have done a lot to try to reduce this, including attempts to use map_rect and reduce_sum, but those led to performance degradations because the CPU spent more time shuttling data around than doing computation. Until Stan gets a conv1d function that handles autodiff through a convolution more efficiently, or an FFT function, we’re limited here.

(**) But a very large number of these parameters have relatively minimal impact on the model (many are std_normal variables that get multiplied by a Cholesky factor of a covariance matrix).

Hi Thomas, I believe that decreasing adapt_delta (the target acceptance statistic, 0.8 by default) will give you a larger adapted step size.
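
A sketch of what that could look like via cmdstanpy (file names are placeholders; anything below the 0.8 default should push adaptation toward a larger step size, with the trade-off of a higher chance of divergences, so it's worth re-checking the diagnostics afterwards):

```python
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="model.stan")   # placeholder file name
fit = model.sample(
    data="data.json",                          # placeholder data file
    chains=4,
    iter_warmup=250,
    iter_sampling=250,
    adapt_delta=0.7,     # below the 0.8 default => adaptation targets a larger step size
)
print(fit.diagnose())    # re-check for divergences after loosening adapt_delta
```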

As for the model speed: if you’re using large vectorised operations, the 2.26 release included a greatly expanded list of functions with support for OpenCL acceleration (list in this post). It also included a profiling framework, which could be helpful for identifying bottlenecks and testing reparameterisations.
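
If it's useful, enabling OpenCL from cmdstanpy is just a compile-time flag; a rough sketch (file names are placeholders, and it assumes a working OpenCL runtime plus a model whose heavy functions are on the GPU-supported list):

```python
from cmdstanpy import CmdStanModel

# Recompile the model with OpenCL support so that the GPU-enabled
# functions from the 2.26 release are dispatched to the device.
model = CmdStanModel(
    stan_file="model.stan",                  # placeholder file name
    cpp_options={"STAN_OPENCL": "TRUE"},     # compile-time flag for OpenCL support
)
fit = model.sample(data="data.json", chains=4)
```

The profiling side lives in the Stan program itself: you wrap the blocks of interest in profile("...") sections and the per-section timings get written out to a separate profile file.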


Small step sizes and large tree depths can be a sign of a degenerate posterior density function (see the “Identity Crisis” case study), which suggests that at least some of your parameters are strongly coupled. This can sometimes be resolved with stronger priors, reparameterizations, or even more data, but the best path forward will depend on the nature of the degeneracy itself.
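
If you want a quick first look at whether there is obvious coupling, one cheap check is to scan the posterior correlations of a manageable subset of parameters; a rough sketch below (the parameter-name prefix and file names are placeholders, and linear correlations only catch part of the story, since funnel-like degeneracies are nonlinear):

```python
import numpy as np
from cmdstanpy import CmdStanModel

# fit is the CmdStanMCMC object from an ordinary sampling run (placeholder file names).
fit = CmdStanModel(stan_file="model.stan").sample(data="data.json")

# Scan the posterior correlation matrix of a manageable subset of parameters
# for strongly coupled pairs (all ~35,000 at once would be a 35k x 35k matrix).
draws = fit.draws_pd()
subset = [c for c in draws.columns if c.startswith("beta")]  # hypothetical name prefix
corr = draws[subset].corr().to_numpy()

rows, cols = np.triu_indices_from(corr, k=1)
strong = np.abs(corr[rows, cols]) > 0.9
names = np.array(subset)
for a, b, r in zip(names[rows][strong], names[cols][strong], corr[rows, cols][strong]):
    print(f"{a} ~ {b}: posterior correlation {r:.2f}")
```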
