So, we have a model that takes a long time to run.
- A single gradient evaluation takes about 0.2 seconds (*)
- We currently run about 250 iterations each for warmup and sampling (500 in total)
- The adaptation typically “goes well” in the sense that multiple chains converge on the same step size and inverse metric
- We don’t encounter divergences
- Our chains mix well (Rhats and Neff are both good)
- We have a very large number of parameters (~35,000) (**)
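However, we often have runs that saturate the treedepth, which leads to very long runtimes: roughly 29 hours in the worst case (0.2 s per gradient * 1023 leapfrog steps * 500 iterations / 60 / 60 ≈ 28.4 h, where 1023 = 2^10 - 1 is the number of leapfrog steps at treedepth 10).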
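To make that estimate easy to redo for other settings, here is a minimal sketch of the same arithmetic. The function name `worst_case_hours` is made up for illustration. It assumes every iteration saturates the tree and that gradient evaluations dominate the cost, so it is an upper bound rather than a prediction.

```python
def worst_case_hours(grad_seconds, max_treedepth, iterations):
    """Upper bound on wall time if every iteration saturates the tree depth."""
    # Assumption: a saturated trajectory of depth d costs 2**d - 1 leapfrog
    # steps, and each leapfrog step costs one gradient evaluation.
    leapfrog_steps = 2 ** max_treedepth - 1
    return grad_seconds * leapfrog_steps * iterations / 3600.0


# The numbers from the post: 0.2 s per gradient, 500 iterations in total.
for depth in (8, 9, 10):
    print(depth, round(worst_case_hours(0.2, depth, 500), 1))
```

Each extra level of treedepth roughly doubles the bound, so doubling the step size (at the same trajectory length) would be worth about one level.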
Here are some things that we have tried:
1. Doing a VB run (or a short sampling run) before sampling to rescale transformed parameters such that parameters have ~unit variance on the unconstrained scale
2. Doing a VB run before sampling, taking the posterior covariance matrix (which is ~diagonal) and supplying it as the `inv_metric` to the sampling call
3. Doing (1) then (2)
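For concreteness, the `inv_metric` step could look roughly like the sketch below. It is not our actual pipeline: the helper names and the toy draws are hypothetical, and it assumes the draws are already on the unconstrained scale (the metric lives there, while interfaces typically report draws on the constrained scale). The only fixed part is the JSON layout, a single `inv_metric` entry holding the diagonal, which is what CmdStan reads for a diagonal metric.

```python
import json
import statistics


def diag_inv_metric(draws):
    """Per-parameter sample variances (rows = draws, columns = parameters)."""
    columns = zip(*draws)  # transpose: one tuple of values per parameter
    return [statistics.variance(col) for col in columns]


def metric_json(variances):
    """Serialize the diagonal under the "inv_metric" key used by CmdStan."""
    return json.dumps({"inv_metric": list(variances)})


# Toy stand-in for unconstrained-scale draws of two parameters (hypothetical).
toy_draws = [[0.0, 10.0], [1.0, 12.0], [2.0, 14.0]]
inv_metric = diag_inv_metric(toy_draws)
print(metric_json(inv_metric))
```

How the resulting file or vector is handed to the sampler depends on the interface in use, so that part is left out here.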
We have not touched the `step_size` or `adapt_delta` arguments.
Despite doing (1) and/or (2), we often end up with small step sizes (~0.005) and the resulting treedepth saturation. Because of the lack of divergences and good mixing (as well as priors and posteriors that match our expectations), I am not concerned about the validity of the model. But the 29h thing is a bit … tough, especially since this model is both under active development and something we use in production. Yay.
So, the question is: what can we do to get larger step sizes to avoid saturating the treedepth?
(*) We have done a ton to try to reduce this, including attempts to use `map_rect` and `reduce_sum`, but these degraded performance since the CPU was spending more time shuttling data around than doing computations. Until Stan gets a conv1d function that handles autodiff through a convolution more efficiently, or an FFT function, we’re limited here.
(**) A very large number of these have relatively minimal impact on the model (many are std_normal variables getting multiplied by a Cholesky decomposition of a covariance matrix).
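In case the construction in (**) is unclear, here is a small sketch of it outside Stan. The 2x2 covariance matrix and the helper names are made up for illustration; the real model does this inside Stan on much larger matrices. The sampler only ever sees the unit-scale vector `z`, and the correlated quantity is recovered as `L * z`.

```python
import math


def cholesky(sigma):
    """Lower-triangular L with L * L^T == sigma (sigma symmetric positive definite)."""
    n = len(sigma)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(sigma[i][i] - s)
            else:
                L[i][j] = (sigma[i][j] - s) / L[j][j]
    return L


def matvec(L, z):
    """Multiply matrix L by vector z."""
    return [sum(a * b for a, b in zip(row, z)) for row in L]


sigma = [[4.0, 2.0], [2.0, 3.0]]  # hypothetical covariance matrix
L = cholesky(sigma)
z = [1.0, 0.0]  # stand-in for a std_normal draw
x = matvec(L, z)  # correlated quantity with covariance sigma
```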