I’m trying to get my head around the relationship between stepsize, treedepth & efficiency for a tough nonlinear model - tens of thousands of parameters with some strong correlations and only weakly identified for some parameters when trying to develop the model on smaller datasets.
It currently samples happily without any divergence, R-hat, treedepth, or ESS warnings, but it religiously takes a treedepth of 7-8 and scales to larger datasets relatively poorly. I think I’ve already examined all relevant plots of location vs log(scale) for funnels on both small-n and large-n datasets, and many other pairs plots for degeneracies - I’m reasonably happy, though it’s certainly a tough model with strong correlations in some places, and I may well have missed some 3D funnels.
I’m wondering if I can speed it up by rescaling parameters to increase the ~0.01 adapted stepsize and potentially reduce the necessary treedepth?
If it makes any sense to try to increase stepsize to improve sampling efficiency, how might I better identify which parameters to target, or which 3D combinations of variables are causing degeneracies?
- I’ve been looking at the diagonals of the mass matrix, but I’m unsure how they relate to variables - there are 3x more parameters in the posterior than entries in the mass matrix.
- I’ve looked at the posterior standard deviations & there are definitely some candidates for rescaling, but should the target be unit scale on the constrained, or unconstrained scales? i.e. should a real<lower=0,upper=1> x be rescaled to x ~ N(0,1) or logit(x) ~ N(0,1)?
- Any other suggestions of where to look? I’m thinking trawling through some trace plots for apparent correlations may be my next step.
The general goal is to set the step size as high as possible but stop short of introducing divergences (where the step size is so large the first-order leapfrog algorithm diverges from maintaining a constant Hamiltonian), because those typically introduce bias into posterior expectations. Given the step size, you want to run the tree depth deep enough to get to U-turns. Artificially truncating it usually leads to more diffusive sampling (faster iterations, but overall lower efficiency on an ESS/second basis).
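To make the stability point concrete, here’s a minimal illustrative sketch (not Stan’s actual implementation): a leapfrog integrator on a standard normal target, where the stability threshold is a step size of 2. Below that threshold the energy error stays small and bounded; past it, the trajectory diverges and the error explodes, which is exactly what a divergence warning reports.

```python
import math

def leapfrog(q, p, eps, n_steps, grad):
    # leapfrog integrator for H(q, p) = -log p(q) + p^2 / 2
    for _ in range(n_steps):
        p -= 0.5 * eps * grad(q)
        q += eps * p
        p -= 0.5 * eps * grad(q)
    return q, p

# standard normal target: -log p(q) = q^2 / 2, so the gradient is q
grad = lambda q: q
H = lambda q, p: 0.5 * q * q + 0.5 * p * p

q0, p0 = 1.0, 1.0
errors = {}
for eps in (0.1, 1.0, 2.1):  # stability threshold is eps = 2 for this target
    q, p = leapfrog(q0, p0, eps, 50, grad)
    errors[eps] = abs(H(q, p) - H(q0, p0))

# small eps: tiny, bounded energy error; eps past the threshold: the
# energy error grows exponentially with the number of steps
print(errors)
```

The same trade-off drives tree depth: a smaller step size keeps the Hamiltonian nearly constant but needs more leapfrog steps (a deeper tree) to reach the U-turn point.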
You can try to identify parameters that have a low estimated ESS. Those are usually hierarchical parameters that are correlated with many lower-level parameters, or parameters involved in multiplicative non-identifiabilities (two parameters whose product conditions another variable).
The mass matrix is N x N where N is the number of unconstrained parameters. Sampling happens in unconstrained space and that’s where you want to inspect the geometry.
Ideally the posterior will be standard normal (each parameter is independently normal(0, 1)).
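Since sampling happens on the unconstrained scale, the transforms matter when judging whether something is “unit scale.” A sketch of the two transforms relevant to the `real<lower=0,upper=1>` question above (these match the constraint transforms documented in the Stan reference manual; the function names here are just illustrative):

```python
import math

# Stan samples on the unconstrained scale: a lower bound maps through
# log, an interval through logit
def unconstrain_lower(x, lb=0.0):
    # real<lower=lb> x  ->  y = log(x - lb)
    return math.log(x - lb)

def unconstrain_interval(x, lb=0.0, ub=1.0):
    # real<lower=lb, upper=ub> x  ->  y = logit((x - lb) / (ub - lb))
    u = (x - lb) / (ub - lb)
    return math.log(u / (1.0 - u))

# a real<lower=0, upper=1> parameter near its boundary sits far out in
# the tails on the unconstrained scale -- the geometry the sampler sees
print(unconstrain_interval(0.5))    # 0.0
print(unconstrain_interval(0.999))  # ~6.9
```

So the target is logit(x) approximately standard normal, not x itself.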
You can run plain HMC rather than NUTS with different step sizes to see if you can get away with larger ones - Stan tends to be conservative in its step size estimates to prevent divergences. Traceplots are very hard to interpret by eye. Better is to do something like compute the condition number of the inverse Hessian of the log density at each point in the posterior to probe the local convexity of the negative log density. In some cases it won’t be positive definite (when there’s non-convexity like saddle points). Those regions are hard to sample and would ideally be reparameterized. Tighter priors can also help sampling in hard problems.
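A minimal sketch of that diagnostic, assuming a hypothetical stand-in target (a strongly correlated 2-D Gaussian rather than the real model) and a finite-difference Hessian in place of autodiff:

```python
import numpy as np

# hypothetical stand-in for the model's negative log density: a
# strongly correlated 2-D Gaussian (rho = 0.95)
rho = 0.95
prec = np.linalg.inv(np.array([[1.0, rho], [rho, 1.0]]))

def neg_log_density(q):
    return 0.5 * q @ prec @ q

def hessian_fd(f, q, h=1e-4):
    # central finite-difference Hessian of f at q
    n = q.size
    H = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            qpp = q.copy(); qpp[i] += h; qpp[j] += h
            qpm = q.copy(); qpm[i] += h; qpm[j] -= h
            qmp = q.copy(); qmp[i] -= h; qmp[j] += h
            qmm = q.copy(); qmm[i] -= h; qmm[j] -= h
            H[i, j] = (f(qpp) - f(qpm) - f(qmp) + f(qmm)) / (4 * h * h)
    return H

# evaluate at a posterior draw (any point works for a Gaussian target)
H = hessian_fd(neg_log_density, np.array([0.3, -0.2]))
eigs = np.linalg.eigvalsh(H)
print(eigs.min() > 0)            # positive definite: locally log concave
print(eigs.max() / eigs.min())   # condition number ~39 for rho = 0.95
```

A negative smallest eigenvalue at some draws would flag the saddle-point-like non-convexity mentioned above; a large condition number flags the strong correlations.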
Thanks for the quick response @Bob_Carpenter.
I worked out some of #1 after realising that generated quantities can’t be in the mass matrix - I’m still working on the ~300 extra entries which I can’t seem to grep from the parameter names. I’ll get there.
To clarify on #2 - are you suggesting the posterior is ideally N(0,1), or that the distribution on the unconstrained parameter space would ideally be N(0,1)? It’d be a very big difference in scales for some variables, e.g. scale parameters.
#3 Thanks for the suggestions. I’d bet I can’t get away with larger ones without rescaling, but I’ll definitely try probing log convexity if I can work it out! Tighter priors are definitely on my to-do list in some areas, but softly-softly catchee monkey.
Just a few comments.
The ideal step size isn’t the largest possible step size that still yields stable numerical trajectories. The ideal step size is the one that yields the most effective exploration, and this will be smaller than the stability threshold for nice problems. In particular, the more accurate numerical trajectories that result from smaller step sizes allow for a higher probability of transitioning to points far away from the initial point, which can result in stronger performance. Because the transitions are probabilistic, however, there are diminishing returns on decreasing the step size. At some point the step size becomes small enough that increasing the accuracy of the numerical trajectories doesn’t appreciably improve the ability to jump to distant states, and the extra cost of the more refined numerical trajectory is wasted. Fortunately all of these subtle trade-offs are (approximately) accounted for in the step size adaptation.
Unstable numerical trajectories, however, complicate the relationship between accuracy and computational performance. Stan’s adaptation assumes that the numerical trajectories are stable, in which case the relationship can be quantified. Divergences near the naive optimum indicate that this relationship doesn’t hold, in which case the adaptation tries to get away with the largest step size that doesn’t result in divergences (although it doesn’t always do a good job due to the discontinuity in the adaptation objective function).
Most implementations of Hamiltonian Monte Carlo, including Stan’s, perform best when the posterior density function is a product of unit normals. In this case the optimal step size will depend on the dimension of the posterior density function, but once that optimal step size is found the performance will be hard to beat for any other posterior with the same dimension.
When the posterior is a product of normals with separate scales then the step size may have to drop for the numerical trajectories to resolve all of the scales at the same time. The adapted inverse metric in Stan’s implementation of Hamiltonian Monte Carlo (by default a diagonal metric with one element for each dimension of the unconstrained model configuration space), however, can compensate for those separate scales. If you see strong heterogeneity in the inverse metric elements then Stan’s adaptation has done its job! Rescaling the parameters at that point can make the adaptation less expensive, but it shouldn’t change the post-adaptation performance much so long as the adaptation achieved a nearly optimal inverse metric.
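A small sketch of why the diagonal metric absorbs scale differences, using a hypothetical posterior of independent normals with wildly different scales:

```python
import numpy as np

# hypothetical posterior: independent normals with very different scales
scales = np.array([0.01, 1.0, 100.0])

# with a unit metric the stable step size is governed by the smallest
# scale (roughly eps < 2 * sigma_min for a Gaussian target), so the
# integrator crawls relative to the widest dimension
eps_unit_metric = 2.0 * scales.min()

# a diagonal inverse metric adapted to the posterior variances acts like
# rescaling each coordinate to unit scale, so one step size fits all
inv_metric = scales ** 2
effective_scales = scales / np.sqrt(inv_metric)
print(eps_unit_metric)    # 0.02
print(effective_scales)   # [1. 1. 1.]
```

This is why heterogeneous inverse metric elements are a sign of success rather than a problem, and why manual rescaling mostly just saves adaptation cost.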
More problematic are couplings between the parameters which can’t be scaled away. The stronger the coupling, the more elongated the posterior distribution will appear; to explore this elongated geometry with Hamiltonian Monte Carlo we need smaller step sizes (to resolve the width of the posterior) and longer numerical trajectories (to fully explore the breadth of the posterior). Funnels are a common source of this coupling, but they are in no way the only potential problem, especially when integrating over tens of thousands of parameters.
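To see why coupling can’t be scaled away, consider a hypothetical 2-D Gaussian whose marginal scales are already 1 but whose correlation is strong; a diagonal metric has nothing left to fix, yet the geometry stays elongated:

```python
import numpy as np

# hypothetical 2-D Gaussian with unit marginal scales but strong coupling
rho = 0.99
cov = np.array([[1.0, rho], [rho, 1.0]])

# diagonal adaptation can only match the marginal standard deviations
# (already 1 here); rescaling by them leaves the correlation untouched
d = np.sqrt(np.diag(cov))
precond_cov = cov / np.outer(d, d)

eigs = np.linalg.eigvalsh(precond_cov)
print(eigs)                      # [0.01, 1.99]: still elongated
print(eigs.max() / eigs.min())   # condition number 199 survives rescaling
```

Only a reparameterization (or a dense metric) that breaks the coupling itself can remove this elongation.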
As I discuss in Identity Crisis, pairs plots can be poorly informative for such high-dimensional models. The key to using them effectively is to work through the structure of the model, identify possible sources of posterior coupling between parameters or even between functions of parameters, and then use that to motivate particular pairs to analyze. Unfortunately it’s hard to say more without seeing the model!