There is value in setting it to some multiple, especially early in adaptation during the first few adjustments of the mass matrix, since the updated mass matrix often admits a higher optimal step size, and starting from a multiple allows that step size to be reached more quickly.
It’s quite possible that exposing this as a tunable parameter would solve much of the problem. Part of the reason I end up with step sizes so low that they take forever to recover is that the step size is set too high at the beginning of adaptation, and the momentum in the optimizer then sends it far too low.
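For concreteness, here is a minimal sketch of what I have in mind, assuming "it" is the step size carried into the next adaptation window; the function name and the default multiplier are hypothetical illustrations, not Stan's actual interface:

```python
def restart_stepsize(eps_adapted: float, multiplier: float = 2.0) -> float:
    """Hypothetical hook run when a mass-matrix adaptation window closes.

    Restarting the dual-averaging search from multiplier * eps_adapted,
    rather than from eps_adapted itself, lets the sampler climb toward
    the (often larger) optimal step size under the new metric faster.
    """
    return multiplier * eps_adapted
```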
However, the other part of the problem happens in the initialization phase.
There are certainly pathologies in the far tails of the model. There is an order-of-magnitude difference in log-probability between the start of warmup and where the model settles by the time sampling starts. There are plenty of divergences during warmup, but they are gone by sampling. My understanding, from comments on the forum and from the fact that Stan does not report divergences during warmup, is that divergences during warmup do not pose a significant problem for model fitting. In looking into modifications to the warmup algorithm, I’m trying to see whether these pathologies can be prevented from making the algorithm take a very long time to reach the typical set.
To illustrate, here are some example plots of the sampling parameters.
This is a well-behaved model:
It took over a week to run, so I haven’t retested it with the new warmup configuration. Note how long it takes the step size in chain 3 to recover after it initially drops, and the “sawtooth” pattern that keeps pushing it lower after the initial drop.
Here is an example of an especially problematic model:
Notice how the step size for chain 3 keeps decreasing until the first 10x multiplier. It still requires many more leapfrog steps than the other chains throughout warmup and sampling. Dots are divergent transitions. So the model still has some issues, but having the warmup take so long really gets in the way of running the model to look at the diagnostics.
Here is the same model with the symmetric step size adjustment:
In this case, chain 4 drops down but recovers much more quickly. Chain 3 still seems to require more leapfrog steps than the others, but overall the model runs much faster than with the default step size adjustment algorithm.
The change I proposed is just swapping the objective function of the optimization from minimizing \mathbb E[\,|\delta-\alpha|\,] to minimizing \mathbb E[\,|\min(\delta-\alpha,\ 1-\delta)|\,].
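To make that concrete, here is a minimal sketch of dual-averaging step size adaptation in the style of Hoffman and Gelman (2014, Algorithm 5), with the capped statistic behind a flag. The function name and the `symmetric` flag are mine for illustration, not Stan's API, and the defaults mirror the paper's \gamma, t_0, \kappa:

```python
import numpy as np

def adapt_stepsize(accept_stats, delta=0.8, eps0=1.0,
                   gamma=0.05, t0=10.0, kappa=0.75,
                   symmetric=False):
    """Dual-averaging step size adaptation (Hoffman & Gelman 2014, Alg. 5).

    With symmetric=True, the per-iteration statistic delta - alpha is
    capped at 1 - delta, so a very low acceptance can pull the step size
    down by at most as much as an acceptance of 1 can pull it up --
    i.e. the capped objective E[|min(delta - alpha, 1 - delta)|].
    """
    mu = np.log(10.0 * eps0)           # shrinkage target: 10x initial step size
    h_bar, log_eps_bar = 0.0, 0.0
    for t, alpha in enumerate(accept_stats, start=1):
        h = delta - alpha
        if symmetric:
            h = min(h, 1.0 - delta)    # cap the "acceptance too low" signal
        eta = 1.0 / (t + t0)
        h_bar = (1.0 - eta) * h_bar + eta * h
        log_eps = mu - (np.sqrt(t) / gamma) * h_bar
        w = t ** (-kappa)
        log_eps_bar = w * log_eps + (1.0 - w) * log_eps_bar
    return float(np.exp(log_eps_bar))  # step size carried into sampling
```

Note that the cap only binds when \delta-\alpha > 1-\delta, i.e. when \alpha < 2\delta - 1; with \delta = 0.8 it kicks in only once acceptance drops below 0.6, which is exactly the kind of crash that produces the long recoveries shown above.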
Whether this is better in general, I don’t yet know. However, it produced very compelling results on my model, and I thought it was worth sharing. I’ve made a lot of tweaks to the model to eliminate pathologies, and tried various combinations of the three algorithm parameters as well as scaling of the warmup windows, and this has by far been the single greatest improvement to the model fitting.
Interestingly, Andrieu and Thoms (2008), referenced in the Hoffman paper, also suggest using H_t = \mathbb I_{\delta-\alpha_t > 0} - \mathbb I_{\delta-\alpha_t \leq 0}, which I believe would be equivalent to minimizing \mathbb E[\,|\delta-\alpha|\,]. This would require setting \gamma to a higher value to keep the updates on the same scale. I’ll look into this a bit more. When checking how well the step size was working, I had been looking at the average acceptance statistic during sampling; it was usually much higher than \delta, but since the step size optimizer is minimizing |\delta-\alpha|, I guess that is to be expected whenever \delta > 0.5.
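As a sketch of how that variant would slot into the recursion above: \operatorname{sign}(\delta-\alpha) is a subgradient of |\delta-\alpha|, so dual averaging on the sign statistic stochastically minimizes \mathbb E[\,|\delta-\alpha|\,]; and because its magnitude is always 1 rather than at most \max(\delta, 1-\delta), \gamma would need to grow to keep the per-iteration moves comparable. The function name here is mine:

```python
def sign_statistic(alpha_t, delta=0.8):
    # Andrieu & Thoms' sign-based statistic: a subgradient of
    # |delta - alpha_t|, so feeding it to the dual-averaging recursion
    # above (in place of delta - alpha_t) stochastically minimizes
    # E[|delta - alpha|]. Its magnitude is always 1, hence the need
    # for a larger gamma to keep the update scale similar.
    return 1.0 if delta - alpha_t > 0 else -1.0
```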
Let me be clear: I am not advocating changing Stan based on a test of one model. I merely noticed a remarkable improvement in my model and thought it was worth sharing and looking into further. Even if it does turn out that another objective function is better across a wide variety of simple and pathological models, I would still advocate including it as another option rather than switching the default. The existing code has been in Stan for so long, and so many models depend on it, that changing the defaults in such a core part of the code would certainly break someone’s models somewhere when they upgraded.