I was debugging some models that seemed to require a fairly high treedepth and came across two slightly annoying features of the dual averaging algorithm. I’ve included the relevant formulas at the end for reference.

During the adaptation phase, after each adaptation window the starting point for the step size is set to 10 times the average step size (\bar x) from the previous window. I see how this makes sense: after the mass matrix is updated, it is reasonable to expect that a slightly larger step size would be appropriate. However, it causes some weird behavior. This step size is often way too large and results in an acceptance probability \alpha near zero. This causes a large increase in H and a corresponding drop in x. In my models, it takes three or four steps for the step size to come back down to a reasonable value with \alpha \approx \delta, and for H to begin decreasing. However, even once H is decreasing, the step size continues to fall, because the weight on the H term grows like \sqrt{t}/\gamma relative to \mu, meaning it takes a while for the step size to stabilize.
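To make the restart behavior concrete, here is a toy simulation of the dual averaging recursion after a window restart. It is only a sketch under stated assumptions: the acceptance curve `alpha(eps) = delta ** ((eps / eps_prev) ** 2)` is made up so that the previous step size `eps_prev` gives exactly \alpha = \delta, the function name `simulate_restart` and its arguments are hypothetical, and I assume the first step size of the new window equals \exp(\mu) with \mu set to the log of 10 times the previous step size.

```python
import math

def simulate_restart(eps_prev=0.1, delta=0.8, gamma=0.05, t0=10.0,
                     multiplier=10.0, n_iter=30):
    """Toy dual-averaging run after a window restart.

    Assumes a made-up acceptance curve
        alpha(eps) = delta ** ((eps / eps_prev) ** 2),
    so that eps_prev gives exactly alpha = delta.
    Returns a list of (t, eps, alpha, h_bar) tuples.
    """
    mu = math.log(multiplier * eps_prev)
    x = mu          # assumed starting point: multiplier * previous step size
    h_bar = 0.0     # running average of (delta - alpha)
    history = []
    for t in range(1, n_iter + 1):
        eps = math.exp(x)
        alpha = delta ** ((eps / eps_prev) ** 2)
        w = 1.0 / (t + t0)
        h_bar = (1.0 - w) * h_bar + w * (delta - alpha)
        history.append((t, eps, alpha, h_bar))
        # The weight on h_bar grows like sqrt(t) / gamma.
        x = mu - math.sqrt(t) / gamma * h_bar
    return history

for t, eps, alpha, h_bar in simulate_restart()[:8]:
    print(f"t={t:2d}  eps={eps:.4f}  alpha={alpha:.3f}  H={h_bar:.4f}")
```

In this toy setting the first iteration has \alpha close to zero, and the step size then drops below `eps_prev` before slowly recovering. Real models will behave differently, since the true acceptance curve is neither deterministic nor of this form.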

The other issue I am seeing is a “sawtooth”-like behavior in the step size when using large values for \delta. This is because the range of the updates to H_t is asymmetric: since \alpha \in [0, 1], each update lies in \left[\frac{\delta-1}{t+t_0},\frac{\delta}{t+t_0}\right]. After an iteration with a particularly low value of \alpha the step size can drop a lot, but it takes many iterations with \alpha near 1 to bring it back up again.
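A quick way to quantify the asymmetry is to count how many iterations with \alpha = 1 are needed to cancel a single iteration with \alpha = 0 in the running sum of (\delta - \alpha). The sketch below does that by brute force; the function name `recovery_iterations` and the tolerance `tol` are hypothetical and only there to avoid floating-point edge cases.

```python
def recovery_iterations(delta, tol=1e-12):
    """Count iterations with alpha = 1 needed to cancel one iteration
    with alpha = 0 in the running sum of (delta - alpha)."""
    total = delta - 0.0          # contribution of one alpha = 0 iteration
    n = 0
    while total > tol:
        total += delta - 1.0     # contribution of one alpha = 1 iteration
        n += 1
    return n

for delta in (0.8, 0.9, 0.95):
    print(delta, recovery_iterations(delta))
```

The count grows like \delta / (1 - \delta), so the imbalance gets worse quickly as \delta approaches 1.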

How was the multiplier of 10 between warmup windows chosen? Are there any models that this is particularly helpful for? Is there any way to make this a tunable parameter in Stan, like the other adaptation parameters?

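Is there a reason for the asymmetry in the updates to H? I was wondering whether the update could instead be based on \mbox{logit} (\delta) - \mbox{logit} (\alpha), or on \min(\delta-\alpha, 1-\delta), which caps the upward update so that its range becomes symmetric for \delta \ge 0.5.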
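For comparison, here is a small sketch of the three candidate error terms side by side. The function names are hypothetical, and the clipping constant `eps` in `logit` is my own assumption, added because \mbox{logit}(\alpha) is unbounded at \alpha = 0 and \alpha = 1.

```python
import math

def logit(p, eps=1e-12):
    """Logit with clipping (the clipping is an assumption, not part of the proposal)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def err_standard(alpha, delta):
    """Current error term: ranges over [delta - 1, delta]."""
    return delta - alpha

def err_logit(alpha, delta):
    """Logit-scale error: zero at alpha = delta, unbounded without clipping."""
    return logit(delta) - logit(alpha)

def err_capped(alpha, delta):
    """Capped error: ranges over [delta - 1, 1 - delta] when delta >= 0.5."""
    return min(delta - alpha, 1.0 - delta)

delta = 0.95
for alpha in (0.0, 0.5, 0.95, 1.0):
    print(alpha, err_standard(alpha, delta),
          err_logit(alpha, delta), err_capped(alpha, delta))
```

Note that capping the error term changes the fixed point of the adaptation, so the average \alpha it converges to would no longer be exactly \delta; whether that matters in practice is part of what I am asking.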

I wanted to post here and see if people thought it is a good idea or if it has been tried already before I start trying to make changes to the adaptation algorithm.

The relevant parts from Hoffman & Gelman 2014

x_{t+1}\leftarrow \mu - \frac{\sqrt{t}}{\gamma} \frac{1}{t+t_0}\sum_{i=1}^t (\delta-\alpha_i) = \mu - \frac{\sqrt{t}}{\gamma} H_t

H_t \leftarrow \left(1-\frac{1}{t+t_0}\right)H_{t-1}+\frac{1}{t+t_0}(\delta-\alpha_t), \quad H_0 = 0
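As a sanity check on the formulas, here is a minimal Python sketch of one update step, assuming H is the running average of (\delta - \alpha) with weight 1 - \frac{1}{t+t_0} on the previous value and H_0 = 0. The function name `dual_averaging_update` is hypothetical, and the defaults (\delta = 0.8, \gamma = 0.05, t_0 = 10) are what I understand Stan's defaults to be.

```python
import math

def dual_averaging_update(h_bar, t, alpha, mu, delta=0.8, gamma=0.05, t0=10.0):
    """One dual-averaging step; returns (new_h_bar, x_next).

    h_bar is the running average of (delta - alpha), x_next the next
    log step size (before any iterate averaging).
    """
    w = 1.0 / (t + t0)
    h_bar = (1.0 - w) * h_bar + w * (delta - alpha)
    x_next = mu - math.sqrt(t) / gamma * h_bar
    return h_bar, x_next

# Check that the recursion matches the summation form of the update.
alphas = [0.0, 0.3, 0.95, 0.9, 0.7]
h_bar, mu = 0.0, 0.0
for t, alpha in enumerate(alphas, start=1):
    h_bar, x_next = dual_averaging_update(h_bar, t, alpha, mu)
closed_form = sum(0.8 - a for a in alphas) / (len(alphas) + 10.0)
print(abs(h_bar - closed_form) < 1e-12)
```

With H_0 = 0 the recursive form and the summation form agree, which is why the two equations can be used interchangeably.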