Issue with dual averaging

Thanks. I had no idea that integrating past the stopping criterion could yield higher ESS per iteration.

Thanks everyone, I did read the original NUTS paper in the end :)

Superefficiency tends to hold only for the ESS for E[\theta], but not, e.g., for the ESS for E[\theta^2] or for a 95% quantile. With deterministic approaches you could get better efficiency than Monte Carlo for many E[g(\theta)], but if we stay within dynamic HMC, when you have superefficiency for E[\theta] it's unlikely that ESS/s for E[\theta^2] is as good, and using shorter trajectories, which decreases the ESS for E[\theta], can at the same time improve ESS/s for E[\theta^2]. I guess that, in a similar way as the NUTS criterion was derived by optimizing the expected distance between draws of \theta, you could derive a criterion that also optimizes the expected distance between draws of \theta^2, which would lead to a better balance between the ESSs for E[\theta] and E[\theta^2]. (The highest ESS for E[\theta] I've observed with Stan's dynamic HMC in the case of a multivariate normal is 3 times the nominal sample size, and I've seen reports of even higher values.)
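For concreteness, here is a minimal sketch (using ArviZ; the draws array is just a stand-in for real sampler output) of how you can compare the ESS that governs the error of E[\theta] with the ESS that governs the error of E[\theta^2]:

```python
# Minimal sketch: ESS relevant to E[theta] vs ESS relevant to E[theta^2].
# `draws` is a placeholder (chains, iterations) array; in practice it would be
# the theta draws from a dynamic HMC run.
import numpy as np
import arviz as az

rng = np.random.default_rng(1)
draws = rng.normal(size=(4, 1000))

# "mean" ESS is the one relevant for estimating a posterior expectation.
ess_theta = float(az.ess(draws, method="mean")["x"])         # for E[theta]
ess_theta_sq = float(az.ess(draws ** 2, method="mean")["x"])  # for E[theta^2]

print(f"ESS for E[theta]:   {ess_theta:.0f}")
print(f"ESS for E[theta^2]: {ess_theta_sq:.0f}")
```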


Keep in mind that in general there is not a single optimal integration time, but rather the optimal integration time can vary from point to point. Consequently, while it may look like the optimal number of leapfrog steps is just over 2^{7}, it could also be that some of the trajectories actually have a smaller optimal number of steps and some have a larger one. In this case truncating the trajectory length at 2^{7} steps selectively reduces the efficacy of just those longer trajectories; this can not only reduce overall performance but also invalidate the MCMC error quantification on which we rely, as in the heavy-tailed cases that I mentioned above.

But let’s say that your target distribution is pretty normal (as in similar to a normal density function), in which case a single integration time will yield decent performance across all trajectories. Now the question is whether that optimal number of leapfrog steps really is just above 2^{7} or whether it in fact extends much further.

As @nhuurre noted, the effective sample size could increase significantly as the trajectories are drawn out past 2^{7} steps (with the step size fixed). In that case the additional cost would be worth the increased performance, at least in terms of efficiency. If you didn’t need a larger effective sample size then you could get away with shorter, and hence less expensive, Markov chains.
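One way to judge whether the longer trajectories pay for themselves is to normalize by cost, i.e. look at ESS per leapfrog step rather than ESS per iteration. A rough sketch with CmdStanPy (the model, data, and parameter names are placeholders):

```python
# Sketch: cost-normalized efficiency (ESS per leapfrog step) from a CmdStanPy fit.
# "model.stan", "data.json", and the parameter name "theta" are placeholders.
from cmdstanpy import CmdStanModel
import arviz as az

model = CmdStanModel(stan_file="model.stan")
fit = model.sample(data="data.json", seed=1)

# ESS for the parameter of interest (bulk ESS via ArviZ).
idata = az.from_cmdstanpy(posterior=fit)
ess = float(az.ess(idata, var_names=["theta"])["theta"])

# Total number of gradient evaluations actually spent by the sampler.
n_leapfrog_total = fit.draws_pd()["n_leapfrog__"].sum()

print(f"ESS = {ess:.0f}, total leapfrog steps = {n_leapfrog_total:.0f}")
print(f"ESS per leapfrog step = {ess / n_leapfrog_total:.5f}")
```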

But it could be that the optimal integration time actually is right past the 2^{7} threshold, in which case you would be suffering from the limitations of the particular multiplicative expansion used by the dynamic Hamiltonian Monte Carlo sampler. In this case one could avoid the dynamic integration times altogether and just run the static Hamiltonian Monte Carlo sampler with num_leapfrog = 128.
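If you want to try that, note that CmdStan’s static HMC engine is parameterized by the total integration time rather than a step count, so 128 leapfrog steps at step size \epsilon corresponds to int_time = 128 * \epsilon. A rough sketch of such a call (the executable, data file, step size, and the use of subprocess are just for illustration):

```python
# Sketch: static HMC via the CmdStan command line, called from Python.
# The executable name, data file, and step size are placeholders; int_time is
# chosen so that int_time / stepsize = 128 leapfrog steps.
import subprocess

stepsize = 0.05                # e.g. the step size adapted in the dynamic run
int_time = 128 * stepsize      # total integration time for 128 leapfrog steps

subprocess.run([
    "./model", "sample",
    "adapt", "engaged=0",      # keep the step size fixed rather than re-adapting
    "algorithm=hmc", "engine=static", f"int_time={int_time}",
    f"stepsize={stepsize}",
    "data", "file=data.json",
    "output", "file=static_hmc.csv",
], check=True)
```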

To investigate further you can look not only at the treedepth but also at the total number of leapfrog steps, which is also stored. In particular, the correlations of these values with the parameters can give you a sense of how dynamic the integration times might be and how amenable the problem might be to using the static sampler to target a single, precise integration time.
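As a concrete starting point, here is a sketch of pulling out treedepth__ and n_leapfrog__ and checking how they relate to the parameter draws (reusing the CmdStanPy `fit` from the sketch above; the interpretation comments are only a heuristic):

```python
# Sketch: inspect treedepth__ / n_leapfrog__ and their correlation with parameters.
# `fit` is the CmdStanMCMC object from the earlier sketch.
draws = fit.draws_pd()          # sampler diagnostics plus parameter draws

print(draws["treedepth__"].value_counts().sort_index())  # distribution of tree depths
print(draws["n_leapfrog__"].describe())                  # spread of leapfrog steps per iteration

# If n_leapfrog__ varies a lot and correlates with where the chain is in parameter
# space, the optimal integration time is probably position-dependent, and a single
# static integration time is less likely to work well.
param_cols = [c for c in draws.columns if not c.endswith("__")]
print(draws[param_cols + ["n_leapfrog__"]].corr()["n_leapfrog__"].sort_values())
```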
