I’m trying to better understand what adapt_delta does and, in particular, which acceptance probability it targets. Neither NUTS nor the dynamic HMC implemented in Stan has an accept/reject step.
One could still compute: (i) the probability of accepting the final point of the trajectory in a hypothetical accept/reject step, (ii) the probability of accepting the point actually sampled from the trajectory, or (iii) the average probability of accepting any point along the trajectory.
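To make the distinction concrete, here is a toy Python sketch of how each of the three candidates could be computed from a single numerical trajectory. The energy errors are synthetic, not output from a real sampler:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic energy errors dH[i] = H(z_0) - H(z_i) between the initial
# state z_0 and the i-th state of the trajectory; exact integration
# would give dH[i] == 0 everywhere. These are toy values only.
dH = rng.normal(scale=0.1, size=16)

# Metropolis probability of accepting a hypothetical proposal of z_i from z_0
accept_probs = np.minimum(1.0, np.exp(dH))

# (i) accept/reject applied only to the final point of the trajectory
p_final = accept_probs[-1]

# (ii) acceptance probability of the point actually sampled from the
# trajectory, drawn here with Boltzmann weights proportional to exp(-H(z_i))
weights = np.exp(dH)
i_sampled = rng.choice(len(dH), p=weights / weights.sum())
p_sampled = accept_probs[i_sampled]

# (iii) uniform average over every point along the trajectory
p_average = accept_probs.mean()

print(p_final, p_sampled, p_average)
```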
With NUTS and dynamic HMC, we generate a multiplicative trajectory expansion, but my understanding is that all the points generated along the way should appear in the final trajectory.
From the NUTS paper (Hoffman & Gelman 2014),
[The targeted probability] is the average acceptance probability that HMC would give to the position-momentum states explored during the final doubling iteration.
Ok, this makes sense and it seems consistent with what is described in Section 5 of @betanalpha’s conceptual introduction to HMC. But I would love some confirmation that this is indeed what we do for dynamic HMC and what is currently implemented in Stan.
adapt_delta is the target acceptance statistic during warmup: it is the value the dual-averaging algorithm tries to hit. The difference between the historical acceptance statistic and this target (much like a loss function; it is an optimization process, after all) is used to adjust the step size. See the Stan source here; delta_ in the code corresponds to adapt_delta. So yes, it is used in NUTS.
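For reference, here is a simplified Python sketch of the dual-averaging update from Algorithm 5 of Hoffman & Gelman (2014); the function and variable names are mine, but delta plays exactly the role of adapt_delta:

```python
import numpy as np

def dual_averaging_stepsize(accept_stats, eps0, delta=0.8,
                            gamma=0.05, t0=10.0, kappa=0.75):
    """Sketch of dual averaging (Hoffman & Gelman 2014, Algorithm 5).

    accept_stats: sequence of per-iteration acceptance statistics alpha_t
    eps0: initial step size; delta: target acceptance statistic (adapt_delta)
    """
    mu = np.log(10.0 * eps0)        # shrinkage point for the log step size
    log_eps_bar, h_bar = 0.0, 0.0   # averaged iterate and averaged "error"
    for t, alpha in enumerate(accept_stats, start=1):
        # running average of the gap between target and observed acceptance
        eta = 1.0 / (t + t0)
        h_bar = (1.0 - eta) * h_bar + eta * (delta - alpha)
        # propose a new log step size, shrinking toward mu
        log_eps = mu - (np.sqrt(t) / gamma) * h_bar
        # Polyak-style averaging of the log step size iterates
        w = t ** (-kappa)
        log_eps_bar = w * log_eps + (1.0 - w) * log_eps_bar
    return np.exp(log_eps_bar)      # step size frozen after warmup
```

If the observed acceptance statistics run below delta, the step size is driven down, and vice versa, which is the loss-function-like behavior described above.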
Dynamic Hamiltonian Monte Carlo, including NUTS and the current implementation in Stan, is not a Metropolis-Hastings method and so there is no well-defined notion of “acceptance probability”. On the other hand, the original “Hybrid Monte Carlo” implementation of Hamiltonian Monte Carlo is a Metropolis-Hastings method, using the terminal state of a numerical trajectory as a Metropolis proposal that is then accepted or rejected.
The existing analyses of the optimal integrator step size for Hamiltonian Monte Carlo are limited to this Hybrid Monte Carlo setting because they use the number of Metropolis rejections as a measure of performance. Ultimately they derive bounds on the Metropolis acceptance probability that achieves optimal exploration per unit of computational effort, and under nice enough conditions (no divergences, etc.) the integrator step size can be tuned to achieve the optimal acceptance probability.
In order to apply these results to the non-Metropolis-Hastings methods one has to come up with a way to approximate the more sophisticated Hamiltonian transitions as Metropolis transitions. The heuristic used in the original NUTS paper, and still used in Stan, considers a numerical trajectory as a collection of shorter trajectories between the initial state and each intermediate state. A transition from the initial state to an intermediate state can then be very roughly approximated by a Metropolis transition using the corresponding trajectory segment. Naively averaging over all of these possible Metropolis transitions using uniform weights results in (iii).
The problem with this heuristic is that the intermediate states, and hence the corresponding proxy Metropolis transitions, are not sampled uniformly. Instead they are sampled relative to the local value of the Hamiltonian function. Weighting the proxy Metropolis proposals in the same way better aligns the heuristic with the actual behavior of the sampler, resulting in an immediate improvement in performance.
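To illustrate the difference, here is a minimal sketch under the same toy assumptions as the earlier example, contrasting the uniform average (iii) with a Boltzmann-weighted average:

```python
import numpy as np

rng = np.random.default_rng(1)
dH = rng.normal(scale=0.1, size=16)          # toy energy errors H(z_0) - H(z_i)
accept_probs = np.minimum(1.0, np.exp(dH))   # proxy Metropolis acceptance probabilities

# Heuristic (iii): average the proxy transitions with uniform weights
uniform_stat = accept_probs.mean()

# Improved heuristic: weight each proxy Metropolis transition by the same
# Boltzmann weight, proportional to exp(-H(z_i)), used to sample states
# from the trajectory
weights = np.exp(dH)
weighted_stat = np.sum(weights * accept_probs) / weights.sum()

print(uniform_stat, weighted_stat)
```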