I have been talking to a colleague about why Stan (NUTS) is better than, say, RWMH, and found that I don’t understand the issue coherently, so I thought I would ask here. Broadly, the question is how to think about the break-even point in the dimension of the problem beyond which NUTS/HMC is more efficient than derivative-free methods.
While NUTS/Stan takes more time per sample, its ESS is better, so if things work well, time per effective sample will be lower. To make things concrete, let the likelihood be ℓ: ℝᵈ → ℝ (for simplicity, ℓ could be a multivariate normal), let f(d) denote the time per effective sample for a perfectly tuned RWMH, and let g(d) denote the same for NUTS. From Neal’s chapter in the MCMC handbook, one can heuristically argue that

f(d) ≈ F d²,
g(d) ≈ G d^(5/4),

for constants F and G. Is this a good way to think about this?
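To illustrate what these scalings would imply, here is a minimal sketch (assuming the heuristic forms above hold exactly, with purely illustrative values for F and G, which are not taken from any reference) that solves F d² = G d^(5/4) for the break-even dimension:

```python
# Break-even dimension implied by the heuristic scalings above:
#   RWMH cost per effective sample:  f(d) ≈ F * d**2
#   NUTS cost per effective sample:  g(d) ≈ G * d**(5/4)
# Setting f(d) = g(d) gives d* = (G / F)**(4/3).

def break_even_dimension(F: float, G: float) -> float:
    """Dimension above which the d^2 law overtakes the d^(5/4) law."""
    return (G / F) ** (4.0 / 3.0)

# Purely illustrative ratio: if NUTS's per-effective-sample constant is
# 50x RWMH's, the break-even dimension is roughly 184.
print(break_even_dimension(F=1.0, G=50.0))  # ~184.2
```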
Can we say anything about G? E.g., suppose we neglect the implementation overhead for RWMH and NUTS. How does G depend on d? Does it change the asymptotics above?
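One way to make that last question precise (a restatement under an assumed functional form, not a claim about the true scaling): if the gradient evaluation inside each leapfrog step makes the NUTS constant grow like G(d) = G₀ d^α, then g(d) ≈ G₀ d^(5/4 + α), and the comparison with f(d) ≈ F d² only flips asymptotically when α ≥ 3/4. Is that the right way to frame it?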