I was talking to a colleague about why Stan (NUTS) is better than, say, RWMH, and found that I don’t understand the issue coherently, so I thought I would ask here. Broadly, the question is how to think about the break-even point in the dimension of the problem, beyond which NUTS/HMC is more efficient than derivative-free methods.

While NUTS/Stan takes more time per sample, the ESS is better, so if things work well, the time per effective sample will be lower. To make things concrete, let the likelihood be `ℓ:ℝᵈ→ℝ` (for simplicity, `ℓ` could be a multivariate normal). Let `f(d)` denote the time per effective sample for a perfectly tuned RWMH, and similarly `g(d)` for NUTS. From Neal’s chapter in the MCMC handbook (Section 5.4.4.2), one can heuristically argue that `f(d) ≈ F d²` and `g(d) ≈ G d^(5/4)`, for constants `F` and `G`. Is this a good way to think about this?

Can we say anything about `F` and `G`? E.g., suppose we neglect the implementation overhead for RWMH and NUTS. How does `G` depend on `d` (for instance, through the cost of gradient evaluations)? Does it change the asymptotics above?