HMC (jittered) vs. NUTS on 1000-dimensional standard normal

This isn’t just about Stan algorithms. I wanted to evaluate

Wu, Stoerer, and Robert. 2018. Faster Hamiltonian Monte Carlo by Learning Leapfrog Scale. arXiv.

the punchline of the abstract says

When compared with the original NUTS on several benchmarks, this algorithm exhibits a significantly improved efficiency.

HMC uses a fixed step size and number of steps. It’s prone to fall into degenerate behavior with orbits and anti-orbits (see below for an illustration).

A common suggestion to mitigate this problem is to jitter either the step size or the number of steps (see, e.g., Neal’s MCMC using Hamiltonian dynamics).

Standard HMC is also notoriously difficult to adapt. Wu et al.'s approach is to let Stan’s current version of NUTS (not the same as in the original paper!) figure out the step size (and metric) using its warmup adaptation. Letting L be the the 95% quantile of number of steps taken by NUTS during warmup, they use uniform(1, L) leapfrog steps per iteration, with NUTS’s adapted stepsize, otherwise using standard HMC and its Metropolis-Hastings adjustment.

I only evaluated one problem so far, a 1000-dimensional normal. The jitter turns out to be super critical for stability, and NUTS turns out to do a spot-on job of estimating step size.

Here’s the evaluation, which is going to need some unpacking. It uses 500 iterations of sampling for all methods (NUTS, HMC, and HMC-jitter). It uses out-of-box defaults for NUTS, which is 1000 warmup iterations.

The top row is standard HMC with a fixed step size and number L of leapfrog steps. The bottom row is jittered HMC with uniform(1, L) steps. The columns represent L values of 4, 16, and 64 steps respectively. The x-axis represents step sizes between 0 and 1, and the vertical axis is expected effective sample size per leapfrog step. The blue lines represent performance for estimating \mathbb{E}[Y_1] and the red lines for \mathbb{E}[Y_1^2] (as a proxy for estimating variance), where Y_1 is the first dimension of Y \in \mathbb{R}^{1000}.

The horizontal blue dashed line represents default NUTS performance (which chooses 16 steps and a stepsize of 0.315, indicated by the vertical brown line) on Y_1 and red on Y_1^2.

The first thing to note is that NUTS produces far better effective sample size for Y_1 than Y_1^2. This is because its draws are anti-correlated because of the application of the no-U-turn condition that favors draws further away from the current draw. This bias in jumping helps with estimating the mean, but hurts in estimating variance.

The next thing to note is that HMC is really sensitive to step size. It’s actually not so much step size as long as the step size is stable — it’s step size times number of steps, i.e., which is the simulation time in the Hamiltonian. This has an exact solution, and smaller step sizes are better at approximating the true Hamiltonian trajectory. The spiky pattern for 64 draws is for multiples—at a small step size, the Hamiltonian makes essentially one orbit, then at multiples of that step size, it makes multiple orbits. This is not efficient for sampling.

On the other hand, we see that even basic HMC works well with L = 4 leapfrog steps. I believe this is because 4 is roughly the number of leapfrog steps away from the starting point that NUTS will select. NUTS does a multinomial draw of one of the positions along the leapfrog simulated Hamiltonian. The simulated Hamiltonian goes backward and forward and time at random in NUTS, doubling the number of leapfrog steps each NUTS step. This causes one direction in time to usually be longer than the other and then the multinomial sampling produces an average number of leapfrog steps away from the origin that is much smaller than the total leapfrog steps evaluated.

For example, suppose we go back (1), back(2), forward(4) back(8). Now we have total of 11 steps back in time and 4 steps forward. We can cheat and evaluate the expected number of steps from the origin in this case as

> (sum(sample(11)) + sum(sample(4)))/15

[1] 5.16667

We could do the Monte Carlo simulation of all the different ways the Hamiltonian can evolve in NUTS, but the answer’s gonig to be similar to this. I believe this is why 4 steps works so well. Even so, it only works as well as NUTS, not better. It’s a bit more efficient at the variances and a bit less efficient at mean estimation than NUTS.

Now what about Wu et al? That’s the bottom graph. Their algorithm would use the middle panel, because NUTS always uses a tree depth of 4 (corresponding to L = 16, which means 15 leapfrog steps—one thing we learn as computer scientists is fencepost counting!). Remember that I’ve scaled everything for compute time. With uniform(1, L) draws, the average number of leapfrog steps will be (L / 2) + 0.5 (this one’s easy if you spent your youth calculating odds for weird dice combos, or you could do sum(sample(L)) / L in R to do it by brute force (see below).

I don’t see the improvement they’re touting. Optimal jittered HMC is roughly the same efficiency here as NUTS, though with less anti-correlation of draws so closer mean and variance effective sample sizes. uniform(1, 4) works even better than uniform(1, 16) here. I think in general about 1/3 of NUTS leapfrog steps would be about right for HMC.

Here’s figure 1 from Wu et al. which is also for a multivariate normal. eHMC is their approach (where the “e” is for “empirical”).

More examples

This is the simplest possible example. I need to at least consider varying curvature, which NUTS can somewhat handle by adapting the number of steps it takes.

But this approach does seem promising. For example, we can use a simple buttonhook exploration (follow Hamiltonian until you make a U-turn back to where you started) to estimate how many steps it takes to make a U-turn. That should be all we need for warmup of number of steps. And if we use jitter HMC and crank the number of steps down to the expected number of steps away from the starting point in NUTS, we get a pretty wide target of acceptable adaptation targets for step size.

Appendix: Estimating effective sample size from standard error estimates

This is done using 100 runs of each method, computing the MCMC estimate of the relevant expectations, and computing the standard deviation of the estimates. This is an estimate of the standard error of the method. Therefore, we can bakc out effective sample size by noting that

\displaystyle \mbox{mcmc-se}[U] = \frac{\mbox{sd}(U)}{\sqrt{N_{\mbox{eff}}(U)}},

so that

\displaystyle N_{\mbox{eff}} = \left( \frac{\mbox{sd}(U)}{\mbox{mcmc-se}[U]}\right)^2.

For the example of a standard normal, we know

\mbox{sd}[Y_1] = 1

and

\mbox{sd}[Y_1^2] = \sqrt{2}.

We known Y_1^2 \sim \mbox{chiSq}(1) because Y_1 \sim \mbox{normal}(0, 1), so we know \mbox{var}[Y_1^2] = 2 and hence \mbox{sd}[Y_1^2] = \sqrt{2}.

You can evaluate this does the right thing for independent draws. For example,

M <- 500
sq_err <- rep(NA, 10000)
sd_sq_err <- rep(NA, 10000)
for (i in 1:10000) {
  y_mc <- rnorm(M)
  sq_err[i] <- (mean(y_mc) - 0)^2
  sd_sq_err[i] <- (mean(y_mc^2) - 1)^2
}
printf("Independent Draws:  mc se Y = %4.3f;  mc se Y^2 = %4.3f ess Y = %3.0f;  ess Y^2 = %3.0f",
       sqrt(mean(sq_err)), sqrt(mean(sd_sq_err)),
       1 / mean(sq_err), 2 / mean(sd_sq_err))

which runs 10,000 simulations of random draws used to estimate mean and variance and calculates the standard error estimate and effective sample size estimates

1] Independent Draws:  mc se Y = 0.045;  mc se Y^2 = 0.064;  ess Y = 493;  ess Y^2 = 496

The ESS estimates are good as we used 500 simulation draws per run, so the sample sizes are known to be 500.

Appendix 2: R Code

Here’s all the R code I used to generate everything. It took a couple hours to run on my notebook with no attempts at multiprocessing, which should be relatively easy here.

I also evaluate independent draws but that can’t be put on the ESS/leapfrog scale because there are no leapfrog steps! And also acceptance rate, which I include below.

library(ggplot2)

printf <- function(pattern, ...) print(sprintf(pattern, ...), quote = FALSE)

# ESTIMATE MONTE CARLO STANDARD ERROR FOR Y, Y^2
#
M <- 500
sq_err <- rep(NA, 10000)
sd_sq_err <- rep(NA, 10000)
for (i in 1:10000) {
  y_mc <- rnorm(M)
  sq_err[i] <- (mean(y_mc) - 0)^2
  sd_sq_err[i] <- (mean(y_mc^2) - 1)^2
}
printf("Independent Draws:  mc se Y = %4.3f;  mc se Y^2 = %4.3f;  ess Y = %3.0f;  ess Y^2 = %3.0f",
       sqrt(mean(sq_err)), sqrt(mean(sd_sq_err)),
       1 / mean(sq_err), 2 / mean(sd_sq_err))


# Stan run edited out to save time as I only need them once

# # ESTIMATE STAN MARKOV CHAIN MONTE CARLO STANDARD ERROR FOR Y, Y^2
#
# library(rstan)
# program <- "parameters { vector[1000] theta; } model { theta ~ normal(0, 1); }"
# model <- stan_model(model_code = program)
# J <- 100
# sq_err_y <- rep(NA, J)
# sq_err_y_sq <- rep(NA, J)
# for (j in 1:J) {
#   printf("Stan run = %d / %J", j, J)
#   stan_fit <- sampling(model, chains = 1, iter = 1500, warmup = 1000,
#                        refresh = 0, control = list(metric = "unit_e"))
#   sim_theta_1 <- extract(fit)$theta[ , 1]
#   y_hat <- mean(sim_theta_1)
#   y_sq_hat <- mean(sim_theta_1^2)
#   sq_err_y[j] <- y_hat^2
#   sq_err_y_sq[j] <- (y_sq_hat - 1)^2
# }
# printf("Stan NUTS (default):  mcmc se Y = %4.3f  mcmc se Y^2 = %4.3f",
#        sqrt(mean(sq_err_y)), sqrt(mean(sq_err_y_sq)))
# sampler_params <- get_sampler_params(stan_fit, inc_warmup = FALSE)


# HAMILTONIAN MONTE CARLO

momenta_lpdf <- function(p) sum(dnorm(p, log = TRUE))

hmc <- function(lpdf, grad_lpdf, init_params, epsilon, max_leapfrog_steps, num_draws, algorithm = "hmc") {
  num_params <- length(init_params)
  theta <- matrix(NA, num_draws, num_params);  theta[1, ] <- init_params

  accept <- 0;  reject <- 0
  for (draw in 2:num_draws) {

    # algorithm selects number of steps per draw
    if (algorithm == "hmc") {
      leapfrog_steps <- max_leapfrog_steps
      evals <- max_leapfrog_steps
    } else if (algorithm == "hmc-jitter") {
      leapfrog_steps = sample(max_leapfrog_steps, 1)
      evals <- max_leapfrog_steps / 2 + 0.5
    } else {
      printf("hmc(): ERROR: unkown algorithm = %s", algorithm)
      return(-1)
    }

    # random initial momentum drawn from marginal (std normal)
    p1 <- rnorm(num_params)             # initial momentum

    # leapfrog algorithm to simulate Hamiltonian Dynamics
    p <- p1                             # momentum
    q <- theta[draw - 1, ]              # position
    grad <- -grad_lpdf(q)               # gradient of potential at initial position
    for (l in 1:leapfrog_steps) {
      p <- p - (epsilon / 2) * grad;    # half step in momentum
      q <- q + epsilon * p;             # step in position
      grad <- -grad_lpdf(q)             # only update gradient here
      p <- p - (epsilon / 2) * grad;    # half step in momentum
    }
    pL <- p                             # final momentum

    # Metropolis-Hastings accept step
    accept_lp <-
      ( lpdf(q) - lpdf(theta[draw - 1, ])
        + momenta_lpdf(pL) - momenta_lpdf(p1) )
    u <- runif(1)                       # uniform(0, 1) variate
    if (log(u) < accept_lp) {
      theta[draw, ] <- q
      accept <- accept + 1
    } else {
      theta[draw, ] <- theta[draw - 1, ]
      reject <- reject + 1
    }

  }

  return(list(theta = theta,
              accept_rate = accept / (accept + reject),
	      evals = evals))
}


# STANDARD NORMAL DENSITY AND GRADIENT
N <- 1000                               # dimensionality
norm_lpdf <- function(theta) -0.5 * sum(theta^2)
grad_norm_lpdf <- function(theta) -theta


# ESTIMATE HMC STANDARD ERROR Y, Y^2
theta0 <- rnorm(N)                      # initialize from posterior, so chain is stationary
rmse_mean <- c()
rmse_sd <- c()
accept_rates <- c()
algorithms <- c("hmc", "hmc-jitter")
Ls <- c(4, 16, 64)
epsilons <- seq(0.01, 0.8, by = 0.01)
dfs <- vector(mode = "list", length = length(Ls) * length(epsilons) * length(algorithms))
M <- 500
K <- 100                               # simulations per configuration
sq_err <- rep(NA, K)
sq_sq_err <- rep(NA, K)
accept_rate <- rep(NA, K)
evals <- NA
df_hmc <- data.frame()
for (algorithm in algorithms) {
  for (L in Ls) {                    # number of steps
    for (epsilon in epsilons) {
      for (k in 1:K) {
        fit <- hmc(norm_lpdf, grad_norm_lpdf, theta0,
                   epsilon = epsilon, max_leapfrog_steps = L, num_draws = M,
		   algorithm = algorithm)
        theta_draws <- fit$theta[, 1]                         # first dimension
        sq_err[k] <- (mean(theta_draws) - 0)^2                # true value 0
        sq_sq_err[k] <- (mean(theta_draws^2) - 1)^2           # true value 1
        accept_rate[k] <- fit$accept_rate
	evals <- fit$evals
      }
      rmse_y <- sqrt(mean(sq_err))
      rmse_y_sq <- sqrt(mean(sq_sq_err))
      ess_y <-  1 / rmse_y^2       # normal(0, 1), so sd = 1
      ess_y_sq <- 2 / rmse_y_sq^2  # chisq(1), so sd = sqrt(2)
      ess_y_per_eval <- ess_y / (evals * M)
      ess_y_sq_per_eval <- ess_y_sq / (evals * M)
      avg_accept_rate <- mean(accept_rate)
      df_hmc <- rbind(df_hmc,
                      data.frame(L = L, epsilon = epsilon, algorithm = algorithm,
                                 accept_rate = avg_accept_rate,
		                 rmse_y = rmse_y, rmse_y_sq = rmse_y_sq,
		                 ess_y = ess_y, ess_y_sq = ess_y_sq,
		                 ess_y_per_eval = ess_y_per_eval, ess_y_sq_per_eval = ess_y_sq_per_eval))
      printf("%s:  L %3.0f; epsilon %4.3f;  rmse(Y) %4.3f;  rmse(Y^2) %4.3f;  ess(Y) %7.0f;  ess(Y^2) %7.0f;  ess(Y)/eval %9.3f; ess(Y^2)/eval % 9.3f; accept rate %4.3f",
             algorithm, L, epsilon, rmse_y, rmse_y_sq, ess_y, ess_y_sq, ess_y_per_eval, ess_y_sq_per_eval, avg_accept_rate)
    }
    printf("")
  }
}

# PLOT LEARNING RATE x RMSE (Y, Y^2) FACETED BY L & ALGORITHM

plot_hmc <-
  ggplot(df_hmc, aes(x = epsilon, y = ess_y_per_eval)) +
  facet_grid(algorithm ~ L) +
  geom_hline(yintercept = 1480 / (16 * 500), color = "blue", size = 0.25, alpha = 1, linetype = "dashed") +
  geom_hline(yintercept = 160 / (16 * 500), color = "red", size = 0.25, alpha = 1, linetype = "dashed") +
  geom_vline(xintercept = 0.315, color = "darkgreen", size = 0.25) +  # NUTS STEP SIZE
  geom_line(color = "blue") +
  geom_line(mapping = aes(x = epsilon, y = ess_y_sq_per_eval),  color = "red") +
  scale_x_continuous(breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1.0)) +
  scale_y_log10() +                # breaks = c(1e0, 1e1, 1e2, 1e3, 1e4, 1e5),
                                   # labels = c("1", "10", "100", "1,000", "10,000", "100,000")) +
  xlab("step size (epsilon)") +
  ylab("E[ESS] / leapfrog")
plot_hmc

The R’s rough—I’m happy to take suggestions for improving it.

You can run it if you want to see the acceptance rates. The acceptance rate is how Stan tunes, as in HMC it reduces to how well you’re simulating the Hamiltonian (divergences are when that simulation goes really wrong).

9 Likes

This is a great. I just have a few comments.

I think it will be helpful to separate out static HMC methods (which use a fixed number of leapfrog steps) from the “vintage” HMC method that used a fixed number of leapfrog steps and only considered the end point of the trajectory in a Metropolis proposal. Even with a fixed number of leapfrog steps one can improve considerably on vintage HMC by sampling from the entire trajectory instead of just using the end point.

Speaking of sampling points from a trajectory, the current implementation in Stan isn’t exactly a multinomial sample from the entire trajectory. Following the original NUTS implementation it instead first samples a sub-trajectory (i.e. one of the trajectory expansions) with a bias towards later sub-trajectories and then samples a point from that sub-trajectory. See Section A.3 of https://arxiv.org/abs/1701.02434. This is largely what causes the antithetical behavior for the means.

When considering the optimal number of leapfrog steps the vintage HMC one has to consider the error in the simulated trajectory. The expected jump isn’t quite uniform because the error has a pretty systematic structure – it starts off small, grows, then shrinks again as you move towards the other side of the energy level set. See Figure 5 of https://arxiv.org/abs/1411.6669 for an illustration of the behavior (which is actually exact for a 1D Gaussian target). In other words when the Metropolis acceptance step is taken into account the expected jump distance will stay close to initial point for a while before jumping further away once the number of leapfrog steps get long enough to probe the other side of the level set.

Also, if you extend the step sizes under consideration past 2 you’ll see the instability of the symplectic integrator arise and cause the effective sample size to plummet (although it won’t really mean much anymore as you’ll no longer have unbiased estimators).

2 Likes

Does that work in general? How do you show it has the right stationary distribution? I tried to work it out and failed as I couldn’t get it to look reversible.

Right—this is super important. And it’s a very strong bias, pretty much always selecting from the last doubling if it goes through.

I intentionally edited those out. I’ll have to digest the last points about having to take the leapfrog integrator error into account.

1 Like

I tried doing something along those lines because I thought NUTS was building some unnecessarily long and costly trajectories. Granted, it was a very crude version where a first run of NUTS was done and then based on the trace of leapfrog steps a static number was chosen – highest values, mean, whatever, it didn’t work very well and I decided to use preliminary runs and then fix the step size and mass matrix instead (which was helpful overall but kind of addressed a different problem).

But that got me thinking about the amount of computation and tuning used in NUTS vs static HMC, and maybe there’s another kind of compromise: running burn in normally with NUTS alone, fixing step size and mass matrix, then fixing number of steps for a certain number of iterations (not unlike the tuning interval in Metropolis-Hastings) and running NUTS at this iteration interval to “recalibrate” the number of leapfrog steps. In addition to that, I was thinking that the No U-Turn criterion could still be computed at every iteration to determine if HMC was wasting computations, in which case the next step would be a NUTS step regardless of it being less than the predetermined “tuning interval”.

At the limit where the tuning interval is one iteration, that of course reduces to NUTS, but even too large tuning intervals would still choose the number of steps dynamically every time the HMC trajectory U-turned. The only disadvantage would be static trajectories that were too short – those would have to wait the tuning interval to be corrected.

I think it would definitely remain to be seen if in practice this trade-off increases or decreases efficiency compared to NUTS throughout (at it may very well be problem dependent). I wanted to implement a standalone NUTS and HMC sampler both for teaching purposes and to mess with this kind of thing, as well as benchmarking HMC implementations against equivalent (in terms of language, libraries, etc) MH methods, but I didn’t have the time yet to do it and test different models.

Anyway, just some random ideas floated around, for what it is worth (if anything).

This is exactlyw hat Changye Wu did in his thesis with Christian Robert. Changye took the 0.95 quantile of number of steps, then used that with simple HMC that randomized the number of steps.

We do something like this now in adapting step size after a fixed number of iterations (which grows exponentially during warmup).

You want to evaluate some cases with varying curvature, too.

HMC is really easy to implement. NUTS isn’t that bad, but it has a lot of subtle indexing and comparison and biasing details which add up to making it a lot harder to code.

Right, but the idea would be to fix number of steps (not step size) for a predetermined interval and then tune it with NUTS, unless at any given iteration the static HMC trajectory made a U-Turn by whatever metric is being used.
So after burn-in (with fixed step size and mass matrix) it would mix static and NUTS iterations – although I refrain from calling it “hybrid” to avoid confusion with the original HMC, and “soft NUTS” is an even worse name, but it doesn’t need one since it doesn’t exist yet, so whatever.

I have implemented basic (unit matrix, fixed time step) versions of HMC and NUTS, but since the main purpose would be teaching the optimizations make it a little less readable and I preferred a functional (as in functional programming) version that looked more like the pure math descriptions in most papers

I just meant the infrastructure’s already in place in our C++ code to do these blockwise optimizations.

I don’t suspect there’d be much utiliity in mixing in static HMC for most hard problems.

Well, C++ is a bit unwieldy to me, and to be honest even if I make the effort for myself, for students in an introductory Bayesian inference that level of programming knowledge is likely to be too advanced. Like you said, HMC and NUTS implementations are reasonably straightforward, they could probably be written in a hundred or so lines of procedural code, maybe a little more in a functional implementation, but with OO I think it becomes more complicated than it needs to be for teaching purposes.

Maybe not, but it will save some computation. Whether it is worth it or not remains to be seen, I guess.

Definitely not a good teaching language! Also, not a good OO language. The main reason to use C++ is that you can effectively manage your own memory and use templates to resolve branch points statically and build expression templates. About the only place Stan uses any OO constructs (other than the odd handwritten closure here and there) is in I/O (for allowing specializations of our base variable context) and in writing new autodiff functions (which have to pay a virtualization penalty somewhere).

MacKay’s book on information theory has a simple HMC implementation in a couple dozen lines of code max. I have a simple R implementation in the original post that does a lot of recordkeeping for the plots, but is also pretty short. NUTS is just a lot more complicated.

1 Like

Yup, see Section A.2.3 of the Conceptual Introduction. The proof is pretty straightforward once you break it down into two stages – sampling a random numerical trajectory and then sampling a point from that trajectory.

True, although that’s largely due to the fact that we’re doubling and not expanding the trajectories with a smaller set of points.

For the MCMC wonks out there this same structure shows up in Metropolis methods. You can construct a valid algorithm by defining the acceptance probability using the Barker criterion,

\mathbf{P}[\text{accept}] = \frac{ \pi(x') }{ \pi(x) + \pi(x') }

or by using the Metropolis criterion,

\mathbf{P}[\text{accept}] = \frac{ \pi(x') }{ \pi(x)}

For reversible chains the Metropolis criterion strictly dominates in terms of spectral gap, and hence convergence speed.

The strict domination doesn’t necessarily carry over to this application, but the parallel is still a fun one.

Embrace the divergence. :-D

1 Like