This isn’t just about Stan algorithms. I wanted to evaluate

Wu, Stoerer, and Robert. 2018. Faster Hamiltonian Monte Carlo by Learning Leapfrog Scale.

arXiv.

the punchline of the abstract says

When compared with the original NUTS on several benchmarks, this algorithm exhibits a significantly improved efficiency.

HMC uses a fixed step size and number of steps. It’s prone to fall into degenerate behavior with orbits and anti-orbits (see below for an illustration).

A common suggestion to mitigate this problem is to jitter either the step size or the number of steps (see, e.g., Neal’s MCMC using Hamiltonian dynamics).

Standard HMC is also notoriously difficult to adapt. Wu et al.'s approach is to let Stan’s current version of NUTS (not the same as in the original paper!) figure out the step size (and metric) using its warmup adaptation. Letting L be the the 95% quantile of number of steps taken by NUTS during warmup, they use uniform(1, L) leapfrog steps per iteration, with NUTS’s adapted stepsize, otherwise using standard HMC and its Metropolis-Hastings adjustment.

I only evaluated one problem so far, a 1000-dimensional normal. The jitter turns out to be super critical for stability, and NUTS turns out to do a spot-on job of estimating step size.

Here’s the evaluation, which is going to need some unpacking. It uses 500 iterations of sampling for all methods (NUTS, HMC, and HMC-jitter). It uses out-of-box defaults for NUTS, which is 1000 warmup iterations.

The top row is standard HMC with a fixed step size and number L of leapfrog steps. The bottom row is jittered HMC with uniform(1, L) steps. The columns represent L values of 4, 16, and 64 steps respectively. The x-axis represents step sizes between 0 and 1, and the vertical axis is expected effective sample size *per leapfrog step*. The blue lines represent performance for estimating \mathbb{E}[Y_1] and the red lines for \mathbb{E}[Y_1^2] (as a proxy for estimating variance), where Y_1 is the first dimension of Y \in \mathbb{R}^{1000}.

The horizontal blue dashed line represents default NUTS performance (which chooses 16 steps and a stepsize of 0.315, indicated by the vertical brown line) on Y_1 and red on Y_1^2.

The first thing to note is that NUTS produces far better effective sample size for Y_1 than Y_1^2. This is because its draws are anti-correlated because of the application of the no-U-turn condition that favors draws further away from the current draw. This bias in jumping helps with estimating the mean, but hurts in estimating variance.

The next thing to note is that HMC is really sensitive to step size. It’s actually not so much step size as long as the step size is stable — it’s step size times number of steps, i.e., which is the simulation time in the Hamiltonian. This has an exact solution, and smaller step sizes are better at approximating the true Hamiltonian trajectory. The spiky pattern for 64 draws is for multiples—at a small step size, the Hamiltonian makes essentially one orbit, then at multiples of that step size, it makes multiple orbits. This is not efficient for sampling.

On the other hand, we see that even basic HMC works well with L = 4 leapfrog steps. I believe this is because 4 is roughly the number of leapfrog steps away from the starting point that NUTS will select. NUTS does a multinomial draw of one of the positions along the leapfrog simulated Hamiltonian. The simulated Hamiltonian goes backward and forward and time at random in NUTS, doubling the number of leapfrog steps each NUTS step. This causes one direction in time to usually be longer than the other and then the multinomial sampling produces an average number of leapfrog steps away from the origin that is much smaller than the total leapfrog steps evaluated.

For example, suppose we go back (1), back(2), forward(4) back(8). Now we have total of 11 steps back in time and 4 steps forward. We can cheat and evaluate the expected number of steps from the origin in this case as

```
> (sum(sample(11)) + sum(sample(4)))/15
[1] 5.16667
```

We could do the Monte Carlo simulation of all the different ways the Hamiltonian can evolve in NUTS, but the answer’s gonig to be similar to this. I believe this is why 4 steps works so well. Even so, it only works as well as NUTS, not better. It’s a bit more efficient at the variances and a bit less efficient at mean estimation than NUTS.

Now what about Wu et al? That’s the bottom graph. Their algorithm would use the middle panel, because NUTS always uses a tree depth of 4 (corresponding to L = 16, which means 15 leapfrog steps—one thing we learn as computer scientists is fencepost counting!). Remember that I’ve scaled everything for compute time. With uniform(1, L) draws, the average number of leapfrog steps will be (L / 2) + 0.5 (this one’s easy if you spent your youth calculating odds for weird dice combos, or you could do `sum(sample(L)) / L`

in R to do it by brute force (see below).

I don’t see the improvement they’re touting. Optimal jittered HMC is roughly the same efficiency here as NUTS, though with less anti-correlation of draws so closer mean and variance effective sample sizes. uniform(1, 4) works even better than uniform(1, 16) here. I think in general about 1/3 of NUTS leapfrog steps would be about right for HMC.

Here’s figure 1 from Wu et al. which is also for a multivariate normal. eHMC is their approach (where the “e” is for “empirical”).

**More examples**

This is the simplest possible example. I need to at least consider varying curvature, which NUTS can somewhat handle by adapting the number of steps it takes.

But this approach does seem promising. For example, we can use a simple buttonhook exploration (follow Hamiltonian until you make a U-turn back to where you started) to estimate how many steps it takes to make a U-turn. That should be all we need for warmup of number of steps. And if we use jitter HMC and crank the number of steps down to the expected number of steps away from the starting point in NUTS, we get a pretty wide target of acceptable adaptation targets for step size.

**Appendix: Estimating effective sample size from standard error estimates**

This is done using 100 runs of each method, computing the MCMC estimate of the relevant expectations, and computing the standard deviation of the estimates. This is an estimate of the standard error of the method. Therefore, we can bakc out effective sample size by noting that

\displaystyle \mbox{mcmc-se}[U] = \frac{\mbox{sd}(U)}{\sqrt{N_{\mbox{eff}}(U)}},

so that

\displaystyle N_{\mbox{eff}} = \left( \frac{\mbox{sd}(U)}{\mbox{mcmc-se}[U]}\right)^2.

For the example of a standard normal, we know

\mbox{sd}[Y_1] = 1

and

\mbox{sd}[Y_1^2] = \sqrt{2}.

We known Y_1^2 \sim \mbox{chiSq}(1) because Y_1 \sim \mbox{normal}(0, 1), so we know \mbox{var}[Y_1^2] = 2 and hence \mbox{sd}[Y_1^2] = \sqrt{2}.

You can evaluate this does the right thing for independent draws. For example,

```
M <- 500
sq_err <- rep(NA, 10000)
sd_sq_err <- rep(NA, 10000)
for (i in 1:10000) {
y_mc <- rnorm(M)
sq_err[i] <- (mean(y_mc) - 0)^2
sd_sq_err[i] <- (mean(y_mc^2) - 1)^2
}
printf("Independent Draws: mc se Y = %4.3f; mc se Y^2 = %4.3f ess Y = %3.0f; ess Y^2 = %3.0f",
sqrt(mean(sq_err)), sqrt(mean(sd_sq_err)),
1 / mean(sq_err), 2 / mean(sd_sq_err))
```

which runs 10,000 simulations of random draws used to estimate mean and variance and calculates the standard error estimate and effective sample size estimates

```
1] Independent Draws: mc se Y = 0.045; mc se Y^2 = 0.064; ess Y = 493; ess Y^2 = 496
```

The ESS estimates are good as we used 500 simulation draws per run, so the sample sizes are known to be 500.

**Appendix 2: R Code**

Here’s all the R code I used to generate everything. It took a couple hours to run on my notebook with no attempts at multiprocessing, which should be relatively easy here.

I also evaluate independent draws but that can’t be put on the ESS/leapfrog scale because there are no leapfrog steps! And also acceptance rate, which I include below.

```
library(ggplot2)
printf <- function(pattern, ...) print(sprintf(pattern, ...), quote = FALSE)
# ESTIMATE MONTE CARLO STANDARD ERROR FOR Y, Y^2
#
M <- 500
sq_err <- rep(NA, 10000)
sd_sq_err <- rep(NA, 10000)
for (i in 1:10000) {
y_mc <- rnorm(M)
sq_err[i] <- (mean(y_mc) - 0)^2
sd_sq_err[i] <- (mean(y_mc^2) - 1)^2
}
printf("Independent Draws: mc se Y = %4.3f; mc se Y^2 = %4.3f; ess Y = %3.0f; ess Y^2 = %3.0f",
sqrt(mean(sq_err)), sqrt(mean(sd_sq_err)),
1 / mean(sq_err), 2 / mean(sd_sq_err))
# Stan run edited out to save time as I only need them once
# # ESTIMATE STAN MARKOV CHAIN MONTE CARLO STANDARD ERROR FOR Y, Y^2
#
# library(rstan)
# program <- "parameters { vector[1000] theta; } model { theta ~ normal(0, 1); }"
# model <- stan_model(model_code = program)
# J <- 100
# sq_err_y <- rep(NA, J)
# sq_err_y_sq <- rep(NA, J)
# for (j in 1:J) {
# printf("Stan run = %d / %J", j, J)
# stan_fit <- sampling(model, chains = 1, iter = 1500, warmup = 1000,
# refresh = 0, control = list(metric = "unit_e"))
# sim_theta_1 <- extract(fit)$theta[ , 1]
# y_hat <- mean(sim_theta_1)
# y_sq_hat <- mean(sim_theta_1^2)
# sq_err_y[j] <- y_hat^2
# sq_err_y_sq[j] <- (y_sq_hat - 1)^2
# }
# printf("Stan NUTS (default): mcmc se Y = %4.3f mcmc se Y^2 = %4.3f",
# sqrt(mean(sq_err_y)), sqrt(mean(sq_err_y_sq)))
# sampler_params <- get_sampler_params(stan_fit, inc_warmup = FALSE)
# HAMILTONIAN MONTE CARLO
momenta_lpdf <- function(p) sum(dnorm(p, log = TRUE))
hmc <- function(lpdf, grad_lpdf, init_params, epsilon, max_leapfrog_steps, num_draws, algorithm = "hmc") {
num_params <- length(init_params)
theta <- matrix(NA, num_draws, num_params); theta[1, ] <- init_params
accept <- 0; reject <- 0
for (draw in 2:num_draws) {
# algorithm selects number of steps per draw
if (algorithm == "hmc") {
leapfrog_steps <- max_leapfrog_steps
evals <- max_leapfrog_steps
} else if (algorithm == "hmc-jitter") {
leapfrog_steps = sample(max_leapfrog_steps, 1)
evals <- max_leapfrog_steps / 2 + 0.5
} else {
printf("hmc(): ERROR: unkown algorithm = %s", algorithm)
return(-1)
}
# random initial momentum drawn from marginal (std normal)
p1 <- rnorm(num_params) # initial momentum
# leapfrog algorithm to simulate Hamiltonian Dynamics
p <- p1 # momentum
q <- theta[draw - 1, ] # position
grad <- -grad_lpdf(q) # gradient of potential at initial position
for (l in 1:leapfrog_steps) {
p <- p - (epsilon / 2) * grad; # half step in momentum
q <- q + epsilon * p; # step in position
grad <- -grad_lpdf(q) # only update gradient here
p <- p - (epsilon / 2) * grad; # half step in momentum
}
pL <- p # final momentum
# Metropolis-Hastings accept step
accept_lp <-
( lpdf(q) - lpdf(theta[draw - 1, ])
+ momenta_lpdf(pL) - momenta_lpdf(p1) )
u <- runif(1) # uniform(0, 1) variate
if (log(u) < accept_lp) {
theta[draw, ] <- q
accept <- accept + 1
} else {
theta[draw, ] <- theta[draw - 1, ]
reject <- reject + 1
}
}
return(list(theta = theta,
accept_rate = accept / (accept + reject),
evals = evals))
}
# STANDARD NORMAL DENSITY AND GRADIENT
N <- 1000 # dimensionality
norm_lpdf <- function(theta) -0.5 * sum(theta^2)
grad_norm_lpdf <- function(theta) -theta
# ESTIMATE HMC STANDARD ERROR Y, Y^2
theta0 <- rnorm(N) # initialize from posterior, so chain is stationary
rmse_mean <- c()
rmse_sd <- c()
accept_rates <- c()
algorithms <- c("hmc", "hmc-jitter")
Ls <- c(4, 16, 64)
epsilons <- seq(0.01, 0.8, by = 0.01)
dfs <- vector(mode = "list", length = length(Ls) * length(epsilons) * length(algorithms))
M <- 500
K <- 100 # simulations per configuration
sq_err <- rep(NA, K)
sq_sq_err <- rep(NA, K)
accept_rate <- rep(NA, K)
evals <- NA
df_hmc <- data.frame()
for (algorithm in algorithms) {
for (L in Ls) { # number of steps
for (epsilon in epsilons) {
for (k in 1:K) {
fit <- hmc(norm_lpdf, grad_norm_lpdf, theta0,
epsilon = epsilon, max_leapfrog_steps = L, num_draws = M,
algorithm = algorithm)
theta_draws <- fit$theta[, 1] # first dimension
sq_err[k] <- (mean(theta_draws) - 0)^2 # true value 0
sq_sq_err[k] <- (mean(theta_draws^2) - 1)^2 # true value 1
accept_rate[k] <- fit$accept_rate
evals <- fit$evals
}
rmse_y <- sqrt(mean(sq_err))
rmse_y_sq <- sqrt(mean(sq_sq_err))
ess_y <- 1 / rmse_y^2 # normal(0, 1), so sd = 1
ess_y_sq <- 2 / rmse_y_sq^2 # chisq(1), so sd = sqrt(2)
ess_y_per_eval <- ess_y / (evals * M)
ess_y_sq_per_eval <- ess_y_sq / (evals * M)
avg_accept_rate <- mean(accept_rate)
df_hmc <- rbind(df_hmc,
data.frame(L = L, epsilon = epsilon, algorithm = algorithm,
accept_rate = avg_accept_rate,
rmse_y = rmse_y, rmse_y_sq = rmse_y_sq,
ess_y = ess_y, ess_y_sq = ess_y_sq,
ess_y_per_eval = ess_y_per_eval, ess_y_sq_per_eval = ess_y_sq_per_eval))
printf("%s: L %3.0f; epsilon %4.3f; rmse(Y) %4.3f; rmse(Y^2) %4.3f; ess(Y) %7.0f; ess(Y^2) %7.0f; ess(Y)/eval %9.3f; ess(Y^2)/eval % 9.3f; accept rate %4.3f",
algorithm, L, epsilon, rmse_y, rmse_y_sq, ess_y, ess_y_sq, ess_y_per_eval, ess_y_sq_per_eval, avg_accept_rate)
}
printf("")
}
}
# PLOT LEARNING RATE x RMSE (Y, Y^2) FACETED BY L & ALGORITHM
plot_hmc <-
ggplot(df_hmc, aes(x = epsilon, y = ess_y_per_eval)) +
facet_grid(algorithm ~ L) +
geom_hline(yintercept = 1480 / (16 * 500), color = "blue", size = 0.25, alpha = 1, linetype = "dashed") +
geom_hline(yintercept = 160 / (16 * 500), color = "red", size = 0.25, alpha = 1, linetype = "dashed") +
geom_vline(xintercept = 0.315, color = "darkgreen", size = 0.25) + # NUTS STEP SIZE
geom_line(color = "blue") +
geom_line(mapping = aes(x = epsilon, y = ess_y_sq_per_eval), color = "red") +
scale_x_continuous(breaks = c(0, 0.2, 0.4, 0.6, 0.8, 1.0)) +
scale_y_log10() + # breaks = c(1e0, 1e1, 1e2, 1e3, 1e4, 1e5),
# labels = c("1", "10", "100", "1,000", "10,000", "100,000")) +
xlab("step size (epsilon)") +
ylab("E[ESS] / leapfrog")
plot_hmc
```

The R’s rough—I’m happy to take suggestions for improving it.

You can run it if you want to see the acceptance rates. The acceptance rate is how Stan tunes, as in HMC it reduces to how well you’re simulating the Hamiltonian (divergences are when that simulation goes really wrong).