Adaptation of timestep in pure HMC

felixmueller · February 6, 2020, 2:12pm

I am working with a very simple system and fail to get the adaptation (of the stepsize) working.
For the record, my system consists of multiple univariate Gaussian variables with mean mu=0 and the provided data are the variances (model block: y ~ multi_normal(mu, matCov);)

Now when I use a stepsize which I know is good for the system I get a high acceptance rate and not very accurate sampling but that’s okay for now. However when I don’t provide a stepsize and want to adaptive algorithm to figure it out, it finds a stepsize which is two orders of magnitude bigger than the one I know is working (roughly 0.01 versus 1). This obviously leads to the acceptance rate being 0.
This happened with a) 1K burnin iterations with int_time=200 and also with b) 20K burnin iteration with int_time=1.

Are there any input setting for the adaptation I should be playing with? I know there are a couple but had too little insight to use them wisely yet.
Is the stepsize that the adaption finds in the same scale/units as the one I provide explicitely? Should I take care of something there?

Bob_Carpenter · February 8, 2020, 12:00am

The acceptance rate is the key thing to control (that’s parameter adapt_delta and setting it to 0.95 or 0.99 or higher should have some effect). When that goes up, stepsize has to go down.

Adaptation is on the unconstrained scale. That’s the same as the natural scale if there are no constrained parameters.

felixmueller · February 10, 2020, 9:38am

Thanks for the quick answer. I’ll try that out.

felixmueller · February 27, 2020, 1:12pm

Sadly, two simulation with adapt_delta = 0.95 and adapt_delta = 0.99 did not change the result. The resulting stepsize is very close to 1 and the acceptance rate is exactly 0.

As I do not have any constrained parameters, the scale does not seem to be a problem.

I double-checked the data entries, but even if they were wrong somehow (e.g. I mixed up variance and standard deviation), of course, the “good” stepsize would be a different one but nonetheless stan should be able to find a stepsize that provides a higher acceptance rate than 0. ;) I am baffled.

My model is incredibly simple so this is a last resort, but I will put it here and am very glad if someone double-checks my syntax. I provide a diagonal covariance matrix and want to sample from a multivariate normal distribution with mean=mu=0 and variance=sigma=‘the matrix I provide’. The model code is attached: model_Gaussian.stan (456 Bytes).

Max_Mantei · February 27, 2020, 4:47pm

Hey! Maybe I’m completely missing the point, but I don’t see what’s wrong.

Let’s simulate some data:

n_dim <- 50
matCovSquaredFreqs <- diag(rnorm(n_dim, 1, 1)^2)
stan_data <- list(n_dim = n_dim, matCovSquaredFreqs = matCovSquaredFreqs)

Running your model gives reasonable results. (By the way, you should never use multi_normal() with a diagonal matrix – that’s very inefficient. Actually, you could get the same results with any RNG, e.g. rnorm() in R – I guess I’m really missing the point here?)

Running with adapt_delta = 0.8 (default)

p0.8 <- sampling(sm, data = stan_data) # sm is the compiled stan model

tibble(Stan = summary(p0.8)$summary[-51,"sd"], 
      truth = sqrt(diag(matCovSquaredFreqs))) %>% 
  ggplot(aes(x=truth, y=Stan)) + 
  geom_abline() + 
  geom_point() +
  ggtitle("True SD's vs. Stan results (delta_adapt = 0.8)")

bayesplot::nuts_params(p0.8) %>% 
  filter(str_detect(Parameter, "accept|step")) %>% 
  ggplot(aes(x = Iteration, y = Value, color = factor(Chain))) + 
  geom_line(show.legend = F) + 
  facet_grid(Parameter~Chain, scales = "free") + 
  ggtitle("NUTS parameters with delta_adapt = 0.8 (default)")

Rplot

Rplot01

Running with adapt_delta = 0.9999

p0.9999 <- sampling(sm, data = stan_data, control = list(adapt_delta = 0.9999))

tibble(Stan = summary(p0.9999)$summary[-51,"sd"], 
       truth = sqrt(diag(matCovSquaredFreqs))) %>% 
  ggplot(aes(x=truth, y=Stan)) + 
  geom_abline() + 
  geom_point() +
  ggtitle("True SD's vs. Stan results (adapt_delta = 0.9999)")

bayesplot::nuts_params(p0.9999) %>% 
  filter(str_detect(Parameter, "accept|step")) %>% 
  ggplot(aes(x = Iteration, y = Value, color = factor(Chain))) + 
  geom_line(show.legend = F) + 
  facet_grid(Parameter~Chain, scales = "free") + 
  ggtitle("NUTS parameters with adapt_delta = 0.9999")

Rplot02

Rplot03

Hope this helps? If this went into a completely wrong direction, please feel free to ignore me. ;)

Cheers,
Max

felixmueller · February 28, 2020, 12:53pm

I appreciate your effort immensely, thanks! It some points it does help, it some points it does not.
I am aware that the structure and model I am using is absolutely overkill for what I want to do. I am atm trying to replicate results from my own HMC algorithm in stan, which is why I am going for this structure, it’s a toy problem so to say.

Because of this, I am not running with NUTS but only with the “basic” HMC algorithm in Stan. And because you showed that it does work (very well) using NUTS, the problem must be somewhere there.

I just put the file here, so that any obvious errors could be caught although the whole thing is so simple anyway which makes it more confusing to me that the adaptation does not give me a proper stepsize.

Max_Mantei · February 28, 2020, 1:07pm

Ahhh! Got it! And now is see that you wrote “pure HMC”. Okay, then just ignore my post. ;) Unfortunately, I’m also not able to help you with this. (The Stan model looks fine from what I can tell.)

I don’t know if he’s got time to chime in, but I bet @betanalpha has all the answers – or he can point you to them. ;)

betanalpha · February 28, 2020, 4:13pm

A short word on notation: “pure” and “basic” aren’t well-defined and hence prone to confusion. The question being asked here is about HMC with a static integration time. One of the advantages of Stan is that we can avoid these confusions by showing the full Stan call with all of its options.

Models with static integration time can be problematic if the integration time isn’t chosen well enough, especially for a Gaussian target density where the Hamiltonian trajectories can be limited to tiny orbits if the variances are related by integer ratios. Without knowing the data variances it’s hard to say exactly how much the algorithm is being stressed during adaptation and how reasonable the empirical results might be.

The adaptation routines themselves are the same for HMC with dynamic and static integration times, so problems are most likely due to limited exploration early on. The default adaptation configuration is tuned for dynamic HMC and so might be too aggressive for static HMC.

To do anything more than speculate I would need to see the exact data variances, the full Stan call, the output diagnostics (divergences, etc), and the adapted step size and inverse metric (accessible from the fit object).

Topic		Replies	Views
Details on how Stan adaptively tunes the HMC parameters? (i.e. mass matrix, step size and leapfrog steps) General algorithms	4	715	February 20, 2023
Step size continues to change during the sampling period when stepsize!=1 CmdStan	4	445	September 12, 2023
Using a fixed step size for HMC (or NUTS) and other sampler options Algorithms mcmc	21	3477	June 27, 2019
Issue with dual averaging Algorithms	63	4188	April 12, 2021
Adapt_delta Modeling	4	7234	March 7, 2023

Adaptation of timestep in pure HMC

Running with adapt_delta = 0.8 (default)

Running with adapt_delta = 0.9999

Related topics