Much lower effective sample size with 2.15

Just noticed this with 2.15. In 2.14, the following simple model consistently gets an effective sample size (n_eff) above 2000, typically around 2700:

import pystan
model_code = 'parameters {real y;} model {y ~ normal(0,1);}'
model = pystan.StanModel(model_code=model_code, model_name="normal2")
fit = model.sampling(iter=4000, chains=2)
print(fit)  # the summary reports n_eff for y

With 2.15 it’s consistently less than 2000, around 1500. Was this expected?

edit: added the missing StanModel line of PyStan code

Thanks for bringing this up.

We may need to adjust the default target accept rate. Michael changed the way the accept rate was calculated by excluding the initial state from the numerator and denominator.
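
My reading of that change, as a sketch rather than a quote from the code: with $z_0$ the initial state, $z_1, \dots, z_L$ the other states in the trajectory, and $H$ the Hamiltonian, the averaged acceptance statistic goes from

$$
\bar{\alpha}_{\text{old}} = \frac{1}{L + 1} \sum_{l = 0}^{L} \min\!\left(1, e^{H(z_0) - H(z_l)}\right)
\quad\text{to}\quad
\bar{\alpha}_{\text{new}} = \frac{1}{L} \sum_{l = 1}^{L} \min\!\left(1, e^{H(z_0) - H(z_l)}\right).
$$

Since the $l = 0$ term is always exactly 1, dropping it lowers the reported acceptance statistic, most noticeably for short trajectories.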

I found that if you lower adapt_delta to around 0.5, it consistently performs well. But total n_eff is quite sensitive to that setting, dropping off quickly below 0.5 and more slowly above 0.5. What worries me is that turning adapt_delta up doesn't increase n_eff per iteration.

With a target adapt_delta of 0.5 and 10K warmup iterations, it hits the target acceptance rate reliably. With a target of 0.7, it consistently overshoots, settling around an acceptance rate of 0.9 after 10K warmup iterations, and with a target of 0.4 it consistently undershoots, settling around 0.2.
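
For anyone who wants to poke at this, the sweep I'm describing looks roughly like the following sketch (not the exact script I ran; the deltas and iteration counts are just illustrative):

import pystan

model_code = 'parameters {real y;} model {y ~ normal(0,1);}'
model = pystan.StanModel(model_code=model_code, model_name="normal2")

for delta in (0.4, 0.5, 0.7, 0.8):
    # 10K warmup plus 2K sampling iterations per chain; adapt_delta sets
    # the target acceptance statistic for step size adaptation.
    fit = model.sampling(iter=12000, warmup=10000, chains=2,
                         control={'adapt_delta': delta})
    # Average achieved acceptance statistic over the post-warmup draws.
    accept = [p['accept_stat__'].mean()
              for p in fit.get_sampler_params(inc_warmup=False)]
    print(delta, sum(accept) / len(accept))
    print(fit)  # the summary includes n_eff for y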

I just did the comparison in CmdStan to check and with the same seed (and across a few different seeds) I get exactly the same results for adaptation and effective sample size between the current develop and v2.14.0.

I think the bigger issue is that the adapt_delta parameter now means something fundamentally different. As the number of leapfrog steps increases, the old and new acceptance statistics converge to the same value, but at low tree depths they're scaled quite differently.
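
To make the scaling concrete, here's a rough back-of-the-envelope sketch (my own numbers, using the same reading of the change as above): if each non-initial state in a trajectory of L states contributes roughly the same acceptance probability a, while the initial state always contributes exactly 1, then

def old_accept(a, L):
    # average over the initial state (which contributes 1) plus L other states
    return (1.0 + L * a) / (L + 1)

def new_accept(a, L):
    # average over the L non-initial states only
    return a

for L in (1, 3, 7, 1023):
    print(L, round(old_accept(0.6, L), 3), new_accept(0.6, L))
# L = 1 gives 0.8 vs 0.6, while L = 1023 gives 0.6 vs 0.6 to three decimals,
# so the two statistics only line up once the trees get deep.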

When I did what Michael suggested at the Stan meeting yesterday and cranked the dimensionality up to 100, I got n_eff = 1000 for 1000 post-warmup iterations (in one chain; sorry, Andrew) for almost all of the y[1:100].

For the original run, I didn’t control for seed, but I did run a handful of times and the results were just like what Allen reported with consistently lower n_eff in 2.15. How did you verify that you were truly running two different versions? I’d be suspicious if you literally got exactly the same results.

I also found that with n=100, it hit the target adapt_delta of 0.8 consistently.
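
For reference, the kind of run I mean looks roughly like this (a sketch, not the exact script; the iteration settings are just placeholders):

import pystan

# N-dimensional standard normal, with N passed in as data so the same
# program covers both the 1-D and the 100-D cases.
model_code = """
data { int<lower=1> N; }
parameters { vector[N] y; }
model { y ~ normal(0, 1); }
"""
model = pystan.StanModel(model_code=model_code, model_name="normal_nd")
fit = model.sampling(data={'N': 100}, iter=2000, warmup=1000, chains=1)
print(fit)  # n_eff for y[1] ... y[100] should come out close to 1000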

Yeah, I cloned a fresh copy of CmdStan and then:

git pull --ff-only
git checkout v2.14.0
git submodule update --init --recursive
make clean-all
make gauss
./gauss sample num_warmup=2000 num_samples=2000

I just repeated to be sure and again saw the same results compared to develop (running in an independent directory).

Technically there is no right answer, as our current implementation of HMC is not a Metropolis scheme and is ultimately more complex than the algorithm assumed in the optimal step size paper. Hence everything is a bit heuristic; the updated version is just a more robust heuristic.

Again, the low-dimensional Gaussian is a challenge because the integrator's instability threshold is very close to the optimal step size, so the optimization surface isn't nice and flat. This stresses the dual averaging, leading to step sizes that are biased above the optimum. Ultimately the low-dimensional Gaussian is easy, and the defaults have been chosen to perform well on the more complex models we're really targeting.
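
To see why the 1-D case is awkward, here's a toy check of my own (a sketch, not anything from the Stan code base): for the unit Gaussian, the leapfrog integrator for H(q, p) = (q^2 + p^2) / 2 is stable only for step sizes below 2, so the region of useful step sizes and the divergent region sit right next to each other.

def energy_error(eps, steps=100):
    # Leapfrog for H(q, p) = (q^2 + p^2) / 2; returns |H_final - H_initial|.
    q, p = 1.0, 1.0
    h0 = 0.5 * (q * q + p * p)
    for _ in range(steps):
        p -= 0.5 * eps * q   # half step for the momentum
        q += eps * p         # full step for the position
        p -= 0.5 * eps * q   # half step for the momentum
    return abs(0.5 * (q * q + p * p) - h0)

for eps in (0.5, 1.0, 1.9, 2.1):
    print(eps, energy_error(eps))
# The energy error stays bounded for eps < 2 and blows up just above it.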

I gather it would be good if @betanalpha could reproduce this? Or does it not matter? By my count, we've got two of three people observing this; if we count the CI as an additional vote, it's three of four.

It probably doesn't matter; again, compare the 1-D examples with the higher-dimensional ones. I'm happy to look into it more, but I'll need somebody to reproduce it using the same seed, preferably in CmdStan.

We were also discussing this on some other list (or maybe I was just shouting over my shoulder to Michael?). I meant to report back here, but apparently didn't. I ran a 100-dimensional standard normal, and the n_eff values are all equal to the number of sampling iterations. I don't think 1-dimensional problems are really what we should be measuring; they should work, but they're not what we want to optimize the default tuning for.

Glad to hear it's not a problem! I agree that higher dimensions are what we care about.