# Momentum Diffusion HMC?

@betanalpha and other HMC experts

I’ve been thinking a bunch about why my large dimensional models have issues in getting into the typical set and sampling without divergence, and why large treedepth is needed etc. I see in the manual that simplexes often require these small steps for stability…

But in thinking about all of this it occurred to me that one possible solution to the long treedepth and soforth would be some kind of momentum diffusion / thermostat algorithm.

As I understand it, HMC lifts the parameter space (location) to a product space in which we have location and momentum… iterates the proposal forward and backward… then projects back down to location space and chooses one of the points on the trajectory according to multinomial distribution based on the lp (something like take the lp along the trajectory, subtract the max for numerical stability, exponentiate, then normalize to sum = 1). The whole trajectory is on a single energy level.

Now, suppose instead of discrete jumps in momentum space (lift, evolve, project) where lift and project are discrete jumps… instead we had something like a diffusion in momentum?

call q location and p momentum then we can have something like (ignoring the correct specifics of symplectic integrators, hopefully experts would fill in details here appropriately)

q[i+1] = q[i] + dt p[i]
p[i+1] = p[i] + dt F(q[i]) + dp[i]

where F, the generalized force, comes from the negative gradient of the log-probability and dp is a random perturbation of the momentum at each time point.

All this reminds me of some ideas I read about a while ago by Denis Evans regarding thermostats in molecular dynamics. In particular, it seems that we want our trajectory to be such that the potential energies are all about the same and equal to the potential energy in the typical set. So, the idea would be to seek out a “temperature” for a “thermal bath” that leads to our trajectories flying around in the typical set. Then, after flying around for a while, simply sub-sampling the one LONG trajectory using multinomial method with probability determined by the lp of the points would get us a sample that “works”

Furthermore, this would actually use all the intermediate points along the path, instead of just one point from each level set.

I recognize this is pretty hand-wavy, but it’s just an intuition I had that I though I’d throw out there to see if anyone had already thought about this, or if there are useful directions to go in related to this intuition.

Thinking this over… clearly the big issue with doing one long trajectory with momentum diffusion and then sub-sampling it (a kind of importance sampling I guess) is that you can’t store ALL the intermediate steps, so you’re going to have to downsample along the way anyway.

Still, I like the idea of thermostatting, starting out at random initial locations, ramping up the kinetic energy until you move out of the deep tails as fast as possible, and then thermostatting the kinetic energy into an appropriate range once you fall into the appropriate typical set.

The momentum diffusion process is more or less what you get in the limit of small treedepth, and for the most part I’ve found that small treedepth gets me into the typical set in far less wall-time than long treedepth. My intuition on this is that treating the sampler as a ball rolling around in a complex high dimensional space, it has a tendency to fall towards the bottom of the gravitational well, which is the peak density location. The frequent resampling of the momentum in essence brakes this fall, so that it doesn’t overshoot and head far out into the tails like a comet. Obviously, regarding treedepth, if you resample the momentum too quickly, you don’t get the benefit of falling towards the black hole… but if you resample too infrequently, you get the comet-like behavior. The ideal behavior would maintain the kinetic energy in a narrow region (a thermostat) so that it can fall towards the black hole but not gain enough energy to overshoot. This corresponds to there being an optimum treedepth, and I have found that is the case. treedepth of 3 or 4 is rarely good, treedepth above 8 or 9 is usually too “cometlike”, somewhere in the 6 to 8 range gets me into the typical set in the least wall time

Once in the typical set, however, I would expect to get better results from longish trajectories between complete resamplings, as is currently the main idea for the sampler. An alternative is to perturb the velocities slightly at each step instead of completely reset them every N steps.

The advantage to slight perturbations and momentum diffusion is that it seems like it would be less subject to difficulties in the choice of momentum proposal (which I think is sort of dual to mass matrix estimation). Basically each momentum proposal has a large component that is determined by the local generalized force, and a small component that is determined by random perturbation kernel. As long as the generalized force dominates, you tend to go in directions that “make sense” whereas if you project momentum down to zero and then propose a completely new momentum (lift), you have a definite potential for kicking the puck right at the walls of the trench in many dimensions and might start each trajectory bouncing off the walls until you settle down into a useful direction, getting divergences, and generally requiring tiny time-steps… this would reproduce the behavior I see when I have many tightly identified parameters. Pretty much any momentum you give to a tightly constrained parameter is going to make it bounce back towards the peak very stiffly, and require tiny time-steps (for example a simplex where one of the entries is very tightly constrained near 0), and so it never goes very far, whereas if there are a few far less constrained parameters, the tiny time-step ensures that they don’t go very far, and the treedepth blows out to infinity so that something normal(0,10) can get out near 10 in 100,000 time-steps of 0.0001 so each iteration wants treedepth = 15 just for that one parameter

Of course you might say “gee you need a reparameterization” but particularly with simplexes this just seems to be not that workable. Right now I’m working with a model of alcohol drinking behavior based on survey data, and I have a 7 dimensional simplex for each person representing the fraction of days they drink 0,1,2,3,4,5,6+ drinks. Most people drink 6+ drinks approximately zero days out of the year (same for 5, and even 4 is not that common), so the 7th entry of the simplex is tightly constrained near 0 for them… but around 1 percent of people drink that much 50-100% of the time, and they drink 0 or 1 drinks also basically 0 days out of the year, so a different dimension is tightly constrained for them.

my intuition about small timesteps being required is that every momentum resampling is probably slamming those constrained entries into the wall of their tight channel… but a momentum diffusion process would rarely put much momentum on those tightly constrained molecules.

Hmm… there’s also this idea:

q[i+1] = q[i] + dt p[i]

p[i+1] = p[i] + dt F(q[i] + dq[i])

with dq distributed say normally, or possibly according to the jump forward or backward method used in SPSA: http://www.jhuapl.edu/SPSA/

now with i indexing time, and j indexing dimension:

F(q[i] + dq[i]) = F(q[i]) + sum(partial(F,j)dq[i,j])

which now seems to automatically take the local stiffness into account in giving us a new momentum proposal. It can be thought of as a random momentum proposal that’s automatically on average aligned with the correlations in this region of space.This gives us a stochastic local estimate of the curvature without having to iterate the auto-diff stuff I think. The inexactness of the estimate of the local stiffness probably doesn’t harm us much so long as we understand that we’re not trying to stay right on the constant energy level set, we’re intentionally diffusing across level sets at each step.

@betanalpha am I off the rails here? let me know if this is mostly noise.

You should check out Michael’s arXiv papers on Riemannian extension to NUTS and on adiabatic sampling.

Random perturbations almost never help in high dimensions. As is, the momenta are updated to preserve the Hamiltonian (barring numerical errors). What you have to do for this to make sense at all is prove detailed balance (or more generally, that you get the right stationary distribution).

Bob, I just saw this post on Andrew’s blog from a while back, and I think it’s summarizing a similar thing (or at least, a similar region of idea space) to what I’m proposing here: http://andrewgelman.com/2014/05/20/thermodynamic-monte-carlo/ when I read the paper by Betancourt I just don’t have the background to follow it exactly, contact manifolds and soforth, but it sounds very similar.

My idea is more or less to have an external microwave oven / microwave cooler that Andrew mentions ;-) There are two issues:

1. Starting at a random initialization point, you are far out of equilibrium. You want to get into equilibrium quickly, the way to do this is bonk the system with some initial kinetic energy, and then let it fall into its equilibrium potential energy level, but as it does this, it gains a lot of unwanted kinetic energy because the Hamiltonian dynamics preserve total energy, so you apply the microwave cooler to suck all the excess kinetic energy out of the system. It’s a little like your initial proposal is always nitroglycerine in its liquid state, and your initial action is to smack it with a hammer… So it’s always going to explode releasing tons of potential energy into kinetic. You don’t want that, you want to suck that kinetic energy out of the system so you can see what the gas at standard temperature and pressure looks like. So you imagine a super sophisticated thermostat that can beam individual photons at individual molecules to slow them down (a kind of laser cooler). As the explosion occurs, huge amounts of precisely targeted photons cancel out the momentum of all the molecules causing it to instead of expanding like a bomb, just expand like a balloon being inflated to just the right size.

This sucking the kinetic energy out is in essence what happens when I set my treedepth small, it causes more frequent momentum updates, and each time you do a momentum update you are sucking energy out of the system because it’s in the phase where it’s falling down from its deep tail position so it’s gaining kinetic energy so the new momentum proposal tends to be lower kinetic energy. Eventually you’re in the region where you hit equilibrium, and then you mostly want to keep the kinetic energy in that region so you can explore the appropriate slice of potential energy.

This super-sophisticated laser thermostat is in fact exactly the mechanism described by Evans in this paper:

http://williamhoover.info/Scans1980s/1983-3.pdf

At each time step, you invent a body force on each molecule that either adds kinetic energy or sucks it out of the system in an appropriate way. Another way to think about it is as a Lagrangian system with a thermodynamic non-holonomic constraint. Hamiltonian dynamics typically doesn’t deal with that kind of non-holonomic constraint well. Evans uses Gauss’s principle to invent the appropriate force. I see in a description here: http://www.math.ucsd.edu/~y1zhao/ResearchNotes/ResearchNote007Thermostat.pdf that the Gaussian thermostat is a special case of the Nose-Hoover thermostat, but it’s the one which according to Evans (linked previously above) minimizes a certain distortion of the phase space…

So, part (1) is really about initialization, how to get into the typical set in very little wall time. It’s not actually relevant to the question of whether you’re getting the right long term distribution, as it’s just a mechanism for during adaptation and initialization. You could do just momentum rescaling for example, and it’d suck the energy out fine and get you into the typical set quickly. But, if you invent a sophisticated thermostat like the Gaussian thermostat, you can use it to continuously perturb the system even during the sampling dynamics.

1. RE Getting a sample from the right distribution: In high dimensional spaces, we have the typical set. All the probability mass is in the region whose log-probability-density is within epsilon of some fixed constant. The lp is the potential energy, so the idea here is once you have a thermostat that helps you get into the right typical set region quickly during initial adaptation… you can use the thermostat to help you explore the typical set as well. Suppose you just get yourself a sample which has potential energy that is uniformly distributed between lp* ± epsilon. This is the typical set, but the weights are not quite right. A sample which is uniform on the typical set is probably a good approximation on the basis of the exact same logic of why the typical set exists… but a uniform sample of energy levels around the lp* value can be thought of as an umbrella sampling scheme. Think of a chi-squared distribution, it’s the distribution of the potential energy in a multivariate normal. Well imagine approximating the samples of the multivariate normal by sampling from a uniform distribution in energy that fits like a square over the high probability region of the chi-squared (that is, samples on shells of radius r ± epsilon). Now if you have the uniform sample on energy, and you do importance sub-sampling from this sample, you can reshape the energy distribution to fit the chi-squared shape, you wind up with a sample that has the properties you need. So it’s not really about getting the dynamics to give you the appropriate stationary distribution, it’s more about using the dynamics to get you a simpler stationary distribution that is “close” to the correct one, and then at a final stage correcting the sample by sub-sampling. Another option though is to dynamically alter the weights you’re giving to different energy levels, that is, to learn the thermostatic energy schedule needed. in the chi-squared example, you’d want to set the thermostat to move towards the core of the chi-squared when you’re in the tail, and stay in the core longer when you’re in the core, so that you get the right chi-squared distro. Since the tails of the energy distribution don’t help you much, doing treedepth = 15 while you’re deep in that tail is probably a waste of time.

Now, umbrella sampling a uniform band around the lp of the typical set might seem wasteful, you have all these samples you’re throwing away… But as it is, in many of these big models I’ve experienced, we’re doing 1000 to 32000 dynamic steps and throwing all but one of them away at each iteration. If I ask for 200 samples, I’m first doing 1 Million timesteps and then throwing away all but 200 of the 1 Million positions I visited… If I could get those 200 samples from say 20000 positions, I’d be speeding things up by a factor of a hundred. Since setting treedepth = 7 or 8 is getting me into the typical set during the initial phases a factor of 10 or 100 faster than treedepth = 15… I suspect there is plenty of improvement to be had along these lines.

Of course, the question is, under what specific circumstances and algorithms does this all make sense and give you the appropriate final sample from the posterior distribution that you wanted, and do so in very little wall time. And I freely admit to not having any proofs of correctness or anything like that. It seems already that @betanalpha has thought of some of this stuff with his Adiabatic sampling idea, but he’s cast it in the language of the geometry of manifolds, and I don’t know enough about that to say exactly where things overlap vs not.

One thing that might be helpful is to have an example model that has this pathological behavior of wanting long treedepth and soforth. I may be able to provide that soon.

@Bob_Carpenter

As for correctness of sampling, here’s my current thought. Suppose our system has many parameters, so that its dimension D is large, and X is a sample from a Base measure equal to the distribution proportional to Indicator(|log(p(x)) - lp*| < \epsilon). Then if we sub-sample X with probability proportional to p(x_i), since the density on the base measure is constant, the sub-sample is a sample from p(x) truncated to the epsilon typical set. Since for dimension D sufficiently large, the typical set contains 1-\epsilon of the probability mass, the resulting sample can be made sufficiently close to a sample from the un-truncated distribution that for finite sample size the monte carlo error is sufficiently small for government work (proof not given, but I think the direction to go here is straightforward and follows from the existence of the typical set, clearly \epsilon is a tuning parameter and probably optimally depends on things like the dimension of the model and the desired fidelity/sample size).

Now, suppose D(x,t) is a dynamical evolution of a coordinate x such that it follows a Hamiltonian evolution plus a random force. This is similar to the random force produced by the process described in this paper: http://proceedings.mlr.press/v32/cheni14.pdf

In that paper in theorem 3.1 they show that the entropy of the distribution of their simplistic stochastic gradient system increases with time. I think a similar technique could be used to show that adding random forces not caused by sub-sampling the likelihood but instead simply intentionally injected into the dynamics by the sampler also causes the entropy of the distribution to increase in time as well. Exercise left for the reader ;-) my favorite idea is to compute the hamiltonian generalized force based on evaluating the gradient of p(x) at a random x+dx perturbation to the current location, because I think this automatically locally adapts, but it’s just an intuition, really any constant diffusive heating should increase the entropy and make the long-term distribution more uniform.

If the entropy of the distribution increases continuously with time, and the support of the distribution is bounded, as it is when it’s restricted to the indicator function as defined in the first paragraph, then the dynamics tends to the maximum entropy distribution which is uniform on the bounded subset. In other words, for sufficiently long trajectories the trajectory can be treated as a random sample from the indicator distribution originally discussed.

So, running sufficiently long, if we sample from the hamiltonian monte-carlo system using the density p(x) plus random forces, and we continue in time forcing our dynamics to stay within the sufficiently large epsilon band and constantly adding entropy in the form of random forces perturbing the trajectory, then after producing a long trajectory, we sample from our path with probability proportional to p(x) the ensemble of samples we get is an ensemble from p(x) because the trajectory tends towards the uniform density on the bounded set and the sampling proportionate to p(x) corrects the measure to the one we want.

Now, obviously for 3 or 4 or 5 parameters this probably doesn’t work. But as an algorithm for use in models with 50 or 100 or 1000 or 25000 parameters it sounds promising. The key is to initialize the system somewhere, thermostat it down into equilibrium lp, then choose an epsilon, and send the hamiltonian dynamics along with constant perturbation forces, and forcing it to stay in the epsilon band. Finally, every so often we sub-sample the trajectory and keep only the sub-sample points.

But, how do we keep the system in the epsilon region? There are a number of possibilities. One would be to run until you hit the epsilon band, stop, and restart somehow. Another would be to set up two epsilon bands, one larger than another. When the dynamics leave the outer epsilon band, you turn on the thermostat, either injecting or removing heat until you get back into the outer epsilon band. When it comes to decide which samples to keep, you set the probability of those samples outside the inner epsilon band to zero, and sub-sample the remaining ones from the inner band proportionate to p(x).

The intuition on all of this is basically as follows:

1. Hamiltonian dynamics preserve energy, but we start far out of equilibrium… in the initial state of the algorithm a thermostat will suck the excess kinetic energy out of the system getting us rapidly into the typical set rather than having the system explode with kinetic energy like nitroglycerine being whacked with a hammer.

2. Symplectic Hamiltonian flow follows a single energy level. When there are large differences in scale, particularly when some parameters are very well identified and others aren’t, and the mass matrix isn’t well estimated yet, the time-step required to avoid divergences is very small, and very long trajectories can be required. On the other hand, long trajectories on a single energy level limits the exploration of the space, and a diffusive momentum noise, particularly one aligned to the local forces as you get from F(x+dx) for dx a random small perturbation to the current location would cause the system to explore a larger region of space. This extra noise is a kind of umbrella sampling of a smoother distribution which is potentially easier to follow, and has less curvature.

3. The random momentum noise on the other hand tends to increase the entropy with time, leading towards a uniform distribution and giving the wrong answer.

4. For high dimensional models, so long as we constrain the dynamics to be in an epsilon region around the lp* value of the typical set, if the dynamics tend towards uniform on the typical set, this is ok, we can easily correct it by sub-sampling locations proportional to p(x)

5. To keep ourselves in an epsilon region around the typical set, we can invoke the thermostat when we get well outside this region, adding heat when our potential energy is too low, forcing us on average back up into the typical set and subtracting heat when our potential energy is too high forcing us out of the tails back down into the typical set. So long as the thermostat only acts when we’re outside the epsilon region, the uniformity of the measure within the epsilon region is preserved while we spend most of our time in the vicinity of the typical set, and can figure out how to correct the measure easily due to the entropy generation leading to uniformity on the epsilon region.

6. finally, once we have a very good approximate sample from this diffusive umbrella-hamiltonian thingy corrected by sub-sampling proportional to p(x), if we want to apply a further correction, we can fuzz all the locations by a simple diffusive metropolis proposal and accept/reject to further correct the ensemble towards the correct distribution (?? seems like a good idea at least).

Hmm… well I guess that’s it, as much as I’ve got intuition wise. Maybe what I should really do is code it up somehow on an example model and see empirically how the idea performs.

I see that Radford Neal already anticipated me, which is of course to be expected ;-)

See section 5.3 partial momentum refreshment which is closely related to this proposal. He mentions that it is in fact slightly more effective than standard HMC in the example problem to which it was applied but not necessarily enough to be worthwhile in those cases. However, there’s more to the whole idea than just partial resampling of the momentum. I think the idea of intentionally increasing the entropy through time rather than leaving the canonical distribution unchanged, and thereby sampling from a uniform band of potential energies which we importance sample down to the original distribution could have merit in dealing with problems where divergences occur due to high curvature. In essence we wind up allowing the transition to invent energy to tunnel through regions of high potential where we’d otherwise curve wildly, and then deal with the problem later by downsampling the trajectories.

At this point, I think I’m just archiving a bunch of ideas here, but anyone else please feel free to chime in. From my perspective I’m very tempted at this point to try some numerical experiments to see what the behavior is under different epsilons and thermostats and momentum noise terms etc.

If you’re testing, you want to test calibration with something like Cook-Gelman-Rubin, or at the very least with interval coverage. And look at a range of models. @betanalpha has put up a bunch of vetted examples on the evaluation repo in stan-dev on GitHub.

There are a lot of people working on approximate HMC. We’re taking a rather different tack on approximation algorithms and looking at variational inference, EP, and max marginal posterior mode.

Good thoughts, especially the C-G-R testing.

I’ve noticed that vb rarely does a good job of estimating the models I have trouble with. What it does seem to do is get me close enough that it provides a good source of parameter initializations, so then the first 5 to 15 iterations of Stan aren’t out in the “nitroglycerine” range of potential energies.

Typically, during debugging, where I want really fast turnaround to tell me if I did something stupid, if I do vb vs I ask Stan to give me 15 or 30 iterations, the 15 or 30 iterations of Stan tell me what’s up better than the VB. Basically just a single sample (or 3 to 5 effective samples maybe) from the actual typical set is often what I really want during debugging. Unfortunately maximization gives me something that’s not in the typical set, and VB typically doesn’t get me into the typical set either… approximate HMC that just moves around near the typical set would be way better than vb in most of my cases.

Are you using the sampling output of VB or just the means? The posterior mean is often not in the typical set whereas a draw from the approximate posterior centered at that mean would be.

Yes, I’m using the sampling output. It makes for fine initialization points, but it can often be very very biased for some important parameters, and because of that it doesn’t tell me whether my model is working well.

My typical method these days is:

1. using random starting points, run VB until it converges
2. Take 4 random draws from VB as my initial vectors for Stan
3. Run Stan for ~ 50 to 200 iterations starting at inits in (2), watch the CSV files to see when LP settles down to a stable value, make sure it has settled down before sampling starts, otherwise extend the run.
4. use the samples from Stan to plot lots of summary plots about my model to see if it is making sense.
5. Decide what changes to make, and repeat.

Eventually if I have a model that I think makes sense, I extend step (3) to run for 500, 1000, or 2000 samples with longer treedepth (10-15), which can take 10 to 40 hours on some of these models. Even when I do this long warmup and run with deeper treedepth, and high adapt_delta, I can get divergences (for example current model uses a bunch of simplexes and has some divergences).

If I could get something to sample a simpler posterior, and then down-sample this simpler posterior to a sample from my real posterior, and thereby avoid divergences and run for shorter treedepth and soforth… it’d be a big help.

Here’s how the current model that is giving me fits works:

It’s estimating how people in the US actually consume alcohol. It uses a CDC / Census dataset sample of 2000 telephone interviews downsampled from a 500,000 or so interview sample. It models each person’s behavior as a dirichlet vector of 7 entries corresponding to frequency of: 0,1,2,3,4,5,6+ drinks per day, plus a quantity for each person describing the amount they drink on the 6+ days.

Then it has some global parameters that describe measurement errors / biases / etc so that you can predict how a person will answer the questions based on their underlying behavior vector.

Each person’s dirichlet vector gets a mixture prior over 7 basic modes of drinking behaviors “very light, light, medium, binge, heavy, very heavy, other”, the mixture weights are given their own dirichlet vector with a simple dirichlet prior.

I derive an average consumption per day for each person in the transformed parameters, and then I have a global sales per person per year estimate from a separate survey and the overall average across my sample of 2000 enters into a likelihood for observing this global sales number.

The end result is that I get a behavior estimate for each person in the sample. The fact is, I don’t really care about individual behaviors that much though, I care about the distribution of consumption across the 2000 people accounting for known measurement issues. A big problem with all of this is that people are known to do a very poor job of estimating their own alcohol consumption, and some people misunderstand the question in obvious ways, also some people report inhuman alcohol consumption (say 70 drinks a day).

Without a measurement error model you go very far wrong by just believing people’s responses, but with a measurement error model it’s still hard to explain the disparity between survey answers and national sales data.

The fact is, though, that inaccuracy in estimating a few of the behaviors, say 10 or even 50 of the people out of 2000 really doesn’t make much difference in the final results. Graphs all look the same qualitatively, and overall averages and percentile curves and etc all have similar shapes. But, of course with so many weird behaviors (1% of people really do drink basically every day, and an average of around 12 drinks a day! whereas huge numbers of people drink 0 drinks between 20 and 30 days a month. ) Inevitably some of these dimensions will have tricky to sample regions. In this model, the “pairs” plot would produce a grid of 14,000 x 14,000 = 10^8 th plots so it’s not particularly useful to attack things in that way.

What really matters here is the behavior of the measurement error model. Does it do a good job of imputing something within the range of rational for each person? For the most part, just taking the posterior mean for each person can be enough, it just needs to be one-value-per-person, but corrected for measurement issues.

Anyway, I think Stan does a FANTASTIC job with models that have 5 to 10 parameters, and in those cases, you absolutely need to have exact convergence to the posterior distribution over those 5 or 10 parameters. With micro-data inferences involving thousands of parameters on thousands of individual people, each one isn’t so important, and so in some sense point-wise convergence of the distribution is a lot to ask. I’d be happy with convergence to a posterior whose high probability region is the same typical set as the typical set of my actual posterior. Hence, for example convergence to the distribution that’s uniform over the set where my real lp is within epsilon of the mean lp. The relative weights within the typical set are way way less important, because the typical set is knife-edge-thin anyway and the inferences about individuals are not all that important so errors in those inferences don’t affect the inference at the higher-level more important quantities.

If I could get a large uniform-over-the-typical set sample in very little time, sub-sampling it proportionate to the density would give me a very very good sample. But VB doesn’t do that, and Stan tries in some sense too hard, requiring each and every energy-level exploration to solve Hamilton’s equations exactly to double-float precision, which is a lot more than is necessary to get that “uniform-over-the-typical-set” sample I think.

Not that Stan isn’t doing a great job of what it’s doing, just that maybe for certain classes of high dimensional models it’s trying a lot harder than necessary given concentration-of-measure. Adding in some “wiggle room” in the exploration, in a way that fuzzes out the sample towards “uniform-on-the-typical-set” would I think require a lot less computing, but I just don’t know.

I was intrigued to read some of this, as I think some of the problems I’m interested in suffer similar difficulties. This plot shows (I assume!) the overshoot issue - here I ran with a treedepth of 6, letting the sampler fall back from the overshoot much more quickly…

Yes, clearly it has huge momentum initially, but at each momentum proposal, it’s like a thermostat, putting the kinetic energy back in equilibrium, at treedepth 6, the re-proposals occur more frequently in terms of per function/gradient eval. but still, it’ll have traveled up into the potential well too far. With treedepth of 15 you could imagine that that ringing could occur for 32000 time steps before the momentum is resampled, and then continue again for 32000 time steps, and etc etc, getting to iteration 200 might take a day. If this is 1200 iterations at treedepth 6, then it’s something like 77,000 total function evals. At treedepth 15 it’d blow through that computational budget by iteration 2

Also note how the typical set is more or less something like lp__ in [-6250,-5250]. If you ran this as one long trajectory with momentum diffusion, and a thermostat that keeps lp__ from going outside that band, you’d wind up with more uniform sampling on that band, but if you downsampled this long trajectory proportional to p(x) you’d wind up with the right distribution of lp.

Is this a high dimensional model? How many parameters does it have?

Yes, it’s a hierarchical state space model / linear sde, with about 4000 pars…

Was the final inference sensible? How many neff do your most important parameters have with this run? what is the distribution of neff across parameters?

Also note, from the physical analogy, the number of parameters is like the number of molecules. So someone who has a 3 or 7 parameter system would get a tiny little bang like a firecracker, whereas someone who has 7000 parameters would get a bang like 1000 firecrackers all going off at once. Furthermore. The size of the firecracker is inversely related to how well identified the parameters are. If your model is very well informed, then any starting point you pick randomly is deep in the tails, so very large initial potential energy.

The people i’d expect to have the big problems with this issue are those with large numbers of parameters who have very good models that well identify many of the parameters. In that case, starting at a random initialization point it’s like having 7000 large bombs go off all at once, and it’s no surprise that Stan wants to simulate that using tiny time-steps.

No, wasn’t sensible, would need to run longer. Some of the important parameters (ie, the population parameters) are the slow ones. A run with 50 subjects and 10000 iterations had neffs of 500+, and gave seemingly reasonable inferences - just a couple of hundred divergences that I’m still not sure how to deal with, or whether I should really worry, as they persist with delta=.99 and I’ve run out of reparameterizations…
50 subjects instead of 200 shows similar but less pronounced behaviour, and seemingly faster mixing, which surprised me… (first 8 iterations cropped)