Hamiltonian, WHY?

Dear Stan Community,

Here comes yet another wall of text, born of my frustration in trying to understand “HMC”.

So, here it comes. BUT !

BE ADVISED :

it might not make much sense (it might sound like mumbo-jumbo/throwing around ideas/looking for inspiration).

So, WHY HAMILTONIAN ???

Why not something else ? Say the acceleration is a=(F/m)^{0.90} ? => This is not compatible with energy conservation. But why conserve energy at all ? In Metropolis MCMC it is T that is constant, not the energy, FOR EXAMPLE, so it REALLY would not matter at ALL (as I “explain” in the next paragraph)!

So, the problem is worse than just some crazy person’s crazy rambling around.

If we use the “wrong Newton equation” in our leapfrog integrator (a Molecular Dynamics type of algorithm; I did a lot of that, hence my obsession with this topic), then we are not going to conserve energy. Big deal. We can calculate the new energy and accept or reject it according to the traditional Metropolis method.

Ok, just to make it clear what I am talking about: Metropolis in itself is “worthless” - let’s be a bit “drastic” (that is not literally true, but it is not practical => worthless, almost; if Stan had only Metropolis then there would be no Stan community).

So, then HMC is basically - as far as I and I OVERSTAND :) - Metropolis + microcanonical (energy conserving) MD (molecular dynamics).

So, WHY WHY WHY WHY WHY ???

Why is MD such a big deal ??? Why is the Energy (defined by the Hamiltonian) such a big deal ? Why HAMILTONIAN at all ???

I am pretty sure the answer is in classical mechanics, buried somewhere: why is the Hamiltonian formalism so successful ? Why not just use the forces and Newton’s equations ?

BUT OK, In physics, Hamiltonian is “good” because “energy conservation is a fundamental law”, time translation symmetry, etc, AND WAY BEYOND THAT, Hamiltonian in physics is “THE GOD”.

BUT IN HMC ? For Bayesian inference ??? WHY ???

I mean, HMC was originally invented for physical systems, to calculate physical, macroscopically measurable thermodynamic variables. It had nothing to do with Bayesian inference.

So… why HAMILTONIAN ? Does it really have to be Hamiltonian ?

If yes, WHY ?

:) :) :) :) :)

I have the feeling that the answer is super simple, and most likely it has something to do with weakly interacting systems and perturbation theory (in other words, the statistical problems can be thought of as some sort of physical system with independent components which interact weakly; then the Hamiltonian makes a lot of sense - due to perturbation theory - but this breaks down in strongly interacting systems… so… ??? … bla … :) ).

(But this is just a guess, which I “deduced” intuitively from Turing Machine/Kolmogorov Complexity/MDL/Solomonoff type of arguments related to Occam’s razor…)

OK…

In case the answer is obvious to someone, please let me know.

But the question is - in essence - WHY the HAMILTONIAN ? For a problem where there is no HAMILTONIAN which describes the observed data points ??? :) :) :)
See ?

Have a good night :)

I feel very confused.

Cheers,

Jozsef

You might want to reread the question, the Gibbs sampler doesn’t come up. The actual question is more interesting than the one you’re suggesting and we try to keep discussion friendly.

@Jozsef_Hegedus you will probably get a better answer if you can narrow this question down. In the current algorithm the physical simulation is what gets you a trajectory to sample from. It’s only an approximation of the Hamiltonian and because of that error we do exactly what you say (metropolis step). So sure, you could do something else, and if you got it to meet the required conditions it would give samples from the posterior… but the error would be bigger, the algorithm would do more of a random walk, and you’d throw out more CPU time by rejecting. It’s a complex system and there might be room for tradeoffs that make another algorithm come out ahead, but when it comes to exact inference Stan’s killer feature is the implementation and math lib, so it’s more of a “why not the Hamiltonian?” question.
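
To make the “simulate a trajectory, then correct with a Metropolis step” description concrete, here is a minimal Python sketch. It assumes a 1D standard normal target, a unit mass, and a fixed trajectory length; it is plain textbook HMC, not the adaptive sampler Stan actually runs, and all function names are made up for this illustration.

```python
import numpy as np

def neg_log_density(q):
    # Potential energy U(q) = -log p(q) up to a constant (standard normal target).
    return 0.5 * q * q

def grad_neg_log_density(q):
    # dU/dq for the standard normal target.
    return q

def leapfrog(q, p, step_size, n_steps):
    # Approximate the Hamiltonian flow of H(q, p) = U(q) + p^2 / 2.
    p = p - 0.5 * step_size * grad_neg_log_density(q)
    for i in range(n_steps):
        q = q + step_size * p
        if i < n_steps - 1:
            p = p - step_size * grad_neg_log_density(q)
    p = p - 0.5 * step_size * grad_neg_log_density(q)
    return q, p

def hmc_sample(n_samples, step_size=0.2, n_steps=10, seed=0):
    rng = np.random.default_rng(seed)
    q = 0.0
    samples = np.empty(n_samples)
    n_accept = 0
    for i in range(n_samples):
        p0 = rng.normal()  # fresh momentum for every trajectory
        h0 = neg_log_density(q) + 0.5 * p0 * p0
        q_new, p_new = leapfrog(q, p0, step_size, n_steps)
        h1 = neg_log_density(q_new) + 0.5 * p_new * p_new
        # Metropolis step: accept with probability min(1, exp(h0 - h1)),
        # which corrects for the integrator's energy error.
        if rng.uniform() < np.exp(min(0.0, h0 - h1)):
            q = q_new
            n_accept += 1
        samples[i] = q
    return samples, n_accept / n_samples

samples, accept_rate = hmc_sample(5000)
print(accept_rate, samples.mean(), samples.var())
```

If the integrator were exact, h1 would equal h0 and every proposal would be accepted; the rejections only pay for the discretization error.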


It’s only an approximation of the Hamiltonian and because of that error we do exactly what you say (metropolis step).

^^^^ this is “interesting” ====> I HAVE NO IDEA ABOUT STAN’S INTERNALS (I only have some basic physics “intuitions”), so what follows here is “the first thing that comes to my head” as to why I say: this is interesting. (So what comes might be a bit too speculative, but at least on more or less “solid ground”.)

So, the quoted part is “sort of” interesting. However, the whole point of the story (afai understand the story in “solid state physics”/“particle etc… physics”) is to calculate some expectation value - for example pressure, magnetization, number of particles, or magnetic susceptibility (“a tricky one”).

Usually the mean kinetic energy gives the temperature, so if you wanted to do a constant temperature simulation then you would 1) use a “Nosé-Hoover” thermostat (heat bath) or 2) “brute force” rescale the velocities every 100 MD steps to get the constant temperature.

AFAIK, the Metropolis is used for the following situation:

Constant number of particles (degrees of freedom/variables), constant temperature, constant volume (the “size” of the non-momentum phase space is constant => this is not true for Stan afai “imagine” - this is something to think about, btw, maybe the “prior” sets the constant volume - in a “soft” manner - in other words - the prior acts like a “wall”).

Of course, in the thermodynamic limit, whether you set the temperature or keep the energy constant (isolated system), it does not matter in physical systems which one is the dependent variable.

The two approaches : 1) keeping constant the energy or 2) keeping constant the temperature are statistically the same (in macroscopic systems).

In practice, however, it is “easier” to turn on the heating in a room and kick around the atoms, than to wait and pray, chanting “please atoms, raise your energy vibrations”.

So, this is the reason for the existence of the Metropolis algorithm: to be able to see how a system behaves if the temperature is set to X kelvin. You can measure that easily in reality, with a $1 piece of glass containing some liquid metal, but not the energy. Hence the Metropolis algorithm was invented by Metropolis, Ede Teller and his wife, and Arianna and Marshall Rosenbluth.

So this sentence that I quoted means that you do set the temperature in the Metropolis step, as they do in “plain” Metropolis too. Which tells me that your intention is to: “keep the temperature constant”.

Leapfrog (or some other version of it) is pretty stable - I have never noticed any problem with energy conservation. It is almost surprisingly stable. There is no drift in energy, only fluctuations, and I was very “shocked” to see that nobody really ever mentioned having a problem with energy conservation due to x, y, z.
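
To illustrate the “fluctuations but no drift” behaviour described above, here is a minimal Python sketch. It assumes a 1D harmonic oscillator with unit mass and unit spring constant (a toy system picked for illustration, nothing from Stan) and compares leapfrog with explicit Euler; the function names are hypothetical.

```python
import numpy as np

def energy(q, p):
    # Total energy of the unit harmonic oscillator: kinetic + potential.
    return 0.5 * p * p + 0.5 * q * q

def leapfrog_step(q, p, dt):
    # Half kick, full drift, half kick (force is -q).
    p = p - 0.5 * dt * q
    q = q + dt * p
    p = p - 0.5 * dt * q
    return q, p

def euler_step(q, p, dt):
    # Explicit Euler: both updates use the old state.
    return q + dt * p, p - dt * q

def energy_trace(step, n_steps=10000, dt=0.1):
    # Start at q = 1, p = 0 (energy 0.5) and record the energy after each step.
    q, p = 1.0, 0.0
    trace = np.empty(n_steps)
    for i in range(n_steps):
        q, p = step(q, p, dt)
        trace[i] = energy(q, p)
    return trace

leapfrog_energy = energy_trace(leapfrog_step)
euler_energy = energy_trace(euler_step)
print("leapfrog max deviation:", np.max(np.abs(leapfrog_energy - 0.5)))
print("euler final energy:", euler_energy[-1])
```

With these settings the leapfrog energy only wobbles in a narrow band near its starting value of 0.5, while the Euler energy grows without bound.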

Currently this system - to me - seems to be in vacuum, held together by some external attractive force: the prior. So if the potential due to the prior is not counted when calculating the total energy, then it is no wonder that the calculated energy (which is not the total energy) drifts. (I really don’t know the details here - but if you sample the posterior then the total energy has a contribution from the prior too.) So I might be saying some very “crazy nonsense” here, but I could imagine serious energy drifts if the calculated energy does not have a term coming from the prior. Nevertheless, the total energy is set by the temperature given in the Metropolis step - which should and DOES contain the prior part too (otherwise the prior would have no effect). Ok, I am no Stan author, but the fact that you see energy drift (so serious that you need to do a Metropolis step) could be due to a missing energy term coming from the prior.
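
For reference, in the standard textbook formulation of HMC the prior is already part of the potential energy. The block below is a sketch under assumed notation (parameters $q$, momenta $p$, data $y$, mass matrix $M$), not a description of Stan’s source code:

```latex
\begin{aligned}
U(q)   &= -\log p(q \mid y) = -\log p(y \mid q) - \log p(q) + \text{const}, \\
H(q,p) &= U(q) + \tfrac{1}{2}\, p^{\top} M^{-1} p, \\
\pi(q,p) &\propto \exp\bigl(-H(q,p)\bigr).
\end{aligned}
```

In this formulation the likelihood and the prior both contribute to $U(q)$, and the target density is $\exp(-H)$ with no separate temperature parameter, which corresponds to fixing $T = 1$.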

Just a thought, off the top of my head; I might be COMPLETELY wrong here.

Ok. :)

But now, comes the second part that is interesting: “how do you set the temperature” ?

Is there some theoretically optimal value for that for HMC for Bayesian sampling ? Or is there only some sort of “heuristic” saying that some virials should be x y z, such that all energy levels are “sufficiently touched” ?

Funnily, as a function of temperature (or interaction strength - some constants in the likelihood), phase transitions can occur. For example the mixture model type of multimodality reminds me of the 2D Ising model above the critical temperature. So this gives me the intuition that Stan is operating in the energy regime “above the strange phase transitions”.

Ok, I hope this was not too much mumbo jumbo. If yes, sorry.

From a tired and hungry cond-mat-phys/half-cs guy.

I am off to eat and such.

Let’s keep these questions in the back of our minds.

At least I will, in mine.

Last word : I am still distilling a lot of information into something more crystallized. So I hope in a few weeks I will be able to put my questions/understanding of Stan into more precise forms. Currently this is very speculative/intuitive/hungry/dreamy…

Over and out.

Cheers,

J.

JUST a SUPER QUICK thought … I write it down before “it goes out of my head”.

I think I “know” the answer : ergodicity/size of phase space/Liouville theorem/classical mechanics/“statistical mechanics”.

When you use Metropolis, then you use “ensemble” averaging.

When you use the “Molecular Dynamics” part, you are using traditional equation of motion based averaging/sampling.

The two are only equivalent due to the ergodicity “hypothesis”.

Which… as far as I can imagine you ONLY (I BELIEVE - maybe I am WRONG) get if you use the equations of motion from the Hamiltonian formalism. I have the feeling that this can be derived somehow from classical mechanics. Maybe it is called the Liouville theorem. I don’t think it is too difficult to prove that if you do not use the Hamiltonian formalism then you cannot switch between the MD and MCMC samplings: if the equations of motion are not described by the/“a” Hamiltonian, then the size of the “phase space” that the system explores “changes”, which means that the MCMC and MD parts are not sampling from the same distribution (ergodicity is violated if the size of the phase space is not constant, and that is the case if the equations of motion are not derived from the/“a” Hamiltonian). I guess this is a problem …
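
The volume-preservation part of this argument can be written out in one line. The block below is a sketch of the standard Liouville calculation, with a friction term $-\gamma p_i$ added as a purely illustrative example of a non-Hamiltonian force (the symbol $\gamma$ and the friction model are assumptions, not something from the thread):

```latex
\begin{aligned}
\text{Hamiltonian flow:}\quad
  & \dot q_i = \frac{\partial H}{\partial p_i}, \qquad
    \dot p_i = -\frac{\partial H}{\partial q_i}, \\
  & \sum_{i=1}^{n} \left( \frac{\partial \dot q_i}{\partial q_i}
      + \frac{\partial \dot p_i}{\partial p_i} \right)
    = \sum_{i=1}^{n} \left( \frac{\partial^2 H}{\partial q_i\, \partial p_i}
      - \frac{\partial^2 H}{\partial p_i\, \partial q_i} \right) = 0, \\[4pt]
\text{with friction:}\quad
  & \dot q_i = \frac{\partial H}{\partial p_i}, \qquad
    \dot p_i = -\frac{\partial H}{\partial q_i} - \gamma\, p_i, \\
  & \sum_{i=1}^{n} \left( \frac{\partial \dot q_i}{\partial q_i}
      + \frac{\partial \dot p_i}{\partial p_i} \right) = -n\,\gamma \neq 0.
\end{aligned}
```

A vanishing divergence means the flow preserves phase-space volume, so no Jacobian factor appears when a trajectory endpoint is used as a Metropolis proposal; with the friction term the volume shrinks at rate $n\gamma$, and such a factor would have to be accounted for.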

Sorry for the too “hand-wavy” description. Maybe I got some parts of this argument wrong; this just sprang to my mind as I was walking around in the kitchen… I think I have answered the question (at least for myself) - but I might be wrong.

I wrote it down here quickly - in case someone might be interested in an answer (which might be wrong / too hand-wavy / too walking-around-in-the-kitchen-y ).

Thanks for the attention and comments so far.

Over and out.

Cheers,

Jozsef

Please consider stepping back and reading through the relevant literature before spending too much time speculating,

The physical analogy is something of a red herring and shouldn’t be taken too seriously until one better understands the formal mathematical equivalences.


Thanks for the references !

I was thinking I should start at the roots:

"The method is especially efficient for systems such as quantum chromodynamics which contain fermionic degrees of freedom. "

After all it all came from QCD, why… I don’t know. (Funnily the fermionic wavefunction is antisymmetric which is related to the exchange degeneracy - kind of a big deal in physics - ironically also in the mixture distributions - somehow there are frighteningly many such “resemblances” - I wonder why.)

Why the Hamiltonian: that is sort of intuitively clear to me now. I was even thinking that it is almost embarrassing that it was not immediately obvious - I mean, not after 15 minutes of thinking, given my education in physics. Why did this answer take me so long to come up with ? Embarrassing.

Also my classical mechanics is a bit rusty, that is a (much) bigger problem I think.

The physics analogy has roots in physics, more specifically in classical mechanics. Why does the Hamiltonian formalism even make sense?

This is the real question. Why the Energy? Why not simply Newton’s equations of motion ? At least in physics… why does HMC even work for QCD ? Why the conservation laws ?

That is the root of the question. I think my classical mechanics was never really good.

If I really want to understand what’s going on (at least in the original HMC) then the first thing I should do is to re-read classical mechanics, things like the pendulum :

https://en.m.wikipedia.org/wiki/Generalized_coordinates.

That should explain why the Hamiltonian formalism exists even in the first place, without that there is no statistical mechanics, without that there is no MC, without that there is no HMC.

Back to the basics, back to the roots.

I think Feynman did mention these topics in the Messenger lectures (at least very briefly) :

why one is better than the other.

No free lunch. No matter what. Hamiltonian or not. The law of “problem conservation” - as I was first taught 20 years ago - in a classical mechanics exercise session - again - ironically - who would have thought ?

I would not throw away the physics analogy; that is where all this stuff is coming from, and there might be some nice intuition there. How old is the Hamiltonian formalism? I don’t know off the top of my head, but if it is still around then it might be worth something, especially if HMC is named after it. Like classical music - still around after a few hundred years. Probably not without a reason.

HMC is the most relatable angle - for me - in the ML “storm”. Other ML “methods” are some form of approximation to the Bayesian approach. It is almost a mystery why HMC even works - and when it works - why it works. Nevertheless, the papers on HMC try to answer these questions, which is exceptional in today’s ML “storm”.

Respekt !