Real-world Hamiltonian vs. artificial Hamiltonian for modelling the corresponding real-world problem

WARNING !

  • “hand-waving arguments”
  • “speculations”
  • “half awake/asleep/trance state/hot-shower induced intuitions”
  • “3rd year, second semester, 2nd lecture”-grade statistical physics concepts
  • “1st year, second semester, 5th lecture”-grade Classical Mechanics concepts
  • all the “basic ML/CS” PhD level stuff as well, Bishop book and friends

are ahead !!!

This is DANGER ZONE :)

THIS WILL (most likely) make no real SENSE, on PURPOSE.

— YOU HAVE BEEN WARNED :) :)

Recently I came to a “deep” realisation of how “ML”/Bayesian inference is connected to Hamiltonian mechanics ( https://physics.stackexchange.com/questions/89035/whats-the-point-of-hamiltonian-mechanics/477966#477966 ). The connection runs through phase space / information theory and, most importantly, INDEPENDENCE.

I am writing this post because the above link seems to confirm my intuition that some really awesome insight might be lurking here. It is not obvious to me, but maybe it is obvious to some of you??? I do hope so. Hence this post. Please, enlighten me :)

Now, this Stack Exchange question makes me obsessed with not letting go of the question: what is the optimal choice of Hamiltonian (let’s call it H_{MCMC}) used for the “MCMC part”, for a system which is described by a “real-world Hamiltonian” (let’s call it H_{real-world})?

Given H_{real-world}, how can I find the optimal H_{MCMC} that “solves a Bayesian inference problem” on data which was generated by a dynamical system (let’s denote it by S_{real-world}) whose equations of motion are defined by H_{real-world}, where the samples were taken according to the “ergodicity principle” and/or the “replacing ensemble averages by time averages” concept?

But for now, let’s stick to the microcanonical (constant energy) ensemble.

I have the “feeling” that it should help to know the underlying Hamiltonian of the to-be-modelled REAL-WORLD system, which is ultimately dynamic in nature (hence EVERY dataset is dynamic in nature, no matter whether it was generated by a Turing machine or by “the real world”).

So my feeling is that knowing the equations of motion for the real-world problem could provide some hints towards the “optimal Hamiltonian / sampling / whatnot” for the Hamiltonian used in the actual Stan calculation, where the data are simply points in the phase space of a microcanonical ensemble with N degrees of freedom.
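To make the “data = points in phase space of a constant-energy system” picture concrete, here is a minimal sketch (my own toy example, not from the thread): a 1D harmonic oscillator with H(q, p) = p²/2 + q²/2, integrated with leapfrog. The visited phase-space points stay (almost exactly) on a constant-energy shell, which is the microcanonical “time average” picture.

```python
import numpy as np

def leapfrog(q, p, eps, n_steps):
    """Leapfrog integration for H(q, p) = p^2/2 + q^2/2 (so dH/dq = q)."""
    traj = []
    for _ in range(n_steps):
        p = p - 0.5 * eps * q      # half-step momentum update
        q = q + eps * p            # full-step position update
        p = p - 0.5 * eps * q      # half-step momentum update
        traj.append((q, p))
    return traj

# "Data" generated by the dynamical system: phase-space points over time.
traj = leapfrog(q=1.0, p=0.0, eps=0.01, n_steps=5000)
energies = [0.5 * p**2 + 0.5 * q**2 for q, p in traj]

# The energy is (nearly) conserved along the whole trajectory:
print(max(energies) - min(energies))  # tiny compared to H = 0.5
```

The energy drift is O(eps²), which is exactly why the same integrator is the workhorse inside HMC.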

BIMMM !!!

I don’t expect any real answers, just “gut feelings”, “speculations”, “collaborative daydreaming”. Just the typical discussion after a few cookies/beers in Amsterdam, after the conference dinner.

Out.

J.

I’d strongly suggest reading Michael Betancourt’s “Conceptual introduction to Hamiltonian Monte Carlo” (on arXiv).

If you’re going to stick with the negative log density as the potential, you only have the freedom to change the kinetic energy distribution. There’s a really nice paper on this which follows on nicely from Betancourt’s:
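A minimal sketch of that freedom (my own toy code, not from any of the papers mentioned): the potential U is pinned to the negative log density, while the Gaussian momentum distribution, here parametrised by a single tunable “mass”, is the modeller’s choice.

```python
import numpy as np

rng = np.random.default_rng(0)

def hmc_step(q, U, grad_U, eps, n_steps, mass):
    """One HMC step: the potential U = -log density is fixed;
    the kinetic energy p^2/(2*mass) is a free choice."""
    p = rng.normal(0.0, np.sqrt(mass))        # momentum ~ N(0, mass)
    q_new, p_new = q, p
    # leapfrog under the chosen kinetic energy
    p_new = p_new - 0.5 * eps * grad_U(q_new)
    for _ in range(n_steps - 1):
        q_new = q_new + eps * p_new / mass
        p_new = p_new - eps * grad_U(q_new)
    q_new = q_new + eps * p_new / mass
    p_new = p_new - 0.5 * eps * grad_U(q_new)
    # Metropolis correction on the total energy H = U + p^2/(2*mass)
    h_old = U(q) + 0.5 * p**2 / mass
    h_new = U(q_new) + 0.5 * p_new**2 / mass
    return q_new if rng.random() < np.exp(h_old - h_new) else q

# Target: standard normal, so U(q) = q^2/2.
U = lambda q: 0.5 * q**2
grad_U = lambda q: q

q, draws = 0.0, []
for _ in range(5000):
    q = hmc_step(q, U, grad_U, eps=0.2, n_steps=10, mass=1.0)
    draws.append(q)
print(np.mean(draws), np.var(draws))  # roughly 0 and 1
```

Changing `mass` (or, in higher dimensions, a whole mass matrix) changes how the sampler moves, but not the distribution it targets; that is the design space those papers explore.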

Then you can complete the set by reading Livingstone and Betancourt’s paper on geometric ergodicity :-)


Hi Bob,

Thanks, indeed. That is true.

Also, it is interesting that this is a very recent paper!

Hmm, geometric ergodicity … wow, these are very addictive papers for a guy with a theor-cond-mat-phys background.

Difficult to resist getting deeply lost in them.

Of course, a “stupid” question: the potential energy can also be expressed in “many forms”, depending on the choice of “coordinates”.

Somehow this HMC seems to be linking “physics” to “ML” very strongly.

Very difficult to resist getting drawn too deeply into these thoughts. Nevertheless, somehow this connection feels very … “underrated”? That is my feeling. If there is any connection…

Nevertheless, I need to take this thinking a bit easy, but somehow I have the feeling that HMC turns “ML” problems into “stat-phys” problems.

I am pretty sure that after reading the literature on this, it will turn out that people have thought about this a lot already.

:)

Ok, enough hand-waving. Thanks for the tip @Bob_Carpenter !

These papers are keepers.

and I need to watch some lectures on classical mechanics … again… I took that course back in 1999, I think. I did not get the point then :( - maybe now I will :)

Cheers,
J.

Thanks @maxbiostat ! J.

I had a super super quick look, actually a search: I searched for “lagr” in all three papers. To my surprise, no match. Well, maybe this is also a question: what does the “Lagrangian formalism” mean (or not mean) for HMC, if it means anything? I am sure there is plenty of literature on this too, possibly, but who knows.

It’s always nice to pull some techniques from one field to the other… “if it fits”. After all, there is “nothing new under the sun”. :) Or there is.

Maybe I will look into this one day. Or ask a few ppl who are not on this forum, if/when I meet them.

Just a quick random thought. Maybe someone finds it interesting.

These are very “difficult to let go” questions. :)

Ok, I will put some thoughts here that are related to the original question. I don’t want to pollute the forum with all sorts of threads, so I will just put all my “random” thoughts on the connection to statistical physics here. Maybe something nice comes out of it at some point.

Anyway, I just want to clarify one thing: in STAN (or in HMC in general), the potential energy is the negative log likelihood, right (assuming a totally flat prior)?

I mean, suppose I want to estimate the mean m of a random variable X that has a Gaussian distribution
P(x<X<x+dx)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-m)^2}{2\sigma^2}}dx,
with \sigma=1 (the mean m is to be estimated), and I have one single measurement point: x_1.

Then the likelihood is (proportional to) :

e^{-\frac{(m-x_1)^2}{2\sigma^2}}.

So the log likelihood is (proportional to) :

-(m-x_1)^2.

So if the negative of this is the energy, E(m) \propto (m-x_1)^2, and we use the MRRTT (Metropolis–Rosenbluth–Rosenbluth–Teller–Teller) method, where a Monte Carlo move is always accepted if it decreases the energy, and accepted only with probability P=e^{-\frac{\Delta E}{k_B T}} if it increases the total energy of the system…

So, based on this, it seems to me that if STAN uses the negative log likelihood as the Energy, then it is simulating a physical system at constant temperature.

Right ?
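A minimal sketch of that claim (the observed value x1 = 0.7 is made up for illustration; σ = 1 and unit temperature kT = 1 as above): a plain MRRTT sampler with energy E(m) = (m − x1)²/2 should reproduce the N(x1, 1) posterior for the mean.

```python
import numpy as np

rng = np.random.default_rng(1)

x1 = 0.7                          # the single (made-up) observation
E = lambda m: 0.5 * (m - x1)**2   # "energy" = negative log likelihood (sigma = 1)
kT = 1.0                          # unit "temperature"

m, draws = 0.0, []
for _ in range(20000):
    m_prop = m + rng.normal(0.0, 1.0)        # symmetric random-walk proposal
    dE = E(m_prop) - E(m)
    # MRRTT rule: always accept downhill moves; accept uphill with exp(-dE/kT)
    if dE <= 0 or rng.random() < np.exp(-dE / kT):
        m = m_prop
    draws.append(m)

print(np.mean(draws), np.var(draws))  # roughly x1 and 1: the N(x1, 1) posterior
```

At kT = 1 the stationary distribution e^{−E/kT} is exactly the (flat-prior) posterior, which is the constant-temperature reading of the algorithm.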

Now the funny part comes: why is the (negative) log likelihood the “Energy”? It could be anything else…

So if physics is a red herring (@betanalpha; I don’t actually know what a “red herring” is), then why is the log likelihood such an important business (which basically gives the Maxwell–Boltzmann distribution)? My answer is that it comes from entropic considerations: if the only thing we know is that the total energy of the system is constant (which the Hamilton equations aim to model), then the maximum entropy principle forces us to use the Maxwell–Boltzmann distribution; in other words, to use the (negative) log likelihood as the energy.
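The entropic argument can be written out in a few lines (the standard maximum-entropy derivation, not specific to this thread): maximising the entropy subject to normalisation and a fixed mean energy yields the Boltzmann form.

```latex
% Maximise S = -\sum_i p_i \ln p_i subject to \sum_i p_i = 1 and
% \sum_i p_i E_i = \bar{E}, using Lagrange multipliers \alpha and \beta:
\mathcal{L} = -\sum_i p_i \ln p_i
              - \alpha\Bigl(\sum_i p_i - 1\Bigr)
              - \beta \Bigl(\sum_i p_i E_i - \bar{E}\Bigr)
% Stationarity in each p_i gives the Boltzmann distribution:
\frac{\partial \mathcal{L}}{\partial p_i}
  = -\ln p_i - 1 - \alpha - \beta E_i = 0
\quad\Longrightarrow\quad
p_i = \frac{e^{-\beta E_i}}{Z},
\qquad Z = \sum_i e^{-\beta E_i}
```

Reading E_i as the negative log likelihood and β as 1/kT then recovers exactly the “energy = negative log likelihood at constant temperature” picture.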

I find this connection to physics/entropy/information theory very interesting. I am not sure how obvious this is to people in the STAN community, but maybe it is, maybe not.

So, funnily, using the (negative) log likelihood as the Energy has been the optimal approach since before 1953, due to entropic considerations. Now the question is: how important is this when it comes to Bayesian modelling in general? (And when using HMC to “do” Bayesian modelling in practice?) I would say it is essential, since using the log likelihood as the energy is the optimal choice due to entropic considerations (maximum entropy).

I mean, maybe this is obvious to everybody in the world except me, or not. If it is, then I am a bit “too slow” in understanding things, and that’s fine; but if it is not, then is this interesting to the STAN community? To understand why the Energy is the (negative) log likelihood? Is it useful in practice to understand this when trying to solve statistical problems with Bayesian approaches (especially with MCMC)? I would think yes.

Anyway, I am just putting down these thoughts as I try to understand MCMC/STAN. I might be talking about very trivial things, since I am a total newbie to MCMC, but just in case… I put out this simple thought.

Anyway… no big deal at all, no rush, just food for thought… maybe I am talking about something that is like 1+1 to everybody in this community, while for me it was a big AHA moment :)

I hope I am not boring you to death with these physics-to-MCMC-and-back connections.

Have a good summer,

Cheers,

J.

It’s the negative log density—doesn’t matter where the density came from. For us, it’s usually a Bayesian posterior (log likelihood plus log prior).
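For a one-observation Gaussian example like the one earlier in the thread, that split is easy to show numerically (the observed value 0.7 and the wide N(0, 10²) prior on m are made-up choices for illustration):

```python
import numpy as np

x1, sigma = 0.7, 1.0          # single made-up observation, known sigma
prior_sd = 10.0               # hypothetical wide N(0, prior_sd^2) prior on m

def potential(m):
    """HMC potential = negative log posterior density (up to a constant):
    negative log likelihood plus negative log prior."""
    nll = 0.5 * (m - x1)**2 / sigma**2
    nlp = 0.5 * m**2 / prior_sd**2
    return nll + nlp

# With such a flat-ish prior, the potential is dominated by the likelihood
# term, so its minimum sits almost exactly at the observation:
ms = np.linspace(-3, 3, 6001)
m_map = ms[np.argmin(potential(ms))]
print(m_map)  # close to x1 = 0.7
```

With a genuinely flat prior the `nlp` term drops out entirely, and the potential reduces to the pure negative log likelihood discussed above.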

You really want to read @betanalpha’s papers on arXiv for the connections to physics and an explanation of why things have to look the way they look.

His paper on adiabatic Monte Carlo gets into using different temperatures (heat bath style, not simulated annealing style, to preserve the physical analogy).

You should also pick up Livingstone, Faulkner and Robert’s paper on the kinetic energy distribution as that gets to some of the issues you’re asking about.

“Stan” is not an acronym.


Thanks, Bob: “why things have to look the way they look.”

Yes. That is the super interesting question. It is just too much of a coincidence that the way things look is so useful for calculating such integrals. There must be something deeper there, in the sense of “no free lunch”.

Thinking about these things gave me a bit better intuition about Bayesian inference in general, especially about how and why “Stan” (HMC) works at all, or, for that matter, why MRRTT works to begin with.

Also, in particular, thinking about these things helped me understand, in a matter of minutes, a paper on replica analysis of Bayesian data clustering that a year ago I could not even begin to comprehend, after hours of thinking.

Bottom line: understanding and knowing about Stan is, in itself, very helpful for understanding Bayesian “inference” in general.

Bayesian “inference” becomes “more real” once one starts to contemplate the connection to physics and the whys. Also, the other way around: the mysteries of statistical physics get clearer. Such as: why is it the Boltzmann distribution that describes the speed of chemical reactions, and not something else? It is connected to that “why is the negative log likelihood the Energy” type of question. I find these connections fascinating.

Maybe after a few years of reading the papers you suggested, Bob, and some others, I will be a bit better “intellectually equipped” when it comes to these questions.

This direction (understanding Stan and its connection to physics) seems to make a lot of sense to me, in pushing me towards a place where I can make a tiny bit of sense out of “Machine Learning” and its friends. I am happy that others have thought about these questions before and wrote papers about them. For example, @betanalpha 's papers, or maybe even more basic ones (which ones are more basic is maybe also a matter of “taste” / “history of education”), like :

Thanks for the links, Bob. Let’s see how I view these things once I become more “intellectually equipped” in the coming months/years, but even learning about Stan has pushed my general understanding of Bayesian inference light-years forward. Stan is the Latin of ML :) , except that it is still alive and kicking, unlike Latin.

Cheers,

Jozsef