Real-world Hamiltonian vs. artificial Hamiltonian for modelling the corresponding real-world problem

WARNING !

  • “hand-waving arguments”
  • “speculations”
  • “half awake/asleep/trance state/hot-shower induced intuitions”
  • “3rd year, second semester, 2nd lecture”-grade statistical physics concepts
  • “1st year, second semester, 5th lecture”-grade classical mechanics concepts
  • all the “basic ML/CS” PhD-level stuff as well, the Bishop book and friends

are ahead !!!

This is DANGER ZONE :)

THIS WILL (most likely) make no real SENSE, on PURPOSE.

— YOU HAVE BEEN WARNED :) :)

Recently I came to a “deep” realisation about how “ML”/Bayesian inference is connected to Hamiltonian mechanics (https://physics.stackexchange.com/questions/89035/whats-the-point-of-hamiltonian-mechanics/477966#477966): through phase space, information theory, and, most importantly, INDEPENDENCE.

I am writing this post because the above link seems to confirm my intuition that there might be some really awesome insight lurking here which is not obvious to me, but maybe obvious to some of you? I do hope so. Hence this post. Please enlighten me :)

Now, this Stack Exchange question makes me unable to let go of the following question: what is the optimal choice for a Hamiltonian (let’s call it H_{MCMC}) used for the “MCMC part” for a system which is described by a “real-world Hamiltonian” (let’s call it H_{real-world})? (x_i=1 <== checking LaTeX compatibility)

Given H_{real-world}, how can I find the optimal H_{MCMC} that “solves a Bayesian inference problem” on data which were generated by a dynamical system (let’s denote it S_{real-world}) whose equations of motion are defined by H_{real-world}, and whose samples were taken according to the “ergodicity principle” and/or the “replacing ensemble averages by time averages” idea?

But for now, let’s stick to the microcanonical (constant energy) ensemble.

I have the “feeling” that knowing the underlying Hamiltonian of the to-be-modelled REAL-WORLD system matters, since such a system is ultimately dynamic in nature (hence ALL data are dynamic in nature, no matter whether they were generated by a Turing machine or by “the real world”).

So my feeling is that knowing the equations of motion for the real-world problem could provide some hints about the “optimal Hamiltonian / sampling / whatnot” for the Hamiltonian used in the actual Stan calculation, where the data are simply points in the phase space of a microcanonical ensemble with N degrees of freedom.

BIMMM !!!

I don’t expect any real answers, just “gut feelings”, “speculations”, “collaborative daydreaming”. Just the typical conference discussion after a few cookies / beers in Amsterdam after the conference dinner.

Out.

J.


I’d strongly suggest reading Michael Betancourt’s “Conceptual introduction to Hamiltonian Monte Carlo” (on arXiv).

If you’re going to stick with the negative log density as the potential, you only have the freedom to change the kinetic energy distribution. There’s a really nice paper on this which follows naturally from Betancourt’s:

Then you can complete the set by reading Livingstone and Betancourt’s paper on geometric ergodicity :-)


Hi Bob,

Thanks, indeed. That is true.

Also, it is interesting that this is a very recent paper!

Hmm, geometric ergodicity … wow, these are very addictive papers to a guy with a theoretical condensed-matter physics background.

It is difficult to resist getting deeply lost in them.

Of course, a “stupid” question, but the potential energy can also be expressed in “many forms”, depending on the choice of “coordinates”.

Somehow HMC seems to link “physics” to “ML” very strongly.

It is very difficult to resist getting drawn into these thoughts too deeply. Nevertheless, I have the feeling that this connection is somehow very … “underrated”? If there is any connection at all…

Nevertheless, I need to take this line of thinking a bit easy, but somehow I have the feeling that HMC turns “ML” problems into “stat-phys” problems.

I am pretty sure that after reading the literature on this, it will turn out that people have thought about this a lot already.

:)

Ok, enough hand-waving. Thanks for the tip @Bob_Carpenter !

These papers are keepers.

And I need to watch some lectures on classical mechanics … again… I took that course in 1999, I think, and tbh I did not get the point :( - maybe now I will :)

Cheers,
J.

Thanks @maxbiostat ! J.

I had a super quick look, actually just a search: I searched for “lagr” in all three papers. To my surprise - no match. Well, maybe this is also a question: what does the “Lagrangian formalism” mean (or not mean) for HMC, if it means anything? I am sure there is plenty of literature on this too already, but who knows.

It’s always nice to pull some techniques from one field to the other… “if it fits”. After all, there is “nothing new under the sun”. :) Or there is.

Maybe I will look into this one day. Or ask a few people who are not on this forum, if/when I meet them.

Just a quick random thought. Maybe someone will find it interesting.

These are very “difficult to let go” questions. :)

Ok, I will put some thoughts here that are related to the original question. I don’t want to pollute the forum with all sorts of threads, so I will just put all my “random” thoughts related to the connection to statistical physics here. Maybe something nice comes out of that at some point.

Anyway, I just want to clarify one thing: in STAN (or in HMC in general), the potential energy is the negative log likelihood, right (assuming a totally flat prior)?

I mean, if I want to estimate the mean m of a random variable X that has a Gaussian distribution
P(x<X<x+dx)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(m-x)^2}{2\sigma^2}}dx,
with \sigma=1 (the mean m is to be estimated), and I have one single measurement point: x_1.

Then the likelihood is (proportional to):

e^{-\frac{(m-x_1)^2}{2\sigma^2}}.

So the log likelihood is (proportional to):

-(m-x_1)^2.

So if the negative of this, (m-x_1)^2, is the energy, and we use the MRRTT (Metropolis–Rosenbluth–Rosenbluth–Teller–Teller) method:

then a Monte Carlo move is always accepted if it decreases the energy, and is accepted only with probability P=e^{-\frac{\Delta E}{k_B T}} if it increases the total energy of the system.
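
To convince myself, here is a minimal sketch of that recipe (my own toy code, nothing to do with Stan internals; the data point x_1 and the step size are made up for illustration):

```python
import numpy as np

# Toy random-walk Metropolis for the mean m of a unit-variance Gaussian,
# given a single observation x1. The "energy" is the negative log likelihood,
# U(m) = (m - x1)^2 / 2, and we take k_B * T = 1.
rng = np.random.default_rng(0)
x1 = 1.3  # hypothetical single data point

def energy(m):
    return 0.5 * (m - x1) ** 2  # negative log likelihood, up to a constant

m, samples = 0.0, []
for _ in range(10_000):
    m_prop = m + rng.normal(scale=0.5)        # propose a move
    delta_E = energy(m_prop) - energy(m)
    # always accept if the energy decreases; otherwise accept with prob e^{-dE}
    if delta_E <= 0 or rng.uniform() < np.exp(-delta_E):
        m = m_prop
    samples.append(m)

print(np.mean(samples), np.std(samples))  # should land near x1 and 1
```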

So, based on this, it seems to me that if STAN is using the negative log likelihood as the energy, then it is simulating a physical system at constant temperature.

Right ?

Now comes the funny part: why is the negative log likelihood the “energy”? It could be anything else…

So if physics is a red herring (@betanalpha - I don’t know what a red herring is), then why is the negative log likelihood such an important business (the resulting weight e^{-E} is basically the Maxwell–Boltzmann distribution)? My answer is that it comes from entropic considerations: if the only thing we know is that the total energy of a system is constant (which is what the Hamilton equations aim to model), then the maximum entropy principle forces us to use the Maxwell–Boltzmann distribution - in other words, to use the negative log likelihood as the energy.
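
To spell out the entropy argument I have in mind (my own back-of-the-envelope sketch; here the constraint is a fixed average energy, i.e. the canonical setting):

\max_{p}\; S[p] = -\int p(x)\log p(x)\,dx \quad\text{subject to}\quad \int p(x)\,dx = 1,\;\; \int E(x)\,p(x)\,dx = \bar{E}.

Setting the functional derivative of the Lagrangian S[p] - \alpha\left(\int p\,dx - 1\right) - \beta\left(\int E\,p\,dx - \bar{E}\right) to zero gives

p(x) \propto e^{-\beta E(x)},

the Maxwell–Boltzmann form; with \beta = 1 and E(x) equal to the negative log likelihood, the stationary distribution is the likelihood itself.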

I find this connection to physics/entropy/information theory very interesting. I am not sure how obvious this is to people in the STAN community, but maybe it is, maybe not.

So, funnily, using the negative log likelihood as the energy has been the optimal approach, since before 1953, due to entropic considerations. Now the question is: how important is this when it comes to Bayesian modelling in general (and when using HMC to “do” Bayesian modelling in practice)? I would say it is essential, since using the negative log likelihood as the energy is the optimal choice due to entropic (maximum entropy) considerations.

I mean, maybe this is obvious to everybody in the world except me, or maybe not. If it is, then I am a bit “too slow” in understanding things, and that’s fine; but if it is not, then is this interesting to the STAN community - to understand why the energy is the negative log likelihood? Is it useful in practice to understand this when trying to solve statistical problems with Bayesian approaches (especially with MCMC)? I would think yes.

Anyway, I am just putting down these thoughts as I try to understand MCMC/STAN. I might be talking about very trivial things, since I am a total newbie to MCMC, but just in case… I am putting this simple thought out there.

Anyway… no big deal at all, no rush, just food for thought… maybe I am talking about something that is like 1+1 to everybody in this community and for me it was a big AHA moment :)

I hope I am not boring you to death with these physics-to-MCMC-and-back connections.

Have a good summer,

Cheers,

J.

It’s the negative log density—doesn’t matter where the density came from. For us, it’s usually a Bayesian posterior (log likelihood plus log prior).
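
To make that concrete for the single-observation example above, a quick sketch (the N(0, 10) prior is made up just for illustration):

```python
from scipy.stats import norm

# The potential is the negative log of the (unnormalised) posterior density.
x1 = 1.3  # the single observation from the example above

def potential(m):
    log_lik = norm.logpdf(x1, loc=m, scale=1.0)      # likelihood term
    log_prior = norm.logpdf(m, loc=0.0, scale=10.0)  # illustrative N(0, 10) prior
    return -(log_lik + log_prior)

print(potential(0.0), potential(x1))
```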

You really want to read @betanalpha’s papers on arXiv for the connections to physics and an explanation of why things have to look the way they look.

His paper on adiabatic Monte Carlo gets into using different temperatures (heat bath style, not simulated annealing style, to preserve the physical analogy).

You should also pick up Livingstone, Faulkner and Roberts’ paper on the kinetic energy distribution, as that gets at some of the issues you’re asking about.

“Stan” is not an acronym.


Thanks Bob, “why things have to look the way they look.”

Yes. That is the super interesting question; it is just too much of a coincidence that the way things look is so useful for calculating such integrals. There must be something deeper there, in the sense of “no free lunch”.

Thinking about these things gave me a bit better intuition about Bayesian inference in general, especially about how and why “Stan” (HMC) works at all, or, for that matter, MRRTT to begin with.

Also, in particular, thinking about these things helped me understand, in a matter of minutes, a paper on replica analysis of Bayesian data clustering that I could not even begin to comprehend a year ago, after hours of thinking.

Bottom line, understanding and knowing about Stan, in itself, is very, very helpful in understanding Bayesian “inference” in general.

Bayesian “inference” becomes “more real” once one starts to contemplate the connection to physics and the whys. And the other way around: the mysteries of statistical physics get clearer too, such as why it is the Boltzmann distribution, and not something else, that describes the rates of chemical reactions. It is connected to the same “why is the negative log likelihood the energy” type of question. I find these connections fascinating.

Maybe after a few years of reading the papers you suggested, Bob, and some other papers, I will be a bit better “intellectually equipped” when it comes to these questions.

This direction (understanding Stan and its connection to physics) seems to make a lot of sense to me, pushing me towards a place where I can make a tiny bit of sense out of “Machine Learning” and its friends. I am happy that others have thought about these questions before and wrote papers about them - for example @betanalpha 's papers, or maybe even more basic ones (which ones are more basic is perhaps also a matter of “taste” / “history of education”), like:

Thanks for the links, Bob. Let’s see how I see these things once I become more “intellectually equipped” in the coming months/years, but even learning about Stan pushed my general understanding of Bayesian inference light-years forward. Stan is the Latin of ML :) , except that it is still alive and kicking, unlike Latin.

Cheers,

Jozsef

You mean the total energy distribution (kinetic + potential)?


Also, one more interesting thing I came across: the “reason why Stan exists” is entropy :) .

How did I come to this conclusion? I watched this video on black holes: https://www.youtube.com/watch?v=2DIl3Hfh9tY .

Physics and “ML” are pretty strongly connected; I keep rediscovering this every time.

As far as I understand, Stan averages out “stuff” and “loses” information (when it does the integration), just like in thermodynamics / statistical physics. According to the Susskind lecture, losing information is the same as generating entropy.

The point in stat-phys (and in probability theory in general, including Bayesian, IMHO) is to get rid of information in a controlled manner and to use the remaining information “in a correct” / “best possible” way. This happens in Stan: the remaining information shows up as (is stored in) the posterior distribution (the information that was not thrown away).


The maximum entropy principle is, for example, a way to derive the Gaussian distribution. Not the law of large numbers - maximum entropy. I know only that a distribution has a given mean and a finite second moment; then I look for the distribution that maximises entropy, and I get the Gaussian distribution. Quite a deep principle, and nicely connected to physics and information theory. I like this derivation better than deriving the Gaussian by adding up a lot of random variables, which does not really come from anywhere “physical” / “real” - I mean, where do those added random variables come from in the first place? Then we are back to square one: a question not answered, no underlying principle or physics or process or anything, “just” random variables. Wow!
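
For the record, the variational problem I mean (my own sketch):

\max_{p}\; -\int p(x)\log p(x)\,dx \quad\text{subject to}\quad \int p\,dx = 1,\; \int x\,p\,dx = \mu,\; \int x^2\,p\,dx = \mu^2 + \sigma^2 .

The stationarity condition gives \log p(x) = -1 - \alpha - \lambda_1 x - \lambda_2 x^2, i.e. p(x) \propto e^{-\lambda_1 x - \lambda_2 x^2}, and matching the constraints yields exactly the Gaussian \mathcal{N}(\mu, \sigma^2).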


What I am saying here is that it is maybe worth keeping the physics connection in mind - just a feeling - but if you like to watch the Messenger Lectures then you might get this feeling too, after watching a few of Susskind’s Messenger Lectures (which I was watching yesterday morning).

As Gauss realized, the normal distribution is also the one for which the log density is proportional to squared distance in the Euclidean metric determined by the inverse covariance matrix.

If the potential is fixed, then the kinetic energy determines the total energy.
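
Spelled out, with the usual Euclidean-Gaussian kinetic energy as the default choice:

H(q, p) = U(q) + K(p), \qquad U(q) = -\log \pi(q), \qquad K(p) = \tfrac{1}{2} p^\top M^{-1} p + \text{const},

so for a fixed target \pi(q) the remaining freedom is in K(p): the mass matrix M, or a non-Gaussian kinetic energy altogether.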

The beauty of the central limit theorem is that it doesn’t matter where the added variables come from, as long as they’re independent (or form a Markov chain, in the MCMC CLT).

Thanks Bob,

I need to think about these a bit; I cannot comment on them right off the top of my head, but they seem to be interesting (and deep) points.

On the side, I just watched a video about hierarchical models / vision / predicting the world / how the brain works / generative models.

I think Stan people would find it interesting, so I am posting it here in this thread, because I don’t really want to pollute the forum with too many different threads on “interesting things that relate to Stan”; this video could be interesting to people who have not yet seen it and who use Stan for “something” :).

Cheers,

Jozsef

Actually, thinking about this story some more: one can imagine a physical system where each random variable is a particle and each random variable feels a potential.

(If the potential is x^2, then the random variable’s distribution is Gaussian, and this corresponds to a harmonic oscillator - or a pendulum whose amplitude is small compared to the length of the rope, so that the force is, to first approximation, linear in the coordinate x.)

If the random variables are independent, then their log likelihood is additive. This goes very nicely with independence in physics: two particles don’t feel each other if there is no interaction between them, and the total energy of the system is just the sum of single-particle energy terms.

I am starting to have the feeling that, for example, for the simple harmonic oscillator, the real-world Hamiltonian and the “artificial Hamiltonian” are the same. There might be some formal correspondence between the two - something like “conjugate”, “dual”, “Fourier transform”; those kinds of things come to mind (via free association). The Gaussian distribution is pretty special - its Fourier transform is itself - so maybe the two Hamiltonians only have the same form for this particular case (the harmonic oscillator).
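
A toy sketch I put together to convince myself (hypothetical code, nothing to do with Stan internals): for independent standard-normal “particles” the HMC potential is a sum of quadratic, i.e. harmonic-oscillator, terms, and a leapfrog trajectory under H(q,p) = U(q) + p·p/2 conserves the total energy, just like the physical oscillator does.

```python
import numpy as np

# Independent N(0,1) "particles": the potential is a sum of single-particle
# quadratic terms, exactly the harmonic-oscillator form.
def potential(q):
    return 0.5 * np.sum(q ** 2)

def grad_potential(q):
    return q  # dU/dq for the quadratic potential

def leapfrog(q, p, eps=0.1, n_steps=30):
    """One trajectory under H(q, p) = U(q) + p.p/2 (unit masses)."""
    p = p - 0.5 * eps * grad_potential(q)
    for _ in range(n_steps - 1):
        q = q + eps * p
        p = p - eps * grad_potential(q)
    q = q + eps * p
    p = p - 0.5 * eps * grad_potential(q)
    return q, p

rng = np.random.default_rng(1)
q, p = rng.normal(size=3), rng.normal(size=3)
H0 = potential(q) + 0.5 * p @ p
q1, p1 = leapfrog(q, p)
H1 = potential(q1) + 0.5 * p1 @ p1
print(H0, H1)  # total energy is (approximately) conserved along the trajectory
```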

Anyway, the main point here is the nice connection between independence in physics and independence in the probabilistic sense. I wonder whether this has some nice information-theoretic angle to it? I think yes, so I will put it into my subconscious to work on.

Hmm… ok. Just a few thoughts. Inspired by your comment, Bob.

Over, out.

J.