Why do we even need probability theory ? (Rhetorical question)

Hi Dear Reader,

I was walking in the kitchen a few hours ago and then “suddenly” a thought
struck me (which has been lurking in my subconscious for a few years now, for
sure, that is how these “sudden” realizations happen). So, here it comes. Enjoy !
Or hate it :) - just kidding, you will like it, it is solid.

Somewhat (a lot), related to Stan and “why it works”.

I have been thinking a bit about ML and why on earth it needs probability theory ?

“They say” (the voices in my head, lol, and the voices on the internet too), that entropy is
“lost information”. Yes. So why is probability theory good ?

Well, it “retains” as much information as possible and “uses it for something”.


It uses the “retained” but “almost lost” information (distribution) FOR LEARNING !!!

Ok, if this was obvious to you then please stop reading this post here. I will just bore you
to hell with it.

If not, then please hang on, the post is to be continued, but now I have to run.

EDIT - CONTINUATION (after 12 hours)

Let’s consider a classical harmonic oscillator : https://ocw.mit.edu/courses/nuclear-engineering/22-51-quantum-theory-of-radiation-interactions-fall-2012/lecture-notes/MIT22_51F12_Ch9.pdf
(this also has the quantum version - but let’s look at the large energy “limit” - classical limit).

The motion is described by X_0 and by the phase, but we only get a bunch of X-s, the phase information is lost. Now, we want to “infer” X_0.
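To make the setup concrete, here is a minimal sketch of what “losing the phase” looks like numerically. It assumes the phase is uniformly distributed on [0, 2π) and that there is no measurement noise; the names `X0_TRUE`, `position_density` and the sample size are made up for illustration. Under that assumption the positions follow the arcsine law, p(x | X_0) = 1 / (π sqrt(X_0^2 - x^2)) for |x| < X_0.

```python
import numpy as np

rng = np.random.default_rng(0)

X0_TRUE = 2.0  # hypothetical "true" amplitude, chosen only for illustration

# Assumption: the unobserved phase is uniform on [0, 2*pi).
phase = rng.uniform(0.0, 2.0 * np.pi, size=10_000)
x = X0_TRUE * np.cos(phase)  # we only keep the positions; the phase is thrown away


def position_density(x, x0):
    """Arcsine density of x = x0*cos(phase) for a uniform phase; zero outside |x| < x0."""
    x = np.asarray(x, dtype=float)
    inside = np.abs(x) < x0
    dens = np.zeros_like(x)
    dens[inside] = 1.0 / (np.pi * np.sqrt(x0**2 - x[inside] ** 2))
    return dens


# Sanity check against the arcsine law: P(|x| < X0/2) = 2*arcsin(1/2)/pi = 1/3.
fraction_inner = float(np.mean(np.abs(x) < X0_TRUE / 2))
print(f"fraction with |x| < X0/2: {fraction_inner:.3f} (arcsine law predicts 0.333)")
```

The density piles up near the turning points ±X_0, which is where the oscillator moves slowest. That shape is the “retained” information about X_0 that is left once the phase is gone.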

If we are not using “probability theory”, then we NEED to sweep over all possible X_0-s and see if the measured X-s are consistent with our assumption of X_0, if not, then we need to try a new X_0.

The problem here: if we do not take into account “probability theory”, that is, all the information that we have, to GUIDE our search for X_0, then it is going to take a LONG TIME to find X_0.

If we consider a discrete situation then this time is finite (as in the case of Stan - since computers “discretize” things). However, if we use the information “IN THE BEST POSSIBLE WAY”, then the search for the X_0 takes much less time.

So basically, we are trading “computational steps” for information, when we start to use “probability theory”.

In other words : the stupid approach to find X_0 (in a discrete “simulation”), is to try all possible X_0-s and see which one matches the experiment (perfectly).
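A rough sketch of the exhaustive sweep, under my own assumptions: uniform random phase, no measurement noise, and a hypothetical grid of candidate amplitudes. To decide which candidate “matches” best I score each one with the arcsine log-likelihood. That scoring rule is my choice and not something from the argument above; a pure yes/no consistency check would only rule out candidates smaller than the largest observed |x|.

```python
import numpy as np

rng = np.random.default_rng(1)

X0_TRUE = 2.0  # hypothetical true amplitude
x = X0_TRUE * np.cos(rng.uniform(0.0, 2.0 * np.pi, size=200))  # phase is discarded


def log_likelihood(x0, x):
    """Log-likelihood of the positions for a candidate amplitude x0 (uniform phase)."""
    if x0 <= np.max(np.abs(x)):
        return -np.inf  # some observation lies outside [-x0, x0]: inconsistent
    return float(np.sum(-np.log(np.pi) - 0.5 * np.log(x0**2 - x**2)))


# Brute force: evaluate EVERY candidate on a discretized grid.
grid = np.linspace(0.5, 4.0, 351)  # hypothetical search range, step 0.01
scores = np.array([log_likelihood(c, x) for c in grid])
best_x0 = float(grid[np.argmax(scores)])
n_evaluations = len(grid)

print(f"best candidate: {best_x0:.2f} after {n_evaluations} evaluations")
```

The cost is one evaluation per grid point no matter what the data say, and it grows with the resolution (and, in more than one dimension, exponentially with the number of parameters).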

However, if we use “probability theory” then we can speed up our search.

I think I will walk a bit more in the kitchen in the coming days, it might be educational to see how the simple Metropolis approach would work for this simple Harmonic oscillator case, if one wants to find X_0.
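Here is a minimal sketch of what that random-walk Metropolis experiment could look like. Everything in it is an assumption on my part: uniform random phase, noise-free positions, a flat (improper) prior on X_0 > 0, and hand-picked values for the step size, chain length and burn-in. It is not how Stan does it (Stan uses HMC/NUTS), just the simplest sampler that fits the question.

```python
import numpy as np

rng = np.random.default_rng(2)

X0_TRUE = 2.0  # hypothetical true amplitude
x = X0_TRUE * np.cos(rng.uniform(0.0, 2.0 * np.pi, size=100))  # phase is discarded


def log_posterior(x0, x):
    """Unnormalized log-posterior of x0: arcsine likelihood times a flat prior on x0 > 0."""
    if x0 <= np.max(np.abs(x)):
        return -np.inf  # amplitude too small to explain the data
    return float(np.sum(-0.5 * np.log(x0**2 - x**2)))


def metropolis(x, start, step, n_steps, rng):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    current = start
    current_lp = log_posterior(current, x)
    chain = np.empty(n_steps)
    for i in range(n_steps):
        proposal = current + step * rng.normal()
        proposal_lp = log_posterior(proposal, x)
        # Accept with probability min(1, posterior ratio).
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        chain[i] = current
    return chain


# Start well above the largest observed |x| so the initial log-posterior is finite.
start = float(np.max(np.abs(x))) + 1.0
chain = metropolis(x, start=start, step=0.02, n_steps=5000, rng=rng)
posterior_mean = float(chain[1000:].mean())  # drop the first 1000 steps as burn-in

print(f"posterior mean of X_0: {posterior_mean:.3f}")
```

Compared with sweeping a grid, the chain spends almost all of its steps in the narrow region just above max|x| where the posterior mass sits, which is the “information guides the search” idea in its most basic form.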

In other words, how does the real Hamiltonian of a physical system relate to the “Hamiltonian” used in Stan ? I wonder. If I find out something interesting in the coming weeks about this I post it here.

These are pretty simple questions, it is fun to consider them. I am a bit tired to answer this question now… I will think about it … the answer is not too complicated, but … need to sleep.

I will try to improve this post in the coming weeks, at the moment it is a bit “not too user friendly”.

TO BE CONTINUED … and most importantly TO BE IMPROVED …

(let’s consider this post as a “draft” for now - I will come back to it later - after a bit of thinking - with the purpose to clarify these thoughts - it is solid but not user friendly - I do admit that - so let’s wait with the answers/comments for a bit until I make this a bit more user friendly - kiitos/thanks)

PART 3 (the continuation) :

After a bit of back of the envelope calculations … the log likelihood is proportional (if we have infinite samples) to the log of the potential energy of the original system. So for a flat prior Stan is sampling exactly the distribution (for infinite samples) which is prescribed by the original Hamiltonian when the system is in a “heat bath” (canonical ensemble i.e. Boltzmann distribution - in this particular case). If the prior is Gaussian then that corresponds to a linear external force that pulls X'_0 (our guess for the “real” X_0) towards the prior mean. Hmm… interesting.
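For reference, this is the generic bookkeeping behind that analogy, written out as a sketch. It only records the standard identification used by HMC (potential energy = negative log posterior, temperature fixed to 1) and the arcsine likelihood that follows from assuming a uniform random phase; it does not check the specific proportionality claimed above. The symbols p_0, \mu, \sigma and N are introduced here for illustration.

```latex
% Per-sample likelihood for a uniform random phase, valid for |x_i| < X'_0:
p(x_i \mid X'_0) = \frac{1}{\pi \sqrt{(X'_0)^2 - x_i^2}}

% HMC samples a Boltzmann-like density exp(-U) with "potential energy"
U(X'_0) = -\sum_{i=1}^{N} \log p(x_i \mid X'_0) - \log p_0(X'_0),
\qquad
p(X'_0 \mid x_1, \dots, x_N) \propto e^{-U(X'_0)}

% Gaussian prior with mean \mu and width \sigma: a quadratic term in U,
% i.e. a linear restoring force towards \mu
-\log p_0(X'_0) = \frac{(X'_0 - \mu)^2}{2\sigma^2} + \text{const},
\qquad
F_{\text{prior}} = -\frac{\partial}{\partial X'_0}\left[\frac{(X'_0 - \mu)^2}{2\sigma^2}\right]
                 = -\frac{X'_0 - \mu}{\sigma^2}
```

With a flat prior the last term drops out and U is just the negative log-likelihood.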

Why was the first ML course not started with this example ? Why did I have to wait 2019-2009 years for this example ? LOL.


I am sure I “made” a mistake:

Sorry for the sloppy calculation here. I should not do this - but it’s summer and I am tired -
please excuse me, I promise, in my 2nd PhD I won’t do such “dirty” derivations.

I am sure these things are described in some books somewhere… but it was “easier” to “derive them”, again, the " " expresses self-irony. Also, it was more “fun” too. I hope I do not create much confusion here. This is basically just simple high school math which I carried out in an extremely sloppy way and most likely it is also wrong. The “tricky” part “was” the stat-phys part - but it’s also kinda high school-ish, the Boltzmann distribution, that is.

Ok, so this is just a tiny bit of inspiration perhaps interesting for someone who wants to contemplate the connection to physics on the simplest non trivial example on earth : harmonic oscillator.

Have a good summer. Sorry for the sloppy derivation, again.

Over and out.



UPDATE, apparently it is not only me who likes to ask such rhetorical questions : https://arxiv.org/pdf/1906.01836.pdf :

… if one restricts oneself to classical physics, its laws are deterministic, so one might ask: where do these probabilities come from?

How to make sense of objective probabilities in a deterministic universe?

And if those probabilities are in some sense “subjective”, namely assigned by us to events, and not “intrinsic” to those events, how can one say that macroscopic laws are objective?

thanks to Alexander for this tip on this interesting paper !

Btw, some of Alexander’s papers also try to look at Bayesian “statistics” from a statistical physics point of view ( https://arxiv.org/abs/1810.02627 ) :

“We use statistical mechanics to study model-based Bayesian data clustering. In this approach, each partition of the data into clusters is regarded as a microscopic system state, the negative data log-likelihood gives the energy of each state, and the data set realisation acts as disorder. Optimal clustering corresponds to the ground state of the system, and is hence obtained from the free energy via a low ‘temperature’ limit. We assume that for large sample sizes the free energy density is self-averaging, and we use the replica method to compute the asymptotic free energy density. The main order parameter in the resulting (replica symmetric) theory, the distribution of the data over the clusters, satisfies a self-consistent equation which can be solved by a population dynamics algorithm. From this order parameter one computes the average free energy, and all relevant macroscopic characteristics of the problem. The theory describes numerical experiments perfectly, and gives a significant improvement over the mean-field theory that was used to study this model in past.”

This might be an interesting read for people who use Stan for clustering, say GMMs and such. It might explain why Stan works, when it works, from a statistical physics point of view. I am no Stan expert, so maybe the things described in this paper are obvious to Stan experts. I just had a quick look and it seems that there is a pretty strong connection to MCMC. No wonder… really. They just use plain stat phys methods for Bayesian clustering, and HMC is itself a plain stat phys method for numerically tackling stat phys problems. So … I am pretty sure that this paper says nothing new to Stan experts, but if I were to write a PhD about Stan I would be very interested to learn how this paper can help explain what Stan is really doing when it does “MCMC”. After all, MCMC is just a numerical stat phys technique (it is a wonder that it even works), but luckily Alexander’s paper goes way beyond that, particularly as applied to Bayesian clustering: GMMs and friends.

Also, his paper does not seem to be a very difficult read either, especially after reading this basic intro to the replica method : https://arxiv.org/abs/cond-mat/0505032 .

Most likely some of the things in this paper can be pretty obvious to many ppl here … but just in case … if it is not, then it could provide some food for thought for ppl who want to understand how/why Stan works (like me) - demonstrated on simple problems, such as GMM clustering.