Hi Dear Reader,
I was walking in the kitchen a few hours ago and then “suddenly” a thought
struck me (which has been lurking in my subconscious for a few years now, for
sure, that is how these “sudden” realizations happen). So, here it comes. Enjoy !
Or hate it :) - just kidding, you will like it, it is solid.
Somewhat (a lot), related to Stan and “why it works”.
I have been thinking a bit about ML and why on earth it needs probability theory ?
“They say” (the voices in my head, lol, and the voices on the internet too), that entropy is
“lost information”. Yes. So why is probability theory good ?
Well, it “retains” as much information as possible and “uses it for something”.
BUT WHAT DOES IT USE IT FOR ?
It uses the “retained” but “almost lost” information (distribution) FOR LEARNING !!!
Ok, if this was obvious to you then please stop reading this post here. I will just bore you
to hell with it.
If not, then please hang on, the post is to be continued, but now I have to run.
EDIT - CONTINUATION (after 12 hours)
Let’s consider a classical harmonic oscillator : https://ocw.mit.edu/courses/nuclear-engineering/22-51-quantum-theory-of-radiation-interactions-fall-2012/lecture-notes/MIT22_51F12_Ch9.pdf
(this also has the quantum version - but let’s look at the large energy “limit” - classical limit).
The motion is described by X_0 and by the phase, but we only get a bunch of X-s, the phase information is lost. Now, we want to “infer” X_0.
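To make the “lost phase” concrete, here is a small sketch, assuming the measured position is x = X_0 cos(phi) and that the unknown phase phi is uniform on [0, 2 pi) (the uniform phase and the symbols phi and N are assumptions for illustration, they are not spelled out above). Averaging over the phase gives the density of a single measured X, and from it the log likelihood of N independent measurements:

$$
p(x \mid X_0) = \frac{1}{\pi \sqrt{X_0^2 - x^2}}, \qquad |x| < X_0
$$

$$
\log L(X_0) = -N \log \pi - \frac{1}{2} \sum_{i=1}^{N} \log\left(X_0^2 - x_i^2\right), \qquad X_0 > \max_i |x_i|
$$

Any X_0 smaller than the largest |x_i| is simply inconsistent with the data (likelihood zero).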
If we are not using “probability theory”, then we NEED to sweep over all possible X_0-s and see if the measured X-s are consistent with our assumption of X_0, if not, then we need to try a new X_0.
The problem is this: if we do not take “probability theory” into account, that is, all the information that we have, to GUIDE our search for X_0, then it is going to take a LONG TIME to find X_0.
If we consider a discrete situation then this time is finite (as in the case of Stan - since computers “discretize” things). However, if we use the information “IN THE BEST POSSIBLE WAY”, then the search for the X_0 takes much less time.
So basically, we are trading “computational steps” for information, when we start to use “probability theory”.
In other words : the stupid approach to find X_0 (in a discrete “simulation”) is to try all possible X_0-s and see which one matches the experiment (perfectly).
However, if we use “probability theory” then we can speed up our search.
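The “try every X_0” sweep can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the post: noise-free measurements x = X_0 cos(phase) with a uniform random phase, a made-up true value `true_x0 = 1.0`, and an arbitrary grid from 0.01 to 3.0. The names `log_likelihood`, `grid` and `best` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: amplitude true_x0, phases uniform and then "lost".
true_x0 = 1.0
phase = rng.uniform(0.0, 2.0 * np.pi, size=50)
x = true_x0 * np.cos(phase)

def log_likelihood(x0, x):
    """Log likelihood of the amplitude x0 under the uniform-phase assumption."""
    if x0 <= np.abs(x).max():
        return -np.inf  # x0 cannot be smaller than an observed |x|
    return float(np.sum(-np.log(np.pi) - 0.5 * np.log(x0**2 - x**2)))

# Brute force: visit every candidate X_0 on the grid, one by one.
grid = np.linspace(0.01, 3.0, 300)
consistent = [x0 for x0 in grid if x0 > np.abs(x).max()]
loglik = np.array([log_likelihood(x0, x) for x0 in grid])
best = grid[np.argmax(loglik)]

print("grid points visited:", len(grid))
print("best X_0 on the grid:", best)
```

Every grid point costs one pass over the data, whether or not it is anywhere near the answer. In this toy setup the likelihood only decreases as X_0 grows past the largest observed |x|, so the sweep ends up picking the smallest grid value that is still consistent with the data.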
I think I will walk a bit more in the kitchen in the coming days, it might be educational to see how the simple Metropolis approach would work for this simple Harmonic oscillator case, if one wants to find X_0.
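A minimal random-walk Metropolis sketch for this case could look like the following. It is not how Stan does it (Stan uses Hamiltonian Monte Carlo), and it rests on the same assumptions as before: noise-free x = X_0 cos(phase), uniform phase, flat prior on X_0, and a made-up `true_x0 = 1.0`. The step size, chain length and burn-in are arbitrary choices, and `log_post` and `metropolis` are hypothetical names.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: only the X-s are kept, the phases are thrown away.
true_x0 = 1.0
x = true_x0 * np.cos(rng.uniform(0.0, 2.0 * np.pi, size=50))
x_max = np.abs(x).max()

def log_post(x0):
    """Unnormalized log posterior of x0 (flat prior, uniform-phase likelihood)."""
    if x0 <= x_max:
        return -np.inf  # inconsistent with the data
    return -0.5 * np.sum(np.log(x0**2 - x**2))

def metropolis(n_iter=20000, step=0.01):
    """Random-walk Metropolis with a Gaussian proposal of width `step`."""
    current = 1.1 * x_max  # start slightly above the largest observed |x|
    current_lp = log_post(current)
    chain = np.empty(n_iter)
    for i in range(n_iter):
        proposal = current + step * rng.normal()
        proposal_lp = log_post(proposal)
        # Accept with probability min(1, posterior ratio).
        if np.log(rng.uniform()) < proposal_lp - current_lp:
            current, current_lp = proposal, proposal_lp
        chain[i] = current
    return chain

chain = metropolis()
samples = chain[2000:]  # drop burn-in
print("posterior mean of X_0:", samples.mean())
```

The sampler never visits X_0 values below the largest observed |x|, and it spends almost all of its time in the narrow region just above it, instead of sweeping the whole range.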
In other words, how does the real Hamiltonian of a physical system relate to the “Hamiltonian” used in Stan ? I wonder. If I find out something interesting in the coming weeks about this I will post it here.
These are pretty simple questions, it is fun to consider them. I am a bit tired to answer this question now… I will think about it … the answer is not too complicated, but … need to sleep.
I will try to improve this post in the coming weeks, at the moment it is a bit “not too user friendly”.
TO BE CONTINUED … and most importantly TO BE IMPROVED …
(let’s consider this post as a “draft” for now - I will come back to it later - after a bit of thinking - with the purpose to clarify these thoughts - it is solid but not user friendly - I do admit that - so let’s wait with the answers/comments for a bit until I make this a bit more user friendly - kiitos/thanks)
PART 3 (the continuation) :
After a bit of back of the envelope calculations … the log likelihood is proportional (if we have infinite samples) to the log of the potential energy of the original system. So for a flat prior Stan is sampling exactly the distribution (for infinite samples) which is prescribed by the original Hamiltonian when the system is in a “heat bath” (canonical ensemble, i.e. Boltzmann distribution, in this particular case). If the prior is Gaussian then that corresponds to a linear external force that pulls X'_0 (our guess for the “real” X_0) towards the prior mean. Hmm… interesting.
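The Gaussian prior part can be written out explicitly. A sketch, using the usual identification in Hamiltonian Monte Carlo of the “potential energy” U with the negative log posterior (temperature set to 1), and with \mu and \sigma as hypothetical names for the prior mean and standard deviation:

$$
p(X'_0 \mid \text{data}) \propto e^{-U(X'_0)}, \qquad U(X'_0) = -\log L(X'_0) - \log p_{\text{prior}}(X'_0)
$$

$$
U_{\text{prior}}(X'_0) = \frac{(X'_0 - \mu)^2}{2\sigma^2} + \text{const}, \qquad F_{\text{prior}} = -\frac{dU_{\text{prior}}}{dX'_0} = -\frac{X'_0 - \mu}{\sigma^2}
$$

So the prior acts like a spring: the force is linear in X'_0 and points towards \mu. A flat prior contributes a constant to U and therefore no force at all.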
Why was the first ML course not started with this example ? Why did I have to wait 2019-2009 years for this example ? LOL.
Phew.
I am sure I “made” a mistake:
Sorry for the sloppy calculation here. I should not do this - but it’s summer and I am tired -
please excuse me, I promise, in my 2nd PhD I won’t do such “dirty” derivations.
I am sure these things are described in some books somewhere… but it was “easier” to “derive them”, again, the " " expresses self-irony. Also, it was more “fun” too. I hope I do not create much confusion here. This is basically just simple high school math which I carried out in an extremely sloppy way and most likely it is also wrong. The “tricky” part “was” the stat-phys part - but it’s also kinda high school-ish, the Boltzmann distribution, that is.
Ok, so this is just a tiny bit of inspiration perhaps interesting for someone who wants to contemplate the connection to physics on the simplest non trivial example on earth : harmonic oscillator.
Have a good summer. Sorry for the sloppy derivation, again.
Over and out.
Cheers,
Jozsef