I would like to test and potentially further develop the shared-memory performance of forward-mode automatic differentiation in Stan/Stan Math. I’m working with the syclik/forward-mode branch of Stan.
To use it I had to downgrade CmdStan to v2.20. I can compile the code and run models, but I don’t fully understand the internal workings of the default algorithms. Would you be so kind as to point me in the right direction? What are the main functions that are called repeatedly by default? I understand that it is HMC with the NUTS engine, but I cannot find the main loop over iterations in the code.
Why forward mode, you may ask: because I expect it to be more performant. This is just a preliminary test to confirm that expectation. If it holds, we want to introduce an alternative method to the default HMC. This method will also benefit from forward-mode automatic differentiation and shared-memory parallelism. Thank you and have a good day ahead!
I think you will find this series of blog posts by @jtimonen from last year helpful. It identifies the loop I think you’re searching for and generally breaks down a lot of what is happening in the code base.
Note that forward mode will not work on all the same models reverse mode does, due to things like the implicit functions used by the ODE solvers. This is possible to fix.
I believe @s.maskell has also recently been looking into using forward mode in the Stan algorithms.
@mabalenk: We have recently tested the forward mode autodiff branch on a large number of models and have run-time comparisons as well as an analysis of which models fail to compile etc. Our experiments indicate that the forward mode runtime is generally slower. However, our hope is that forward mode allows us to use GPU offloading with OpenMP and that should recover the lost runtime and get us a net speedup [edit: assuming we want to run lots of parallel chains]. Is that what you were aiming for? Either way, I sense we should meet to discuss!
We are aware of that, but we didn’t think it allowed multiple chains to run efficiently in parallel; we understood it was focused on within-chain parallelism. When we have used OpenMP with reverse-mode autodiff we get poor speedups, which we think is because of the way the tape is implemented.
With some quick googling I found that you can bind OpenMP threads to specific CPU cores:
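For example, with OpenMP 4.0+ runtimes the binding can be controlled entirely through standard environment variables; this is a generic sketch (the thread count and the program invocation are hypothetical, and the granularity of places is implementation-dependent):

```shell
# Pin each OpenMP thread to its own physical core.
export OMP_PLACES=cores       # one "place" per physical core
export OMP_PROC_BIND=close    # bind threads to places, packed near the master thread
export OMP_NUM_THREADS=8      # hypothetical thread count for the test machine
./my_model sample             # hypothetical CmdStan-style invocation
```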
That should do the trick. All that is left to be done is to initialize the AD tape in a given thread (or just use one AD tape per evaluation, in a given thread, of the function you want to autodiff…maybe ask @WardBrian for more info on his experience getting something related to work for the Stan backend he developed).
No black magic here :). What you do there is make sure that there is an AD tape resource around in a given thread. If you were to use Intel TBB worker threads, then you would not need to do this.
… however… getting everything right, with proper initialization exactly once per thread in a performant way… THAT was indeed a bizarre journey we had to take to find the right way to do this reliably, fast, and in a cross-platform-stable manner. Given the years this has now been doing its job, it was well worth the effort.
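A minimal, self-contained sketch of the “AD tape resource in every thread, initialized exactly once per thread” idea: a function-local `thread_local` object is constructed lazily the first time a thread touches it, and never again in that thread. `TapeStub` is a hypothetical stand-in, not the real Stan Math tape type:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical stand-in for an AD tape; the real Stan Math type differs.
std::atomic<int> tape_constructions{0};

struct TapeStub {
  TapeStub() { ++tape_constructions; }
};

// Any code path that may run in a worker thread first "touches" the tape,
// which constructs it exactly once per thread.
void touch_tape() {
  static thread_local TapeStub tape;
  (void)tape;  // silence unused-variable warnings
}

// Spawn num_threads workers that each touch the tape twice; returns how
// many tape instances were actually constructed.
int tapes_constructed_by(int num_threads) {
  tape_constructions = 0;
  std::vector<std::thread> workers;
  for (int i = 0; i < num_threads; ++i)
    workers.emplace_back([] {
      touch_tape();
      touch_tape();  // repeated touches in the same thread are cheap no-ops
    });
  for (auto& w : workers) w.join();
  return tape_constructions.load();
}
```

With four worker threads, exactly four tapes get constructed, regardless of how often each thread touches its tape.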
Yes, the particularly “magic” thing (to me, anyway) is that we can create those multiple threads in another language (e.g. Julia) and this still works. That is also why we couldn’t use TBB, since we’re not necessarily the ones creating the threads.
Thank you very much for all of your replies! I read Juho’s blog posts twice. They still do not touch on the main loop over the samples, but the posts were very helpful! I found this loop in [stan/hmc_nuts_diag_e_adapt.hpp at develop · stan-dev/stan · GitHub]. Please see lines no. 263–277. It is already parallelised for shared memory using Intel’s TBB.
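For intuition, the cross-chain structure of that loop can be mimicked in a self-contained way. The real code uses `tbb::parallel_for` over chain ids; `run_chain` below is a hypothetical stand-in for a single sampler run, and `std::thread` is used here only so the sketch compiles without TBB:

```cpp
#include <thread>
#include <vector>

// Hypothetical stand-in for running one chain of the sampler; the real
// loop in hmc_nuts_diag_e_adapt.hpp invokes the NUTS sampler per chain id.
int run_chain(unsigned chain_id) {
  return static_cast<int>(chain_id) * 10;  // placeholder for a chain's output
}

// Shared-memory parallelism over chains, analogous to the TBB parallel_for.
std::vector<int> run_all_chains(unsigned num_chains) {
  std::vector<int> results(num_chains);
  std::vector<std::thread> workers;
  for (unsigned i = 0; i < num_chains; ++i)
    workers.emplace_back([&results, i] { results[i] = run_chain(i); });
  for (auto& w : workers) w.join();
  return results;
}
```

Each chain writes only to its own slot of `results`, so no synchronization beyond the final `join` is needed.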
My questions now would be:
Did you stress test it?
How well does it work in forward mode and in reverse mode? (I still need to check that TBB exists in the forward-mode branch.) → Yes, it is there: the loop is parallelised with TBB in the forward-mode branch.
Did you have a chance to compare it with OpenMP?
Why would you implement both TBB and OpenMP in Stan, or is the OpenMP implementation specific to BridgeStan?
If you have any performance results for shared memory, would you please share them with me? Thank you for your help and have a great day ahead!
This looks very promising, as it is the same callback logic as we use for the TBB. It would probably be a nice addition to Stan Math, which would make Stan OpenMP-compatible. That said, there are a few more issues beyond initialisation, namely nesting and cleanup. Threads must take part in the same function evaluation, and only nested evaluations derived from it can work on a set of threads (see isolation: Work Isolation — oneTBB documentation). And once a function being autodiffed has completed and you have collected the results, you need to clean up the memory.
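The cleanup requirement can be illustrated with a toy arena. `ToyArena` is a hypothetical stand-in for the AD tape, not the real Stan Math type; in Stan Math the actual cleanup entry point is `stan::math::recover_memory()`:

```cpp
#include <cstddef>
#include <vector>

// Toy arena standing in for the AD tape; not the real Stan Math type.
struct ToyArena {
  std::vector<double> nodes;                // "operations recorded" per eval
  std::size_t size() const { return nodes.size(); }
  void recover() { nodes.clear(); }         // analogous to recover_memory()
};

// Evaluate a function while recording on the tape, extract the result,
// and only then release the tape's memory for the next evaluation.
double eval_with_cleanup(ToyArena& arena, double x) {
  arena.nodes.push_back(x);   // recording during the autodiff sweep
  double result = x * x;      // collect the result first...
  arena.recover();            // ...then clean up before the next evaluation
  return result;
}
```

Skipping the `recover()` step would let the arena grow across evaluations, which is the memory leak the post above warns about.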
Thank you for all of your replies and explanations. I see now that the ‘ad_tape_observer’ object is an important concept, and I understand how it tracks the new threads via their ids. But I don’t see in the above source code the AD tape instance associated with each thread. What variable is it stored in?
My first naive implementation for OpenMP parallelism would simply ensure that each new thread has its own AD tape instance. This can be achieved with OpenMP’s threadprivate directive:
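Roughly like the following sketch. `AdTape` is a hypothetical stand-in for the real tape type (e.g. Stan Math’s `ChainableStack`). Since this snippet can’t assume an OpenMP toolchain, it uses C++ `thread_local`, which gives the same one-instance-per-thread semantics as `#pragma omp threadprivate`; the OpenMP form is shown in the comment:

```cpp
#include <thread>

// Hypothetical stand-in for the AD tape type; the real declaration
// would use the actual Stan Math tape type instead.
struct AdTape {};

// With an OpenMP toolchain the per-thread copy would be declared as:
//   static AdTape ad_tape;
//   #pragma omp threadprivate(ad_tape)
// thread_local keeps this sketch compilable without OpenMP:
thread_local AdTape ad_tape;

// Returns true if a worker thread sees a different tape instance than the
// calling thread, i.e. the tape really is private to each thread.
bool tape_is_thread_private() {
  AdTape* main_tape = &ad_tape;
  AdTape* worker_tape = nullptr;
  std::thread t([&worker_tape] { worker_tape = &ad_tape; });
  t.join();
  return worker_tape != nullptr && worker_tape != main_tape;
}
```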
In other words, making this object private to a thread ensures that each thread has its own AD tape. Once this simple concept works, I can think of more elaborate strategies to keep all the tapes in check. Thank you and have a good day ahead!