Function call stack of default HMC with NUTS -> Shared memory parallelisation

Dear all,

I would like to test and potentially further develop shared memory performance of the forward mode automatic differentiation in Stan/Stan Math. I’m working with the syclik/forward-mode branch of Stan.
To use it I had to downgrade CmdStan to v2.20. I can compile the code and run models, but I don't fully understand the internal workings of the default algorithms. Would you be so kind as to point me in the right direction? What are the main functions that are called repeatedly by default? I understand that it is HMC with the NUTS engine, but I cannot find the main sampling loop in the code.

Why forward mode, you may ask? Because I expect it to be more performant. This is a preliminary test to confirm that expectation. If it holds, we want to introduce an alternative method to the default HMC, one that will also benefit from forward-mode automatic differentiation and shared-memory parallelism. Thank you and have a good day ahead!


Best wishes,
Maxim


@betanalpha @andrjohns @WardBrian @syclik

I think you will find this series of blog posts by @jtimonen from last year helpful. It identifies the loop I think you're searching for and generally breaks down a lot of what is happening in the code base.

Note that forward mode will not work on all the same models that reverse mode does, due to things like the implicit functions used in the ODE solvers. This is possible to fix.

I believe @s.maskell has also recently been looking into using forward mode in the Stan algorithms.


@mabalenk: We have recently tested the forward-mode autodiff branch on a large number of models and have run-time comparisons as well as an analysis of which models fail to compile, etc. Our experiments indicate that the forward-mode runtime is generally slower. However, our hope is that forward mode allows us to use GPU offloading with OpenMP, and that this should recover the lost runtime and give us a net speedup [edit: assuming we want to run lots of parallel chains]. Is that what you were aiming for? Either way, I sense we should meet to discuss!


Why would you believe that to be true? Generally reverse mode will be faster for functions with a single output and many inputs.

Have you tried using Stan Math's OpenCL implementation?

We are aware of that, but we didn't think it allowed multiple chains to run efficiently in parallel; we understood it was focused on within-chain parallelism. When we have used OpenMP with reverse-mode autodiff we get poor speedups. We think this is because of the way the AD tape is implemented.


Did you make sure in the OpenMP tests that you have one AD tape per thread? That's critical for performance!


We didn’t know how to. Is that easy? If so, that would be amazing.


@wds15: It transpires that @mabalenk is my colleague! We would be very interested in your views on this topic.


With some quick googling I found out that you can bind OpenMP threads to specific CPU cores:

That should do the trick. All that is left to be done is to initialize the AD tape in a given thread (or just use one AD tape per per-thread evaluation of the function you want to autodiff… maybe ask @WardBrian for more info on his experience getting something similar working for the Stan backend he developed).


OK, but it is the AD bit that I think we are stuck on. Let's hope @WardBrian can offer some steer.

In BridgeStan we just declare a ChainableStack as static thread_local inside the function that we call gradient from; see:

To be completely honest, this particular thing is closer to "magic incantation" than "code I fully understand" for me.
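
For reference, the pattern boils down to something like the following. This is only a minimal sketch of the idea, assuming Stan Math is built with STAN_THREADS; the function name and the toy target are invented, not the actual BridgeStan source.

#include <stan/math.hpp>

// Hypothetical gradient routine that may be called from threads created
// outside of C++ (e.g. by Julia or Python).
double log_density_gradient(const Eigen::VectorXd& x, Eigen::VectorXd& grad) {
  // Make sure this thread owns an AD tape before any reverse-mode work happens.
  static thread_local stan::math::ChainableStack thread_tape;

  double fx;
  stan::math::gradient(
      [](const auto& v) { return stan::math::dot_self(v); },  // toy target
      x, fx, grad);
  return fx;
}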


No black magic here :). What you do there is make sure that an AD tape resource is available in a given thread. If you were to use Intel TBB worker threads, then you would not need to do this.

… however… getting everything right, with proper initialisation exactly once per thread in a performant way… THAT was indeed a bizarre journey we had to take to find a way to do this reliably, fast, and in a cross-platform-stable manner. Given the years this has now been doing its job, it was well worth the effort.


Yes, the particularly "magic" thing (to me anyway) is that we can create those multiple threads in another language (e.g. Julia) and this still works. That is also why we couldn't use TBB: we're not necessarily the ones creating the threads.


Dear all,

Thank you very much for all of your replies! I read Juho's blog posts twice. They do not touch upon the main loop over the samples, but they were still very helpful! I found this loop in [stan/hmc_nuts_diag_e_adapt.hpp at develop · stan-dev/stan · GitHub]; please see lines 263–277. It is already parallelised for shared memory using Intel's TBB.

My questions now would be:

  1. Did you stress test it?
  2. How well does it work in forward mode and in reverse mode? (I still need to check that TBB exists in the forward-mode branch) → Yes, it is there: the loop is parallelised with TBB in the forward-mode branch.
  3. Did you have a chance to compare it with OpenMP?
  4. Why would you implement both TBB and OpenMP in Stan, or is the OpenMP implementation specific to BridgeStan?

If you have any performance results for shared memory, would you please share them with me? Thank you for your help and have a great day ahead!


Best wishes,
Maxim

Would you please elaborate on this? Why is there a difference between TBB and OpenMP? Thank you!

For TBB threads we register an observer thingy which gets called whenever a TBB thread is created or destroyed, see

This observer ensures that every TBB worker thread has a ready-to-use AD tape. For OpenMP this sort of resource management does not happen (but maybe one can configure it?).

…and yeah, threads are an OS thing which you can control from different languages.
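
To illustrate the idea, a stripped-down sketch of such an observer could look like this (the real ad_tape_observer in Stan Math does some extra bookkeeping; this is just the gist):

#include <stan/math.hpp>
#include <tbb/task_scheduler_observer.h>

// Give every TBB worker thread a ready-to-use AD tape the first time
// the scheduler hands it work.
class tape_observer : public tbb::task_scheduler_observer {
 public:
  tape_observer() { observe(true); }  // start observing immediately

  void on_scheduler_entry(bool /*is_worker*/) override {
    // constructed exactly once per thread, on that thread's first entry
    static thread_local stan::math::ChainableStack thread_tape;
    (void)thread_tape;
  }

  void on_scheduler_exit(bool /*is_worker*/) override {}
};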


I don't know OpenMP well enough, but I think you can register a callback at the start of a thread via the initial-task-begin event hook?

https://www.openmp.org/spec-html/5.0/openmpsu50.html

This looks very promising, as it is the same callback logic that we use for the TBB. It would probably be a nice addition to Stan Math and would make Stan OpenMP-compatible. That said, there are a few more issues beyond initialisation, namely nesting and cleanup. Threads must take part in the same function evaluation, and only nested tasks derived from it can work on a given set of threads (see isolation: Work Isolation — oneTBB documentation). And once the function being autodiffed has completed and you have collected the results, you need to clean up memory.
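
For anyone who wants to experiment, a registration via the OpenMP tools (OMPT) interface might look roughly like this. This is an untested, hedged sketch against the OpenMP 5.0 specification; it uses the thread-begin callback rather than initial-task-begin for simplicity, and none of it exists in Stan Math today.

#include <omp-tools.h>
#include <stan/math.hpp>

// Give every OpenMP thread its own AD tape when it starts.
static void on_thread_begin(ompt_thread_t /*thread_type*/, ompt_data_t* /*thread_data*/) {
  static thread_local stan::math::ChainableStack thread_tape;
  (void)thread_tape;
}

static int tool_initialize(ompt_function_lookup_t lookup, int /*initial_device_num*/,
                           ompt_data_t* /*tool_data*/) {
  auto set_callback = reinterpret_cast<ompt_set_callback_t>(lookup("ompt_set_callback"));
  set_callback(ompt_callback_thread_begin,
               reinterpret_cast<ompt_callback_t>(&on_thread_begin));
  return 1;  // non-zero keeps the tool active
}

static void tool_finalize(ompt_data_t* /*tool_data*/) {}

// The OpenMP runtime looks up this symbol at start-up (first-party tool).
extern "C" ompt_start_tool_result_t* ompt_start_tool(unsigned int /*omp_version*/,
                                                     const char* /*runtime_version*/) {
  static ompt_start_tool_result_t result = {&tool_initialize, &tool_finalize, {0}};
  return &result;
}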

Dear all,

Thank you for all of your replies and explanations. I see now that the 'ad_tape_observer' object is an important concept. I understand how it tracks new threads via their IDs, but I don't see in the above source code where the AD tape instance associated with each thread is stored. What variable holds it?

My first naive implementation for OpenMP parallelism would simply ensure that each new thread has its own AD tape instance. This could be achieved with something like

#pragma omp parallel
{
    // instantiate this thread's own AD tape
    // (init_ad_tape is a placeholder for the actual initialisation)
    init_ad_tape(...);
    ...
}

In other words, giving each thread its own private instance of the tape ensures that every thread has its own AD tape. Once this simple concept works, I can think of more elaborate strategies to keep all the tapes in check. Thank you and have a good day ahead!
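
To make the idea concrete, here is a small self-contained sketch of what I have in mind, assuming Stan Math is built with STAN_THREADS and reusing the thread_local ChainableStack trick mentioned above (the toy target function is invented):

#include <stan/math.hpp>
#include <omp.h>
#include <iostream>
#include <vector>

int main() {
  const int n_tasks = 4;
  std::vector<double> results(n_tasks);

#pragma omp parallel for
  for (int i = 0; i < n_tasks; ++i) {
    // One AD tape per OpenMP thread, constructed on that thread's first iteration.
    static thread_local stan::math::ChainableStack thread_tape;

    Eigen::VectorXd x = Eigen::VectorXd::Constant(10, i + 1.0);
    Eigen::VectorXd grad;
    double fx;
    stan::math::gradient(
        [](const auto& v) { return stan::math::dot_self(v); },  // toy target
        x, fx, grad);
    results[i] = grad.sum();  // 2 * 10 * (i + 1) for this toy target
  }

  for (double r : results)
    std::cout << r << "\n";
  return 0;
}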