Function call stack of default HMC with NUTS -> Shared memory parallelisation

I found this thread [(older) parallel AD tape ideas] and am currently reading it, hoping it will shed more light on how AD tapes are created and handled by the code.

The AD tape is stored in a thread_local declared pointer.
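
Roughly, the pattern looks like this (a simplified illustration, not the verbatim Stan source):

// Simplified sketch: the AD system keeps a singleton pointer to its stack
// storage. With STAN_THREADS defined that pointer is thread_local, so every
// thread that constructs a ChainableStack ends up with its own tape.
struct AutodiffStackStorage;  // holds the vari stack, the memory arena, etc.

#ifdef STAN_THREADS
static thread_local AutodiffStackStorage* instance_ = nullptr;
#else
static AutodiffStackStorage* instance_ = nullptr;
#endif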

Thank you! When I try to declare the AD tape and make it firstprivate in OpenMP like this:

      static stan::math::ChainableStack autodiff_tape;
 
      #pragma omp parallel for private(i, xparticles_target) \
      firstprivate(k, dt, num_params, autodiff_tape) schedule(static)
      for (i = 0; i < loc_n; i++) {
        ...
      }

I obtain the following compiler error:

error: use of deleted function ‘stan::math::AutodiffStackSingleton<ChainableT, ChainableAllocT>::AutodiffStackSingleton(const AutodiffStackSingleton_t&) [with ChainableT = stan::math::vari; ChainableAllocT = stan::math::chainable_alloc; AutodiffStackSingleton_t = stan::math::AutodiffStackSingleton<stan::math::vari, stan::math::chainable_alloc>]’
   53 |       #pragma omp parallel for private(i, xparticles_target) \
      |               ^~~
In file included from stan/lib/stan_math/stan/math/rev/core/chainablestack.hpp:4,
                 from stan/src/stan/smcs/hmc_proposal.hpp:19:
stan/lib/stan_math/stan/math/rev/core/autodiffstackstorage.hpp:115:12: note: declared here
  115 |   explicit AutodiffStackSingleton(AutodiffStackSingleton_t const &) = delete;
      |            ^~~~~~~~~~~~~~~~~~~~~~
make: *** [<builtin>: src/cmdstan/main.o] Error 1

It seems I’m missing some steps. Probably I need to initialise my autodiff_tape object. Can you please point me to an example of how you do it in the code?

Also, do I understand correctly that in order to use threads in the code I need to set the STAN_THREADS variable to true? I’m not setting it at the moment.

You have to define STAN_THREADS, and the chainable stack you declare should be thread_local.

Thank you for your suggestion. I tried to declare the AD tape as thread_local and make it threadprivate for OpenMP. I followed the example on the LLNL website (OpenMP Directives: THREADPRIVATE Directive | LLNL HPC Tutorials):

static thread_local stan::math::ChainableStack autodiff_tape;

// Make autodiff tape local to each thread
#pragma omp threadprivate(autodiff_tape)

// Disable dynamic threads explicitly
omp_set_dynamic(0);
 
#pragma omp parallel for private(i, xparticles_target) \
firstprivate(k, dt, num_params) schedule(static)
for (i = 0; i < loc_n; i++) {
    ...
}

However, this solution crashes when I run it with more than 2 threads. Maybe thread_local and threadprivate conflict with each other? I will look for other alternatives.

I think you need the callback listed above to initialize the AD tape on each thread at the start. You should have something pretty similar to the observer class we have in Stan now, but hooked into OpenMP.
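
For reference, the existing observer hooks into the TBB scheduler; roughly the pattern is (a from-memory outline of init_chainablestack.hpp, not a verbatim copy, with the bodies reduced to comments):

#include <tbb/task_scheduler_observer.h>

#include <stan/math/rev/core/chainablestack.hpp>

class ad_tape_observer_outline final : public tbb::task_scheduler_observer {
 public:
  ad_tape_observer_outline() { observe(true); }  // start observing scheduler events
  void on_scheduler_entry(bool) override {
    // construct a ChainableStack for the worker thread entering the scheduler
  }
  void on_scheduler_exit(bool) override {
    // release the ChainableStack owned by the worker thread that is leaving
  }
};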

I tried to ensure each thread had its own AD tape. I created an overall parallel region and declared the tape in there:

#pragma omp parallel num_threads(T)
{
    static thread_local stan::math::ChainableStack autodiff_tape;

    #pragma omp for private(i, xparticles_target) \
    firstprivate(k, dt, num_params) schedule(static)
    for (i = 0; i < loc_n; i++)
    {
        ...
    }
}

This still leads to a crash by segmentation fault:

*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x55ca9bbd91c0
[ 0] /usr/lib/libc.so.6(+0x38f50)[0x7fe45b63ff50]
[ 1] /usr/lib/libc.so.6(+0x153837)[0x7fe45b75a837]
[ 2] ./GLMM_Poisson2(+0x6d4b8)[0x55ca99c754b8]
[ 3] ./GLMM_Poisson2(+0x6d5f2)[0x55ca99c755f2]
[ 4] /usr/lib/libgomp.so.1(+0x1d476)[0x7fe45bcd9476]
[ 5] /usr/lib/libc.so.6(+0x85bb5)[0x7fe45b68cbb5]
[ 6] /usr/lib/libc.so.6(+0x107d90)[0x7fe45b70ed90]
*** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node smolya exited on signal 11 (Segmentation fault).

Yes, I will look for examples on how to use callbacks in OpenMP. As I understand it, callbacks are specifically designed for OpenMP tasks. What I have here is a traditional for loop.

I have been thinking about this. Potentially, I did not describe my goal correctly. Let me try to do it again:

  1. I decided to use the forward-mode automatic differentiation branch of the code (GitHub - stan-dev/stan at syclik/forward-mode), hoping that it would be more performant, i.e. would not create roadblocks and bottlenecks for the OpenMP pragmas. A quick test proved me wrong. My OpenMP instructions still do not scale in forward mode. I tried using 16 threads on my desktop computer, but I see only two threads properly loaded with work in the htop process manager.

  2. My aim is to use shared memory parallelisation over the samples. I do not intend to use threads to parallelise the autodifferentiation chains. I would be satisfied with a single thread working on the autodifferentiation of its own subset of samples.

  3. I believe it is not necessary to go into such depths as creating a dedicated OpenMP tool and linking against it in order to create a data structure to hold a collection of AD tapes. I don’t think mutexes and atomic updates of this data structure are needed in my case. The logic of my code would be to
    a) start an OpenMP parallel region,
    b) use a single thread to create a data structure with AD tapes, one AD tape per thread,
    c) enter the parallel for loop to run the computation, making sure each thread uses the AD tape with its thread index,
    d) once all threads have completed the computation, free the memory of the data structure used for keeping the AD tapes.
    Therefore, I think these tasks can be accomplished by means of standard OpenMP pragmas (see the sketch after this list).

  4. Potentially, I don’t understand the purpose of STAN_THREADS. It now seems to me that this variable enables parallel processing of a single AD tape with multiple threads. Can you please confirm this? Then I would not need to enable it for my use case. Thank you for your advice and attention so far.
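
A minimal sketch of the structure I have in mind (illustrative only: the loop body is a placeholder, and it assumes a tape allocated by one thread can later be used by another, which the thread_local singleton design may not allow):

#include <omp.h>
#include <memory>
#include <vector>
#include <stan/math/rev/core/chainablestack.hpp>

void run_particles(int loc_n) {
  std::vector<std::unique_ptr<stan::math::ChainableStack>> tapes;

  #pragma omp parallel                  // a) start an OpenMP parallel region
  {
    #pragma omp single                  // b) one thread creates one AD tape per thread
    {
      tapes.resize(omp_get_num_threads());
      for (auto& tape : tapes)
        tape = std::make_unique<stan::math::ChainableStack>();
    }                                   // implicit barrier before the work-sharing loop

    #pragma omp for schedule(static)    // c) each thread works with tapes[omp_get_thread_num()]
    for (int i = 0; i < loc_n; i++) {
      // ... per-sample computation using this thread's AD tape ...
    }
  }

  tapes.clear();                        // d) free the AD tapes once all threads are done
}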

You need STAN_THREADS defined, since every thread you run needs a dedicated AD tape.

Actually @wds15, if it’s forward mode only, I don’t think he needs a tape?

Have you run gdb to see where the thread is segfaulting? It looks to be somewhere deep in libc from the error.

I figured out the mistake. It was on my end of the code, not related to Stan. I have tried various ways of using a dedicated AD tape per thread. My best attempt so far is:

#pragma omp parallel
{
    // Autodiff tape
    static thread_local stan::math::ChainableStack autodiff_tape;

    #pragma omp for private(i, xparticles_target, xparticles_forward) \
    firstprivate(num_params, Tparams) schedule(static)
    for (i = 0; i < loc_n; i++)
    {
        ....
    }
}

I cannot declare autodiff_tape before the parallel region and pass it to the omp parallel for block via private or firstprivate, because that conflicts with the thread_local declaration: the code does not compile when I do that. Either way, I cannot get more than two threads working on the computation in the loop. I do not see any speed-up past these two threads.

Can we please arrange a meeting over Zoom or Teams sometime next week? Then I will show you, what I’m trying to do.

My feeling is that the global structure that holds ChainableStacks for Intel TBB may conflict with the OpenMP declarations. This is my best educated guess.

I’m not understanding what this sentence means. Are you able to run more than two threads?

My feeling is that the global structure that holds ChainableStacks for Intel TBB may conflict with the OpenMP declarations. This is my best educated guess.

TBB should only spin those up when it launches a thread.

Have you tried the OpenMP callback yet? I still think that is what you want. I would comment out the TBB observer and write a new class that does the same thing for OpenMP. Then register the function with OpenMP so it initializes the chainable stack at the start of each new thread (and another function for cleaning up the thread after it closes).
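
By registering I mean something along these lines (an untested sketch of the standard OMPT entry point; my_ompt_initialize, my_ompt_finalize, and the handler bodies are placeholders):

#include <omp-tools.h>

static void on_thread_begin(ompt_thread_t, ompt_data_t*) { /* create this thread's AD tape */ }
static void on_thread_end(ompt_data_t*) { /* release this thread's AD tape */ }

static int my_ompt_initialize(ompt_function_lookup_t lookup, int, ompt_data_t*) {
  // Look up the runtime entry point and register the two thread callbacks.
  auto set_callback = (ompt_set_callback_t)lookup("ompt_set_callback");
  set_callback(ompt_callback_thread_begin, (ompt_callback_t)on_thread_begin);
  set_callback(ompt_callback_thread_end, (ompt_callback_t)on_thread_end);
  return 1;  // non-zero keeps the tool active
}

static void my_ompt_finalize(ompt_data_t*) {}

// The OpenMP runtime looks up this symbol when the program starts.
extern "C" ompt_start_tool_result_t* ompt_start_tool(unsigned int, const char*) {
  static ompt_start_tool_result_t result = {&my_ompt_initialize, &my_ompt_finalize, {0}};
  return &result;
}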

I tried running an experiment with 2, 4, 8 and 16 threads. There is a performance benefit between using 1 and 2 threads. But there is no difference when I use more threads. Performance stays the same as with 2 threads.

I cannot guarantee that somewhere in the Stan code TBB won’t launch a thread.

I still think an OpenMP callback is overkill. But it seems I have no other options.

Dear all,

How are you? I hope all is well with you. I finally implemented a little OpenMP Tool (OMPT) library. It uses OpenMP callbacks to allocate an individual AD tape per thread. The code builds and runs. Unfortunately, I don’t see any performance gain when I use multiple OpenMP threads. Please see the attached plot. I also attach the Stan model file for this experiment.

glmm_poisson2.pdf (9.5 KB)
GLMM_Poisson2.stan (1.3 KB)

I run the code using the following command:

mpiexec -np 1 ./GLMM_Poisson2 data file='GLMM_Poisson2.data.R' method=sample algorithm=smcs proposal=rw T=16 Tsmc=1000 num_samples=1024

The code snippets to register/deregister AD tapes are:

static void on_ompt_callback_thread_begin(
    ompt_thread_t thread_type,
    ompt_data_t  *thread_data)
{
    uint64_t tid = thread_data->value = my_next_id();

    counter[tid].cc.thread_begin += 1;

    printf("[%lu] Status: thread begun.\n", tid);

    observer_openmp.create_ad_tape();
    printf("[%lu] Status: AD tape created.\n", tid);
}

static void on_ompt_callback_thread_end(
    ompt_data_t *thread_data)
{
    uint64_t tid = thread_data->value;

    counter[tid].cc.thread_end += 1;
    printf("[%lu] Status: thread ended.\n", tid);

    observer_openmp.erase_ad_tape();
    printf("[%lu] Status: AD tape erased.\n", tid);
}

The header file for an OpenMP chainable stack is:

#ifndef STAN_MATH_REV_CORE_INIT_CHAINABLESTACK_OPENMP_HPP
#define STAN_MATH_REV_CORE_INIT_CHAINABLESTACK_OPENMP_HPP

#include <stan/math/rev/core/chainablestack.hpp>

// #include <tbb/task_scheduler_observer.h>

#include <memory>
#include <mutex>
#include <unordered_map>
#include <utility>
#include <thread>
#include <tuple>

namespace stan
{
    namespace math
    {
        /**
         * OpenMP observer object which is a callback hook called whenever the
         * OpenMP runtime begins a new thread. This hook ensures that each
         * thread has an initialized AD tape ready for use.
         */
        class ad_tape_observer_openmp final
        {

            using stack_ptr = std::unique_ptr<ChainableStack>;
            using ad_map    = std::unordered_map<std::thread::id, stack_ptr>;

            public:

                ad_tape_observer_openmp() : thread_tape_map_() {
                    printf("Status: Creating AD tape observer (OpenMP).\n");
                }

                ~ad_tape_observer_openmp()
                {
                    printf("Status: Erasing AD tape observer (OpenMP).\n");
                }

                void create_ad_tape()
                {
                    printf("Observer Status: Creating AD tape for thread (OpenMP).\n");

                    std::lock_guard<std::mutex> thread_tape_map_lock(thread_tape_map_mutex_);
                    const std::thread::id thread_id = std::this_thread::get_id();

                    if (thread_tape_map_.find(thread_id) == thread_tape_map_.end())
                    {
                        ad_map::iterator insert_elem;
                        bool status = false;
                        std::tie(insert_elem, status)
                            = thread_tape_map_.emplace(ad_map::value_type{thread_id, nullptr});
                        insert_elem->second = std::make_unique<ChainableStack>();
                    }
                }

                void erase_ad_tape()
                {
                    printf("Observer Status: Erasing AD tape for thread (OpenMP).\n");

                    std::lock_guard<std::mutex> thread_tape_map_lock(thread_tape_map_mutex_);
                    auto elem = thread_tape_map_.find(std::this_thread::get_id());

                    if (elem != thread_tape_map_.end())
                    {
                        thread_tape_map_.erase(elem);
                    }
                }

            private:
                ad_map     thread_tape_map_;
                std::mutex thread_tape_map_mutex_;
        };
    }  // namespace math
}  // namespace stan

#endif

The OpenMP observer is a static variable defined inside the OpenMP tool as:

static stan::math::ad_tape_observer_openmp observer_openmp;

I would like to investigate this issue further. My idea now is to write the simplest possible Stan model with only one loop and one call to Stan’s log_prob() function. Then I can create another variation of this example without log_prob(), replacing it with equivalent custom code. Finally, I will be able to profile these two versions. If you have other ideas or suggestions, I would be glad to hear them. Thank you very much for your help and attention!

Make sure you are using a model where it is worth parallelising. Have a look at the brms vignette on within-chain parallelisation for an example. Also make sure that you do see speedups when using the regular Intel TBB based approach with reduce_sum, and then switch over to OpenMP.

Thank you for your reply. Can you please share the link to this example model? I think my goal is different: I’m not aiming to parallelise within a chain. I would like to launch several threads, each working on its own independent set of particles. Computation within a chain can (should) remain sequential.

Is this across-chain parallelization, or are the particles used to make the next step within a single chain? You may find the PR discussion for multiple chains in parallel useful.


We (a team that includes @mabalenk and me) are interested in running multiple chains in parallel. We will take a look at that discussion.