Function call stack of default HMC with NUTS -> Shared memory parallelisation

I found this thread [(older) parallel AD tape ideas] and am currently reading it, hoping it will shed more light on how AD tapes are created and handled by the code.

The AD tape is stored in a thread_local declared pointer.
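
Roughly, the pattern looks like this (a simplified illustration, not the verbatim Stan source):

// Simplified sketch: the AD system keeps a singleton pointer to its stack
// storage. With STAN_THREADS defined that pointer is thread_local, so every
// thread that constructs a ChainableStack ends up with its own tape.
struct AutodiffStackStorage;  // holds the vari stack, the memory arena, etc.

#ifdef STAN_THREADS
static thread_local AutodiffStackStorage* instance_ = nullptr;
#else
static AutodiffStackStorage* instance_ = nullptr;
#endif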

Thank you! When I try to declare the AD tape and make it firstprivate in OpenMP like this:

      static stan::math::ChainableStack autodiff_tape;
 
      #pragma omp parallel for private(i, xparticles_target) \
      firstprivate(k, dt, num_params, autodiff_tape) schedule(static)
      for (i = 0; i < loc_n; i++) {
        ...
      }

I obtain the following compiler error:

error: use of deleted function ‘stan::math::AutodiffStackSingleton<ChainableT, ChainableAllocT>::AutodiffStackSingleton(const AutodiffStackSingleton_t&) [with ChainableT = stan::math::vari; ChainableAllocT = stan::math::chainable_alloc; AutodiffStackSingleton_t = stan::math::AutodiffStackSingleton<stan::math::vari, stan::math::chainable_alloc>]’
   53 |       #pragma omp parallel for private(i, xparticles_target) \
      |               ^~~
In file included from stan/lib/stan_math/stan/math/rev/core/chainablestack.hpp:4,
                 from stan/src/stan/smcs/hmc_proposal.hpp:19:
stan/lib/stan_math/stan/math/rev/core/autodiffstackstorage.hpp:115:12: note: declared here
  115 |   explicit AutodiffStackSingleton(AutodiffStackSingleton_t const &) = delete;
      |            ^~~~~~~~~~~~~~~~~~~~~~
make: *** [<builtin>: src/cmdstan/main.o] Error 1

It seems I’m missing some steps. Probably I need to initialise my autodiff_tape object. Can you please point me to an example of how you do it in the code?

Also, do I understand correctly that in order to use threads in the code I need to set the STAN_THREADS variable to true? I’m not setting it at the moment.

You have to define STAN_THREADS, and the chainable stack you declare should be thread_local.

Thank you for your suggestion. I tried to declare the AD tape as thread_local and make it threadprivate for OpenMP. I followed the example on the LLNL website (OpenMP Directives: THREADPRIVATE Directive | LLNL HPC Tutorials):

static thread_local stan::math::ChainableStack autodiff_tape;

// Make autodiff tape local to each thread
#pragma omp threadprivate(autodiff_tape)

// Disable dynamic threads explicitly
omp_set_dynamic(0);
 
#pragma omp parallel for private(i, xparticles_target) \
firstprivate(k, dt, num_params) schedule(static)
for (i = 0; i < loc_n; i++) {
    ...
}

However, this solution crashes when I run it with more than 2 threads. Maybe thread_local and threadprivate conflict with each other? I will look for other alternatives.

I think you need the callback listed above to initialize the AD tape on each thread at the start. You should have something pretty similar to the observer class we have in Stan now, but hooked into OpenMP.
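
For reference, the existing observer hooks into the TBB scheduler; roughly the pattern is (a from-memory outline of init_chainablestack.hpp, not a verbatim copy, with the bodies reduced to comments):

#include <tbb/task_scheduler_observer.h>

#include <stan/math/rev/core/chainablestack.hpp>

class ad_tape_observer_outline final : public tbb::task_scheduler_observer {
 public:
  ad_tape_observer_outline() { observe(true); }  // start observing scheduler events
  void on_scheduler_entry(bool) override {
    // construct a ChainableStack for the worker thread entering the scheduler
  }
  void on_scheduler_exit(bool) override {
    // release the ChainableStack owned by the worker thread that is leaving
  }
};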

I tried to ensure each thread had its own AD tape. I created an overall parallel region and declared the tape in there:

#pragma omp parallel num_threads(T)
{
    static thread_local stan::math::ChainableStack autodiff_tape;

    #pragma omp for private(i, xparticles_target) \
    firstprivate(k, dt, num_params) schedule(static)
    for (i = 0; i < loc_n; i++)
    {
        ...
    }
}

This still leads to a crash by segmentation fault:

*** Process received signal ***
Signal: Segmentation fault (11)
Signal code: Address not mapped (1)
Failing at address: 0x55ca9bbd91c0
[ 0] /usr/lib/libc.so.6(+0x38f50)[0x7fe45b63ff50]
[ 1] /usr/lib/libc.so.6(+0x153837)[0x7fe45b75a837]
[ 2] ./GLMM_Poisson2(+0x6d4b8)[0x55ca99c754b8]
[ 3] ./GLMM_Poisson2(+0x6d5f2)[0x55ca99c755f2]
[ 4] /usr/lib/libgomp.so.1(+0x1d476)[0x7fe45bcd9476]
[ 5] /usr/lib/libc.so.6(+0x85bb5)[0x7fe45b68cbb5]
[ 6] /usr/lib/libc.so.6(+0x107d90)[0x7fe45b70ed90]
*** End of error message ***
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpiexec noticed that process rank 0 with PID 0 on node smolya exited on signal 11 (Segmentation fault).

Yes, I will look for examples on how to use callbacks in OpenMP. As I understand it, callbacks are specifically designed for OpenMP tasks. What I have here is a traditional for loop.

I have been thinking about this. Potentially, I did not describe my goal correctly. Let me try to do it again:

  1. I decided to use the forward-mode automatic differentiation branch of the code (GitHub - stan-dev/stan at syclik/forward-mode), hoping that it would be more performant, i.e. would not create roadblocks and bottlenecks for the OpenMP pragmas. A quick test proved me wrong. My OpenMP instructions still do not scale in forward mode. I tried using 16 threads on my desktop computer, but I see only two threads properly loaded with work in the htop process manager.

  2. My aim is to use shared memory parallelisation over the samples. I do not intend to use threads to parallelise the autodifferentiation chains. I would be satisfied with a single thread working on the autodifferentiation of its own subset of samples.

  3. I believe it is not necessary to go into such depths as creating a dedicated OpenMP tool and linking against it in order to create a data structure to hold a collection of AD tapes. I don’t think mutexes and atomic updates of this data structure are needed in my case. The logic of my code would be to
    a) start an OpenMP parallel region,
    b) use a single thread to create a data structure with AD tapes, one AD tape per thread,
    c) enter the parallel for loop to run the computation, making sure each thread uses the AD tape with its thread index,
    d) once all threads have completed the computation, free the memory of the data structure used for keeping the AD tapes.
    Therefore, I think these tasks can be accomplished by means of standard OpenMP pragmas (see the sketch after this list).

  4. Potentially, I don’t understand the purpose of STAN_THREADS. It now seems to me that this variable enables parallel processing of a single AD tape with multiple threads. Can you please confirm this? Then I would not need to enable it for my use case. Thank you for your advice and attention so far.
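
A minimal sketch of the structure I have in mind (illustrative only: the loop body is a placeholder, and it assumes a tape allocated by one thread can later be used by another, which the thread_local singleton design may not allow):

#include <omp.h>
#include <memory>
#include <vector>
#include <stan/math/rev/core/chainablestack.hpp>

void run_particles(int loc_n) {
  std::vector<std::unique_ptr<stan::math::ChainableStack>> tapes;

  #pragma omp parallel                  // a) start an OpenMP parallel region
  {
    #pragma omp single                  // b) one thread creates one AD tape per thread
    {
      tapes.resize(omp_get_num_threads());
      for (auto& tape : tapes)
        tape = std::make_unique<stan::math::ChainableStack>();
    }                                   // implicit barrier before the work-sharing loop

    #pragma omp for schedule(static)    // c) each thread works with tapes[omp_get_thread_num()]
    for (int i = 0; i < loc_n; i++) {
      // ... per-sample computation using this thread's AD tape ...
    }
  }

  tapes.clear();                        // d) free the AD tapes once all threads are done
}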

You need STAN_THREADS defined, since every thread you run needs a dedicated AD tape.

Actually @wds15, if it’s forward mode only, I don’t think he needs a tape?

Have you run gdb to see where the thread is segfaulting? It looks to be somewhere deep in libc from the error.

I figured out the mistake. It was on my end of the code, not related to Stan. I have tried various ways of using a dedicated AD tape per thread. My best attempt so far is:

#pragma omp parallel
{
    // Autodiff tape
    static thread_local stan::math::ChainableStack autodiff_tape;

    #pragma omp for private(i, xparticles_target, xparticles_forward) \
    firstprivate(num_params, Tparams) schedule(static)
    for (i = 0; i < loc_n; i++)
    {
        ....
    }
}

I cannot declare autodiff_tape before the parallel region and pass it to the omp parallel for block via private or firstprivate, because that conflicts with the thread_local declaration: the code does not compile when I do that. Either way, I cannot get more than two threads working on the computation in the loop. I do not see any speed-up past these two threads.

Can we please arrange a meeting over Zoom or Teams sometime next week? Then I will show you, what I’m trying to do.

My feeling is that the global structure that holds ChainableStacks for Intel TBB may conflict with the OpenMP declarations. This is my best educated guess.

I’m not understanding what this sentence means. Are you able to run more than two threads?

My feeling is that the global structure that holds ChainableStacks for Intel TBB may conflict with the OpenMP declarations. This is my best educated guess.

TBB should only spin those up when it launches a thread.

Have you tried the OpenMP callback yet? I still think that is what you want. I would comment out the TBB observer and write a new class that does the same thing for OpenMP. Then register the function with OpenMP so it initializes the chainable stack at the start of each new thread (and another function for cleaning up the thread after it closes).
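
By registering I mean something along these lines (an untested sketch of the standard OMPT entry point; my_ompt_initialize, my_ompt_finalize, and the handler bodies are placeholders):

#include <omp-tools.h>

static void on_thread_begin(ompt_thread_t, ompt_data_t*) { /* create this thread's AD tape */ }
static void on_thread_end(ompt_data_t*) { /* release this thread's AD tape */ }

static int my_ompt_initialize(ompt_function_lookup_t lookup, int, ompt_data_t*) {
  // Look up the runtime entry point and register the two thread callbacks.
  auto set_callback = (ompt_set_callback_t)lookup("ompt_set_callback");
  set_callback(ompt_callback_thread_begin, (ompt_callback_t)on_thread_begin);
  set_callback(ompt_callback_thread_end, (ompt_callback_t)on_thread_end);
  return 1;  // non-zero keeps the tool active
}

static void my_ompt_finalize(ompt_data_t*) {}

// The OpenMP runtime looks up this symbol when the program starts.
extern "C" ompt_start_tool_result_t* ompt_start_tool(unsigned int, const char*) {
  static ompt_start_tool_result_t result = {&my_ompt_initialize, &my_ompt_finalize, {0}};
  return &result;
}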

I tried running an experiment with 2, 4, 8 and 16 threads. There is a performance benefit between using 1 and 2 threads. But there is no difference when I use more threads. Performance stays the same as with 2 threads.

I cannot guarantee that somewhere in the Stan code TBB won’t launch a thread.

I still think an OpenMP callback is overkill. But it seems I have no other options.

Dear all,

How are you? I hope all is well with you. I finally implemented a little OpenMP Tool (OMPT) library. It uses OpenMP callbacks to allocate an individual AD tape per thread. The code builds and runs. Unfortunately, I don’t see any performance gain when I use multiple OpenMP threads. Please see the attached plot. I also attach the Stan model file for this experiment.

glmm_poisson2.pdf (9.5 KB)
GLMM_Poisson2.stan (1.3 KB)

I run the code using the following command:

mpiexec -np 1 ./GLMM_Poisson2 data file='GLMM_Poisson2.data.R' method=sample algorithm=smcs proposal=rw T=16 Tsmc=1000 num_samples=1024

The code snippets to register/deregister AD tapes are:

static void on_ompt_callback_thread_begin(
    ompt_thread_t thread_type,
    ompt_data_t  *thread_data)
{
    uint64_t tid = thread_data->value = my_next_id();

    counter[tid].cc.thread_begin += 1;

    printf("[%lu] Status: thread begun.\n", tid);

    observer_openmp.create_ad_tape();
    printf("[%lu] Status: AD tape created.\n", tid);
}

static void on_ompt_callback_thread_end(
    ompt_data_t *thread_data)
{
    uint64_t tid = thread_data->value;

    counter[tid].cc.thread_end += 1;
    printf("[%lu] Status: thread ended.\n", tid);

    observer_openmp.erase_ad_tape();
    printf("[%lu] Status: AD tape erased.\n", tid);
}

The header file for an OpenMP chainable stack is:

#ifndef STAN_MATH_REV_CORE_INIT_CHAINABLESTACK_OPENMP_HPP
#define STAN_MATH_REV_CORE_INIT_CHAINABLESTACK_OPENMP_HPP

#include <stan/math/rev/core/chainablestack.hpp>

// #include <tbb/task_scheduler_observer.h>

#include <memory>
#include <mutex>
#include <unordered_map>
#include <utility>
#include <thread>
#include <tuple>

namespace stan
{
    namespace math
    {
        /**
         * OpenMP observer object which is a callback hook called whenever the
         * OpenMP runtime begins a new thread. This hook ensures that each
         * thread has an initialized AD tape ready for use.
         */
        class ad_tape_observer_openmp final
        {

            using stack_ptr = std::unique_ptr<ChainableStack>;
            using ad_map    = std::unordered_map<std::thread::id, stack_ptr>;

            public:

                ad_tape_observer_openmp() : thread_tape_map_() {
                    printf("Status: Creating AD tape observer (OpenMP).\n");
                }

                ~ad_tape_observer_openmp()
                {
                    printf("Status: Erasing AD tape observer (OpenMP).\n");
                }

                void create_ad_tape()
                {
                    printf("Observer Status: Creating AD tape for thread (OpenMP).\n");

                    std::lock_guard<std::mutex> thread_tape_map_lock(thread_tape_map_mutex_);
                    const std::thread::id thread_id = std::this_thread::get_id();

                    if (thread_tape_map_.find(thread_id) == thread_tape_map_.end())
                    {
                        ad_map::iterator insert_elem;
                        bool status = false;
                        std::tie(insert_elem, status)
                            = thread_tape_map_.emplace(ad_map::value_type{thread_id, nullptr});
                        insert_elem->second = std::make_unique<ChainableStack>();
                    }
                }

                void erase_ad_tape()
                {
                    printf("Observer Status: Erasing AD tape for thread (OpenMP).\n");

                    std::lock_guard<std::mutex> thread_tape_map_lock(thread_tape_map_mutex_);
                    auto elem = thread_tape_map_.find(std::this_thread::get_id());

                    if (elem != thread_tape_map_.end())
                    {
                        thread_tape_map_.erase(elem);
                    }
                }

            private:
                ad_map     thread_tape_map_;
                std::mutex thread_tape_map_mutex_;
        };
    }  // namespace math
}  // namespace stan

#endif

The OpenMP observer is a static variable defined inside the OpenMP tool as:

static stan::math::ad_tape_observer_openmp observer_openmp;

I would like to investigate this issue further. My idea now is to write the simplest possible Stan model with only one loop and one call to Stan’s log_prob() function. Then I can create another variation of this example without log_prob(), replacing it with equivalent custom code. Finally, I will be able to profile these two versions. If you have other ideas or suggestions, I would be glad to hear them. Thank you very much for your help and attention!

Make sure you are using a model where it is worth parallelising. Have a look at the brms vignette on within-chain parallelisation for an example. Also make sure that you do see speedups when using the regular Intel TBB based approach with reduce_sum, and then switch over to OpenMP.

Thank you for your reply. Can you please share the link to this example model? I think my goal is different: I’m not aiming to parallelise within a chain. I would like to launch several threads, each working on its own independent set of particles. Computation within a chain can (should) remain sequential.

Is this across-chain parallelization, or are the particles used to make the next step within a single chain? You may find the PR discussion for multiple chains in parallel useful.


We (a team that includes @mabalenk and me) are interested in running multiple chains in parallel. We will take a look at that discussion.