STAN_NUM_THREADS and num_threads

I’m trying to make sense of the results below and figure out how to handle STAN_NUM_THREADS and num_threads in my StanSample.jl Julia package.

My questions:

  1. Is it correct that, if the STAN_NUM_THREADS environment variable is not set, the .csv files still report NUM_THREADS=1 even when the command line argument num_threads is used, e.g. fitzhughnagumo num_threads=4 sample ...?

  2. Currently StanSample.jl uses STAN_NUM_THREADS if it is defined. For a set of tests I changed that so STAN_NUM_THREADS (“SNT” in the tables below) follows num_threads (“threads” in the tables), so the two are always identical. This is also reflected in the NUM_THREADS value in the chain .csv files. Is this a correct way to manipulate STAN_NUM_THREADS?

  3. For the tables below I draw 10000 samples in a single chain, 2500 samples in each of 4 chains, or 1250 samples in each of 8 chains, combine the chains into a single DataFrame, record 4 elapsed times, and compute the mean of these elapsed times. Is this a correct way to obtain 10000 draws?

  4. The results in the tables are interesting. I had expected that cmdstan (or maybe Julia) would need to provide the threads, but that seems not to be the case. My conclusion is that I get the benefits of Julia’s run() command (which runs the command as an immediate child process of Julia, using fork and exec calls).

  5. If that is the case, would this be an option internally in Stan’s cmdstan?
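The draw split described in question 3 can be sketched as below. This is a minimal Python illustration and not StanSample.jl code; the function names `draws_per_chain` and `combine_chains` are hypothetical, and chains are represented as plain lists rather than DataFrames.

```python
def draws_per_chain(total_draws: int, num_chains: int) -> int:
    """Return how many draws each chain must produce so the pooled total is exact."""
    if num_chains < 1:
        raise ValueError("num_chains must be at least 1")
    if total_draws % num_chains != 0:
        raise ValueError("total_draws must be divisible by num_chains")
    return total_draws // num_chains


def combine_chains(chains):
    """Concatenate per-chain draw lists into one pooled list (hypothetical helper)."""
    pooled = []
    for chain in chains:
        pooled.extend(chain)
    return pooled


if __name__ == "__main__":
    # The three configurations used in the timing tables.
    for num_chains in (1, 4, 8):
        print(num_chains, draws_per_chain(10_000, num_chains))
```

Pooling draws this way gives 10000 post-warmup draws in every configuration, but each chain still runs its own warmup, so the total work is not identical across configurations.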

Any feedback is highly appreciated. The current test is an example in DiffEqBayesStan.jl. I also want to run a similar set of tests on the red cards data and, later on, check some of this on the Julia forum.

-------------------------- TABLE --------------------

JULIA_NUM_THREADS = 10

9×9 DataFrame
 Row │ SNT    threads  chains  samples  time_1    time_2    time_3    time_4   mean
─────┼─────────────────────────────────────────────────────────────────────────────────
   1 │     1        1       1    10000  42.3248   36.6654   39.8239   37.3989  39.0532
   2 │     1        1       4    10000  16.8682   14.8487   14.0331   14.8553  15.1513
   3 │     1        1       8    10000   9.83072   9.72722   9.10842  11.134    9.95008
   4 │     4        4       1    10000  36.1733   42.266    34.3276   39.0783  37.9613
   5 │     4        4       4    10000  15.3396   16.7139   13.6394   14.3342  15.0068
   6 │     4        4       8    10000  10.0435    9.98177   9.82626  10.246   10.0244
   7 │     8        8       1    10000  48.2951   34.1351   38.4091   35.1433  38.9956
   8 │     8        8       4    10000  16.6352   14.9329   13.9775   15.2577  15.2008
   9 │     8        8       8    10000   9.74233  10.0634    9.70536  10.0111   9.88054

JULIA_NUM_THREADS = 4

9×9 DataFrame
 Row │ SNT    threads  chains  samples  time_1    time_2   time_3    time_4   mean
─────┼────────────────────────────────────────────────────────────────────────────────
   1 │     1        1       1    10000  38.8417   34.8116  40.922    35.1548  37.4325
   2 │     1        1       4    10000  14.3087   14.1105  14.6233   14.6406  14.4208
   3 │     1        1       8    10000  10.0635   10.2226  10.6875   10.2459  10.3049
   4 │     4        4       1    10000  36.8625   52.1193  40.957    40.2237  42.5406
   5 │     4        4       4    10000  13.7354   14.5204  17.5627   14.3537  15.0431
   6 │     4        4       8    10000   9.50104  10.3231  10.7328   11.1363  10.4233
   7 │     8        8       1    10000  42.508    43.9822  34.1056   34.2806  38.7191
   8 │     8        8       4    10000  16.2068   14.4028  15.0074   16.6844  15.5754
   9 │     8        8       8    10000  10.2654   10.8903   9.45841  10.4184  10.2581

JULIA_NUM_THREADS = 1

9×9 DataFrame
 Row │ SNT    threads  chains  samples  time_1    time_2    time_3    time_4    mean
─────┼──────────────────────────────────────────────────────────────────────────────────
   1 │     1        1       1    10000  31.6349   32.2269   39.5003   33.8465   34.3021
   2 │     1        1       4    10000  13.1042   12.5378   12.0559   12.1937   12.4729
   3 │     1        1       8    10000   8.65819   8.85462   8.29159   8.5041    8.57712
   4 │     4        4       1    10000  32.9886   38.8642   36.1392   41.9363   37.4821
   5 │     4        4       4    10000  12.4424   13.13     12.3069   12.0543   12.4834
   6 │     4        4       8    10000   9.39871   9.1411    8.82557   8.17123   8.88415
   7 │     8        8       1    10000  38.5501   36.1817   33.7262   33.6519   35.5275
   8 │     8        8       4    10000  12.1651   14.5109   14.3144   12.7943   13.4462
   9 │     8        8       8    10000   9.12816   8.51613   8.50016   8.5085    8.66324
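One way to read these tables is to group the mean times by chain count and look at how much they move across thread settings. The sketch below does this for the JULIA_NUM_THREADS = 10 table; the values are copied from the "mean" column, and the helper name `spread_by_chains` is hypothetical.

```python
# (threads, chains) -> mean elapsed seconds, copied from the JULIA_NUM_THREADS = 10 table.
MEANS = {
    (1, 1): 39.0532, (1, 4): 15.1513, (1, 8): 9.95008,
    (4, 1): 37.9613, (4, 4): 15.0068, (4, 8): 10.0244,
    (8, 1): 38.9956, (8, 4): 15.2008, (8, 8): 9.88054,
}


def spread_by_chains(means):
    """For each chain count, return (min, max) of the mean time over all thread settings."""
    grouped = {}
    for (_threads, chains), seconds in means.items():
        grouped.setdefault(chains, []).append(seconds)
    return {chains: (min(v), max(v)) for chains, v in sorted(grouped.items())}


if __name__ == "__main__":
    for chains, (lo, hi) in spread_by_chains(MEANS).items():
        print(f"chains={chains}: min={lo:.3f}s max={hi:.3f}s range={hi - lo:.3f}s")
```

In this table the range within each chain count is small compared with the gap between chain counts, which is consistent with the observation that the thread setting had little effect in these runs.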

I would recommend using only the num_threads argument going forward and dropping the STAN_NUM_THREADS environment variable. Whenever both are provided, they must match.

The idea of num_threads is to set the total number of available worker threads, which are dynamically allocated to chains and/or to threads within chains (giving priority to starting more chains at the same time). I would be surprised if forking from Julia were more efficient, since the Intel TBB handles the worker threads using a thread pool, which should be an efficient thing to do here. Could you possibly graph the table for easier reading?
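For comparison, the process-per-chain pattern from the original post (one child process per chain, as with Julia’s run()) looks roughly like the sketch below. It is written in Python purely for illustration and is not how StanSample.jl or cmdstan is implemented; `placeholder_command` is a hypothetical stand-in for the per-chain sampler invocation, and here it only runs a trivial Python one-liner.

```python
import subprocess
import sys


def placeholder_command(chain_id):
    """Hypothetical stand-in for a per-chain sampler command line."""
    return [sys.executable, "-c", f"print('chain {chain_id} done')"]


def launch_chains(commands):
    """Start one OS process per command, then wait for all of them to finish.

    All processes are started before any is waited on, so they run concurrently
    and the operating system schedules them across the available cores.
    """
    procs = [subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True) for cmd in commands]
    results = []
    for proc in procs:
        out, _ = proc.communicate()  # fine for small outputs like these
        results.append((proc.returncode, out))
    return results


if __name__ == "__main__":
    commands = [placeholder_command(i) for i in range(1, 5)]
    for returncode, out in launch_chains(commands):
        print(returncode, out.strip())
```

With this pattern the parallelism comes from the operating system scheduling independent processes, so it does not depend on any thread setting inside each process.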

Thanks for your response. A few weeks ago I couldn’t get TBB up and running and wondered if that was a macOS/M1 issue. I’ll go back and try again.

I would also prefer to use only num_threads, particularly in this case, where I’m modifying the STAN_NUM_THREADS environment variable. I’ll check once more whether using only num_threads still reports num_threads=1 in the chain .csv files.

Attached is a very simple plot. X-axis: num_chains, Y-axis: elapsed time.

one_julia_thread

The num_threads reporting is fine when using only num_threads on the command line. I’m not sure what I did wrong earlier, but I guess I just looked at the wrong .csv files.

I’ll update StanSample.jl accordingly.

As far as TBB is concerned, if I run make clean-all and enable the following 3 lines in make/local:

# Enable the MPI backend (requires also setting (replace gcc with clang on Mac)
STAN_MPI=true
CXX=mpicxx
TBB_CXX_TYPE=clang

building cmdstan-2.28.2 fails:


<snipped lots of lines>

common.copy /Users/rob/Projects/StanSupport/cmdstan/stan/lib/stan_math/lib/boost_1.75.0/stage/lib/cmake/boost_wserialization-1.75.0/boost_wserialization-config.cmake
boost-install.generate-cmake-config-version- bin.v2/libs/serialization/build/stage/boost_wserialization-config-version.cmake
common.copy /Users/rob/Projects/StanSupport/cmdstan/stan/lib/stan_math/lib/boost_1.75.0/stage/lib/cmake/boost_wserialization-1.75.0/boost_wserialization-config-version.cmake
boost-install.generate-cmake-variant- bin.v2/libs/serialization/build/clang-darwin-13.0/release/cxxstd-11-iso/threading-multi/visibility-hidden/libboost_wserialization-variant-shared.cmake
common.copy /Users/rob/Projects/StanSupport/cmdstan/stan/lib/stan_math/lib/boost_1.75.0/stage/lib/cmake/boost_wserialization-1.75.0/libboost_wserialization-variant-shared.cmake
...failed updating 1 target...
...skipped 9 targets...
...updated 219 targets...
make: *** [stan/lib/stan_math/lib/boost_1.75.0/stage/lib/libboost_mpi.dylib] Error 1
rob@Rob-16-MBP-2 cmdstan % 

The timing results I’m getting with the updated StanSample.jl:

16×5 DataFrame
 Row │ num_threads  num_chains  num_samples  mean     std      
     │ Int64        Int64       Int64        Float64  Float64  
─────┼─────────────────────────────────────────────────────────
   1 │           1           1        10000  39.607   3.43717
   2 │           1           2         5000  24.2791  2.01681
   3 │           1           4         2500  14.9814  0.891129
   4 │           1           8         1250  10.3624  0.450913
   5 │           2           1        10000  41.2367  3.55019
   6 │           2           2         5000  24.9079  2.68023
   7 │           2           4         2500  19.3467  2.18146
   8 │           2           8         1250  15.184   0.625887
   9 │           4           1        10000  39.6162  5.38162
  10 │           4           2         5000  24.0673  2.91344
  11 │           4           4         2500  16.1526  0.94486
  12 │           4           8         1250  14.1178  0.95978
  13 │           8           1        10000  38.5951  5.02319
  14 │           8           2         5000  24.0275  2.81166
  15 │           8           4         2500  15.9257  1.45381
  16 │           8           8         1250  14.4645  0.606929
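To summarize how well the extra chains pay off, the num_threads = 1 rows can be turned into speedup and parallel efficiency figures relative to the single-chain run. The sketch below uses the mean times copied from those four rows; the function name `speedups` is hypothetical.

```python
# (num_chains, mean elapsed seconds) for the num_threads = 1 rows of the table above.
ROWS = [(1, 39.607), (2, 24.2791), (4, 14.9814), (8, 10.3624)]


def speedups(rows):
    """Return {num_chains: speedup} relative to the single-chain mean time."""
    baseline = dict(rows)[1]
    return {chains: baseline / seconds for chains, seconds in rows}


if __name__ == "__main__":
    for chains, speedup in speedups(ROWS).items():
        efficiency = speedup / chains  # 1.0 would be perfect scaling
        print(f"chains={chains}: speedup={speedup:.2f}x efficiency={efficiency:.2f}")
```

Efficiency below 1.0 is expected here, since every chain repeats its own warmup regardless of how few post-warmup draws it contributes.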

A (slightly better) graph based on the above results:

timing_results

As I mentioned a few weeks ago, I also wanted to take a look using the red cards study. Below are the results for M1/arm.

arm_log_0
arm_log_1

These are the results on my Intel-based MacBook Pro (3 years old, 8 cores) using TBB.

intel_tbb_log_0
intel_tbb_log_1

And finally, the results on the same Intel-based MacBook Pro (3 years old, 8 cores) without TBB:

intel_log_0
intel_log_1