Documenting / rethinking environment var STAN_NUM_THREADS

I’m spinning this off from the discussion "Help with naming threading argument" instead of just hijacking that thread.

Agreed, the current way of doing things seems ad hoc, and it would be preferable to set this up via the services interface - how would this change the initialization routines?

A quick grep of the math lib shows that the STAN_NUM_THREADS environment variable gets picked up here: stan/math/prim/core/init_threadpool_tbb.hpp, and the constraints are:

 * - STAN_NUM_THREADS is not defined => num_threads=1
 * - STAN_NUM_THREADS is positive => num_threads is set to the
 *   specified number
 * - STAN_NUM_THREADS is set to -1 => num_threads is the number of
 *   available cores on the machine
 * - STAN_NUM_THREADS is not numeric => throws an exception
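A minimal stdlib-only sketch of those rules (this is not the actual code in `init_threadpool_tbb.hpp`, and the helper name `resolve_num_threads` is made up for illustration):

```cpp
#include <cstdlib>
#include <stdexcept>
#include <string>
#include <thread>

// Hypothetical helper mirroring the documented STAN_NUM_THREADS rules;
// the real logic lives in stan/math/prim/core/init_threadpool_tbb.hpp.
int resolve_num_threads(const char* env_value) {
  if (env_value == nullptr) {
    return 1;  // not defined => num_threads = 1
  }
  std::size_t pos = 0;
  int n = 0;
  try {
    n = std::stoi(env_value, &pos);
  } catch (const std::exception&) {
    throw std::invalid_argument("STAN_NUM_THREADS is not numeric");
  }
  if (pos != std::string(env_value).size()) {
    throw std::invalid_argument("STAN_NUM_THREADS is not numeric");
  }
  if (n == -1) {
    // -1 => use all available cores on the machine
    return static_cast<int>(std::thread::hardware_concurrency());
  }
  if (n > 0) {
    return n;  // positive => use exactly that many threads
  }
  throw std::invalid_argument("STAN_NUM_THREADS must be positive or -1");
}
```

You would call it as `resolve_num_threads(std::getenv("STAN_NUM_THREADS"))`.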

For now, this needs to be added to the CmdStan documentation. It’s not in the discussion of parallelization in the Stan User’s Guide - perhaps this is too implementation-specific a detail? We need more user-facing documentation - not sure where it should go.

Just for the record: originally threading was merely optional, and the idea was to follow OpenMP conventions, which work via similarly named environment variables.

At a later stage we introduced the TBB, and going through its docs suggests that it is a bad idea to actually limit the number of threads being used - TBB is designed around a large thread pool. Moreover, the docs also suggest controlling the concurrency level from within the program itself rather than via an environment variable or any other external mechanism.

Another upside of having the variable was (and is) that threading for the purpose of gradient evaluation stayed fully transparent to the services and all the interfaces. If we now plumb the concept of threads into the services and hence the interfaces, then threads are probably no longer optional - or at least it would be a big call to keep all that threading code optional.

Finally… it smells like a lot of work to make threads an integral part of the services. Maybe it’s less than I think, but I fear it isn’t, which makes it a resource problem.

I hope this helps a bit to go forward here.

EDIT: A good first application of - and reason for - threading being part of the services layer would have been the shared-warmup work, I originally thought… but that work is a bit on hold for now and explored things with MPI anyway. It would still be useful to have the services handle something like "run 4 chains with a thread pool of size 10". The TBB would then be able to distribute the work, and slower chains could benefit from resources freed up by faster chains - at least that’s the hope.

We could also just add an argument to CmdStan under sample that CmdStan would use when initializing the thread pool.

This would avoid going into the services level, but that sounds hacky.

I understand that setting the number of threads is not something most CPU parallelization frameworks make a huge effort to expose. But to me that’s mostly because they focus on "max power". While we also want max power, it would be nice to be user-friendly for those who want parallelism but also want to be able to work on other things on their laptop, for example.

Also, an issue for that is already open:


I think using environment variables is cumbersome and not what R/Python users are accustomed to. And that is who our audience is.

I also understand why it was done this way. Partly it’s that environment variables are sometimes used in parallel frameworks, and partly that it was easier for testing/development in Math because of the 3-repo structure. The latter is also why the OpenCL context is initialized the way it is, which could be done more simply if developed as a Stan-first rather than a Math-first component.

EDIT: it’s also why the MPI caching implementation is fairly complex - one of the reasons I am pushing for a monorepo.


I agree with all of Rok’s points - explaining environment variables to users is sub-optimal, and the services layer is not the right place for this either. I agree that it should be passed to CmdStan via a command-line argument, and command.hpp should do the right thing.

Please forgive my ignorance here - are GPU and TBB different alternatives? Can you have TBB threads on a GPU?


Not alternatives - you can have both: TBB for CPU parallelism and OpenCL for GPU parallelism.

Just add an nthreads argument to CmdStan that gets passed to tbb::task_scheduler_init. Without user input, TBB defaults to automatic mode (number of logical cores minus one).
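A rough sketch of how such an argument could pick the concurrency level, assuming a hypothetical `num_threads` CLI value and the "cores minus one" fallback described above (the TBB call itself is shown only as a comment, since this is not the actual CmdStan code):

```cpp
#include <string>
#include <thread>

// Hypothetical: choose the thread count from an optional CLI argument.
// With no user input, fall back to something like TBB's automatic mode
// (roughly: number of logical cores minus one, but at least 1).
int pick_num_threads(const char* cli_value) {
  if (cli_value == nullptr) {
    unsigned hw = std::thread::hardware_concurrency();
    return hw > 1 ? static_cast<int>(hw) - 1 : 1;
  }
  return std::stoi(cli_value);
}

// The chosen value would then be handed to the scheduler, e.g.:
//   tbb::task_scheduler_init init(pick_num_threads(cli_value));
```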


For the record, NUM_THREADS is no longer popular in OpenMP either; nowadays people use omp_get_num_procs().