Parallelising multipathfinder runs

Hi Stan team,

I’m very excited about the work you all are doing with the new Pathfinder implementation. I’ve set up a workflow like the one you describe in the Zhang et al. (2021) paper, where I use the PSIS draws generated by the Pathfinder algorithm to initialise short MCMC chains. The paper notes that Pathfinder is embarrassingly parallelisable, presumably in part because separate runs of multi-path Pathfinder can run simultaneously on separate cores. Is there a way to do this yet? There doesn’t seem to be the ‘cores’ argument that the MCMC sampling functions have. But I was wondering: is it possible to run several instances of single-path Pathfinder (e.g., as separate jobs on a cluster), save the single-path outputs, and then somehow apply PSIS at the end to the combined outputs? Failing that, is it appropriate to just use the final draw of each single-path Pathfinder run to initialise the MCMC chains without applying PSIS?


Parallelisation in Pathfinder is built on the threading structure, so you need to compile the model with the stan_threads = TRUE flag and use the num_threads argument (that’s what it’s called in cmdstanr, anyway) to specify the number of parallel Pathfinder runs.
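For concreteness, here is a minimal sketch of what that looks like in cmdstanr (the file names `model.stan` and `data.json` are placeholders for your own model and data):

```r
library(cmdstanr)

# Compile with threading support; without this flag, num_threads has no effect
mod <- cmdstan_model("model.stan", cpp_options = list(stan_threads = TRUE))

# Run 4 Pathfinder paths in parallel on 4 threads
fit <- mod$pathfinder(data = "data.json", num_paths = 4, num_threads = 4)
```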


Brilliant. That works perfectly. Thanks!

I ran into this exact same problem recently with cmdstanr. There are two open issues that, when fixed, should improve consistency in how parallelization is handled and warn users who specify num_threads without compiling for multithreading, suggesting something very similar to @StaffanBetner’s solution.


Hi @StaffanBetner. Thank you for your explanation. Should num_threads correspond to the number of paths (num_paths)?

It should correspond to the number of parallel Pathfinder runs.


To follow up on this thread: what is the backend doing if a model has a reduce_sum implementation and we run Pathfinder with multiple paths (num_paths)?

Will num_threads be distributed first to the Pathfinder paths and then to the reduce_sum threads?

As with multi-chain parallelism, it operates using a central thread pool: each path is first allocated a thread, and then, if the individual paths have some internal parallelism (reduce_sum), a path will request additional threads from the pool.

This also means that once a path finishes, the thread(s) it was using become available to the other paths.
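So for a reduce_sum model you would typically give the pool more threads than paths, letting each path use the surplus for within-path parallelism. A hedged sketch in cmdstanr (file names are placeholders; the exact thread-to-path scheduling is handled by the backend, not by the user):

```r
library(cmdstanr)

# Model uses reduce_sum internally; threading must be enabled at compile time
mod <- cmdstan_model("model_reduce_sum.stan",
                     cpp_options = list(stan_threads = TRUE))

# 4 paths share a pool of 8 threads: each path gets one thread first,
# and reduce_sum requests idle threads from the same pool.
# As paths finish, their threads become available to the remaining paths.
fit <- mod$pathfinder(data = "data.json", num_paths = 4, num_threads = 8)
```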