What is the main bottleneck in parallelizing Stan?

Dear Stan Community,

Just out of curiosity, what is the showstopper when it comes to parallelizing Stan? Say, running it in the cloud on 5,000 cores, or on 50 GPUs?

Would it not be “enough” to “just” start more chains, independently?

(Ignoring problems with mixture models and related landscape-symmetry issues, as in, for example, the 2D Ising model below the critical temperature, where a chain is practically guaranteed to get “stuck” in one half of the phase space: in the thermodynamic limit, exploring the whole space takes forever, and even in large finite systems escaping is exponentially unlikely as a function of the number of spins.)

Or is the calculation of the gradients the showstopper ?

Why would the “starting 1000 chains” approach not work? I wonder.

Cheers,

Jozsef

There’s a discussion on this over here

As I understand it, parallelism has taken great strides in the last year or so on three fronts: GPU, MPI and threading.

The GPU stuff requires at least one GPU per chain, so that means either sampling chains serially or shelling out for multiple GPUs (which may be tenable for some, especially if renting from a cloud compute service), and because of the overheads the GPU acceleration is only worth it for certain kinds of model/data combinations (e.g., large matrix operations). But for models like Gaussian processes, it’s really exciting.
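For concreteness, here’s a minimal sketch (my own illustration, not from this thread) of the kind of model where those large matrix operations dominate: a GP regression whose N-by-N Cholesky factorization is the step the OpenCL backend can offload, assuming a CmdStan build with STAN_OPENCL enabled and function names from recent Stan releases (gp_exp_quad_cov was called cov_exp_quad in older versions).

```stan
data {
  int<lower=1> N;
  array[N] real x;
  vector[N] y;
}
parameters {
  real<lower=0> alpha;   // marginal standard deviation
  real<lower=0> rho;     // length scale
  real<lower=0> sigma;   // observation noise
}
model {
  // building and factoring the N x N covariance matrix is the expensive
  // part; cholesky_decompose is one of the operations the OpenCL backend
  // can offload to the GPU
  matrix[N, N] K = gp_exp_quad_cov(x, alpha, rho)
                   + diag_matrix(rep_vector(square(sigma), N));
  matrix[N, N] L_K = cholesky_decompose(K);

  alpha ~ normal(0, 1);
  rho ~ inv_gamma(5, 5);
  sigma ~ normal(0, 1);
  y ~ multi_normal_cholesky(rep_vector(0, N), L_K);
}
```

Note the model itself doesn’t change for the GPU case; with an OpenCL-enabled build the offload happens inside the math library.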

I haven’t looked into the threading or MPI stuff much, but I gather that threading is for accelerating on a single local machine with at least 2 physical cores per chain available (though has anyone checked whether hyper-threading helps at all in the 1-physical-core-per-chain case?), while MPI is for working on a cluster.

There are also some pretty important speedups coming in the form of parallel warmup, where separate chains share information about the geometry of the problem, allowing all chains to get to sampling sooner.

If you don’t bother with any of the above and just throw lots of chains on lots of cloud cores, you definitely don’t want to try to get lots of effective samples per chain; some might naively think that more samples is better, but after a few thousand effective samples you’re really not gaining much inferentially and just giving yourself lots of post-processing headaches.

Sometimes folks have models that sample inefficiently and try to overcome this by just grabbing lots of samples, but an inefficiently sampling model is usually a sign that something’s wrong and the model’s structure/parameterization needs to be reconsidered. With an efficiently sampling model, you could throw it on lots of independent cores and only grab a few samples per chain after warmup, but you still have to go through the entire warmup on each chain, which is pretty wasteful: with the default 1000 warmup iterations and, say, only 10 post-warmup draws per chain, over 99% of every chain’s iterations are spent on warmup.

I have four cores with hyperthreading, and running 5 processes in things like make seems about optimal. So hyperthreading gives nothing like a doubling. I believe the same logic will hold for MPI.

At the same time, using map_rect even without MPI or threading can help on a single core by evaluating the partial derivatives of each shard on the fly, which reduces the memory footprint and improves locality, so speed is also helped.
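In case a concrete shape helps, here’s a minimal sketch of how a likelihood gets sharded with map_rect; the model, the data layout, and names like shard_lp are illustrative only, and the syntax assumes a recent Stan release. The same program evaluates the shards serially on one core, or in parallel when the binary is built with STAN_THREADS or STAN_MPI.

```stan
functions {
  // log likelihood of one shard: phi carries the shared parameters,
  // theta the shard-specific offset, x_r the shard's observations,
  // x_i the shard's integer data (just the shard size here)
  vector shard_lp(vector phi, vector theta,
                  data array[] real x_r, data array[] int x_i) {
    real mu = phi[1];
    real sigma = phi[2];
    real lp = normal_lpdf(to_vector(x_r) | mu + theta[1], sigma);
    return [lp]';
  }
}
data {
  int<lower=1> K;        // number of shards
  int<lower=1> n;        // observations per shard (rectangular layout)
  array[K, n] real y;    // one row of data per shard
}
transformed data {
  array[K, 1] int x_i;
  for (k in 1:K) x_i[k, 1] = n;   // pack integer data per shard
}
parameters {
  real mu;
  real<lower=0> sigma;
  array[K] vector[1] theta;       // one offset per shard
}
model {
  vector[2] phi = [mu, sigma]';
  mu ~ normal(0, 1);
  sigma ~ normal(0, 1);
  for (k in 1:K) theta[k] ~ normal(0, 1);
  // each shard's log likelihood (and its partials) is evaluated
  // independently; with STAN_THREADS or STAN_MPI the shards run in parallel
  target += sum(map_rect(shard_lp, phi, theta, y, x_i));
}
```

The rectangular packing (every shard gets the same-sized slice of data) is what the _rect in the name refers to; ragged data has to be padded or packed by hand.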

@betanalpha and @yuling have been schooling us all on how grabbing only a few draws per chain can lead to bias; see the discussion @stevebronder linked above.

We don’t have any concrete plans along these lines, but I think this has a lot of potential. Once adaptation has converged and we have one effective draw not biased more than epsilon by the starting point (again, see @betanalpha’s discussion linked above), we have embarrassingly parallel sampling using multiple independent chains.


Oh! I saw mention of this somewhere and it sounds super worthwhile; I definitely need to dive into map_rect now. Are there any demos anywhere benchmarking the kinds of speedups to expect?

Guess I need to get on the school bus too!

It really depends on how compute intensive the shards are. When they’re solving differential equations, the speedup’s pretty close to the number of cores. When they’re doing something simple like matrix-vector multiplication or they require lots of packing/unpacking and communication of parameters, there’s not nearly the same speedup.

Aside from the user’s guide, @richard_mcelreath wrote a nice tutorial.


Sorry for being so late to the party. A few sudden personal “things” happened two days ago, so I cannot carefully read and comment on your kind responses right now. Please give me a week; thank you for your responses, and I will get back to this thread once the personal issues have settled. Kind Regards, Jozsef
