Scaling the multithreading

So I think Sebastian implemented a TBB multithreaded version of base_nuts.cpp. I'm assuming it's implemented correctly, but I want to see how it scales on the pre-existing models. It's been a while, but I remember that opening a PR would let us see the performance improvements on the existing model database. I have a PR open; does anyone know how I can get it to run the multithreaded version of NUTS on the existing model database?

Here’s the link to the PR. This isn’t meant to be merged, just benchmarking:

I think it was more that he built a multithreaded version of running multiple chains rather than spinning up multiple processes. The main advantage of that is that you can share memory for the data and manage communication at the thread level. I wouldn't expect this to show much difference in performance, if any, because Stan is bound by cache pressure more than by thread/process synchronization.

The cache/memory issue on an MC algorithm is actually why I got laid off…

Sure, I guess you can't really parallelize an RNG itself, but you can parallelize chains. So if you're saying there's no benefit to multithreading Stan's current HMC (NUTS), then I won't proceed.

But there's room for parallelization in matrix decompositions, such as Cholesky decomposition, for which parallel implementations already exist, if that would be a useful contribution. If that's of interest, I could dig through the math library to see what I can do. I'm sure there's a lot of literature on parallelizing matrix factorizations. Or maybe Boost has already implemented this?

And are there any possible ways of threading Stan's current HMC algorithm? Is it worth looking into?

I can go through the Stan library and algorithms and see what I can add. If anyone has more expertise, I'm open to recommendations.

It depends. JAX has built-in parallelization where it can essentially spawn new seeds from an RNG state and generate a bunch of new RNGs that can be used in parallel. We do something similar in Stan by advancing the RNG about 1 trillion draws for each chain (you need the right kind of RNG for that).

There's a small benefit: memory usage can be lower because you don't need to duplicate the data. Otherwise, you're not going to see much of a difference compared to running in multiple processes.

We do have some functions that have multi-threaded implementations. I think those are only going to be useful for performance if you have a lot more cores than you have chains.

I’m not sure what you mean. There’s an advantage in running in multiple threads if you want to share information across chains, such as doing automatic stopping. That’s what I’m doing in the new Walnuts implementation. I was just saying that the advantage of running multiple chains in multiple threads over running multiple chains in multiple processes is minimal unless you’re memory constrained.

Anything that involves tearing open our inference algorithms as written in Stan is going to be painful for development, testing, and getting changes approved, because we're super conservative about changes to the sampling algorithms. This is why I'm developing Walnuts outside the scope of Stan.