Since cmdstanr can now read and set the step size and the inverse mass matrix for Stan fits, we can do a crude version of a shared warmup. What I am thinking about is how to get better ESS/time for a fixed resource budget of 4 cores using threading. Right now we would by default run 4 chains with full warmup and 4 full sampling phases… but with threading we have the freedom to distribute resources differently. Since within-chain parallelization is less efficient than between-chain parallelization, it's obvious that we should continue running 4 chains during the sampling phase - but not so during warmup. Thus the scheme I am trying out is:
warm up a single chain with 4 threads
read out the step size, mass matrix, and 4 draws from the last phase of warmup
fire off 4 chains to do the sampling
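A minimal cmdstanr sketch of the first two steps (`model.stan` and `data_list` are hypothetical names, and the model is assumed to use reduce_sum or map_rect so that the threads actually help):

```r
library(cmdstanr)

# model compiled with threading support
mod <- cmdstan_model("model.stan", cpp_options = list(stan_threads = TRUE))

# step 1: warm up a single chain with 4 threads; keep a few post-warmup
# draws to use as starting points for the sampling chains later
warm <- mod$sample(
  data = data_list,
  chains = 1,
  threads_per_chain = 4,
  iter_warmup = 1000,
  iter_sampling = 4,
  save_warmup = TRUE
)

# step 2: read out the adapted tuning parameters
step_size  <- warm$metadata()$step_size_adaptation[1]
inv_metric <- warm$inv_metric(matrix = FALSE)[[1]]
```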
As it turns out, it is actually beneficial to run the threaded warmup somewhat longer in order to get a better estimate of the tuning parameters. So what I am comparing now is:
“standard”: 4x in parallel do 1000 warmup & 1000 sampling
“threaded”: 1x 3000 warmup with 4 threads and then 4x 1000 sampling (using the same step size and inverse mass matrix, but different initial values)
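Continuing the sketch above for the “threaded” variant (with `iter_warmup = 3000` in the warmup run), the sampling phase then becomes something like the following; building the initial values from the post-warmup draws is model-specific, so `init_list` stays a placeholder here:

```r
# sampling only: adaptation off, tuning parameters taken from the
# single threaded warmup run, 4 different starting points
fit <- mod$sample(
  data = data_list,
  chains = 4,
  parallel_chains = 4,
  iter_warmup = 0,
  adapt_engaged = FALSE,
  step_size = step_size,
  inv_metric = inv_metric,
  init = init_list,  # one init per chain, e.g. from warm's 4 draws
  iter_sampling = 1000
)
```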
The “threaded” version seems to give about 10% larger ESS/time in the example I picked. Admittedly, the example I chose scales relatively efficiently with more cores, so that speedup does not translate 1:1 to other models.
Still, this could be an interesting way to sample, and it would be nice if our interfaces made it easier to do. For example - @paul.buerkner - would it be possible for the update method to take over the tuning parameters (step size and inverse metric) from a previous fit?
Only cmdstanr can start a chain with a given inverse metric. I think rstan can only read it out, but cannot start sampling with a user-provided one (at the moment).
So this would for now be a cmdstanr-only feature, if that's feasible for brms.
That is feasible. Is there an example of how to restart a chain with a given inverse metric? Since this warmup-via-threading feature requires a little more than just running update, perhaps it makes sense to open an issue for brms in which we discuss all the required details.
I might be out of my depth here, but this seems potentially risky unless there’s a good way to assess the convergence of the warmup chain prior to initiating the sampling, especially where multimodality is a possibility. Relatedly, I guess it might be important to double-check that the autocorrelation in the warmup chain is low enough that you’re not starting all four chains from adjacent points in the posterior, irrespective of convergence (maybe this is essentially guaranteed by NUTS as long as the max treedepth isn’t being exceeded?).
If these checks can be performed and met, does it not suggest that an efficient way to use many cores (say 100) would be to perform multi-threaded warmup on one (or several) chains, then run them for 100 iterations, and then fire off 100 independent chains to sample for a few iterations each? Importantly, these chains would require no further communication and could therefore be distributed across multiple computers wherever there are idle cores sitting around. If that strikes you as acceptable statistical practice, I might try to do that myself for some very long compute-time models that I’m grappling with.
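One crude way to quantify the autocorrelation concern with the posterior package (assuming the warmup chain was kept running for, say, 100 post-warmup iterations as proposed above, in a fit object `warm` like the one sketched earlier):

```r
library(posterior)

# rough autocorrelation time of the post-warmup draws, measured on lp__
lp  <- extract_variable_matrix(warm$draws("lp__"), "lp__")
n   <- nrow(lp)
ess <- ess_basic(lp)
act <- n / ess  # iterations per approximately independent draw

# starting points should be spaced further apart than act, so the short
# run supports roughly `ess` approximately independent starting points
idx <- round(seq(1, n, length.out = min(100, floor(ess))))
```

If `ess` comes out far below the number of chains you want to launch, the short run would need to be longer before its draws can serve as independent starting points.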
Using fewer independent chains for warmup bears some risk, of course - it makes the warmup less robust to some extent. I haven't seen many issues with NUTS and autocorrelation, and you can actually take draws towards the end of the warmup, which should already come from the posterior itself, so that they make good starting points for a following sampling phase.
From playing a bit with this, I think that the approach of running few (or just one) chains for warmup with many cores and then starting independent sampling on each core is only useful if your model requires you to run as short a warmup as possible. It seems better to me to run one chain for somewhat more warmup iterations, which is made affordable by shortening the runtime with more resources.
All of these thoughts are not about having a large CPU count available, but rather about very limited resources like 4 cores. If you have 100 cores, then maybe just use 4 cores per chain and start 25 of these… done.
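In cmdstanr terms that is just (same hypothetical `mod` and `data_list` as above):

```r
# 100 cores: 25 chains in parallel with 4 threads each, standard warmup
fit <- mod$sample(
  data = data_list,
  chains = 25,
  parallel_chains = 25,
  threads_per_chain = 4,
  iter_warmup = 1000,
  iter_sampling = 1000
)
```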
Would it make any sense to first do a single chain of multi-core warmup with lots of iterations as the OP proposes, then do multiple chains of single-core warmup with very few iterations but starting from different initializations, and use the history of the multi-core warmup to discern whether the single-core warmups are heading to the same state as the multi-core one (suggesting that the multi-core warmup is not initialization-influenced)? The number of iterations in the single-core warmups could even be informed by the history of the multi-core warmup (i.e. more iterations if the multi-core history suggests a noisier warmup process; I might be betraying my ignorance of what actually happens during warmup on this last bit though).
(And if all is good in warmup, finally sample with multiple parallel single-core chains to achieve the desired ESS.)
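A crude version of that comparison, just on lp__ (hypothetical objects: `long_warm` is the multi-core warmup run, `short_fits` a list of the short single-core warmup runs, both sampled with `save_warmup = TRUE`):

```r
library(posterior)

# lp__ range explored in the final window of the long multi-core warmup
lp_long <- extract_variable(long_warm$draws("lp__", inc_warmup = TRUE), "lp__")
target  <- range(tail(lp_long, 200))

# final lp__ of each short single-core warmup
lp_short <- vapply(short_fits, function(f) {
  lp <- extract_variable(f$draws("lp__", inc_warmup = TRUE), "lp__")
  tail(lp, 1)
}, numeric(1))

# flag short warmups that ended outside the range the long warmup explored
which(lp_short < target[1] | lp_short > target[2])
```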