New adaptive warmup proposal (looking for feedback)!

I think what you wrote for the cross-chain warmup in MPI is a good enough abstraction there. It doesn’t need to be terribly complicated cause the warmup itself isn’t terribly complicated.

But there is a threading vs. MPI decision happening here, and that makes me want to think about the other bits of Stan cause they’re more complicated (the threading solution to cross-chain warmup would also probably be pretty simple).

Right now our warmup is automated in a way that makes me think this isn’t necessary. It might be though. You could do this with dense vs. diagonal, but I think we have a way to know which would be better before actually collecting the sample (that’s included here and it’s not super robust).

This makes sense in terms of parallelizing calculations in Stan Math at the double level. It’s not obvious to me these things are going to be hugely helpful specifically for cross-chain adaptation or the autodiff stack. Similar to the GPU stuff.

Doing threading with autodiff variables means the autodiff library needs to be threadsafe as well, which wouldn’t be trivial.
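To make that a bit more concrete, here’s a rough sketch of the simplest version I can think of: give every thread its own tape with `thread_local`, so recording needs no locks. `Tape`, `local_tape`, `record_ops`, and `run_threads` are made-up names for illustration, not Stan Math’s API, and the "operations" are just dummy doubles. It only covers the easy half (independent recording); combining gradients across threads is the part that would need real design work.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in for an autodiff tape: one entry per recorded operation.
struct Tape {
  std::vector<double> partials;
};

// thread_local gives each thread its own Tape instance,
// so threads never touch each other's recording state.
inline Tape& local_tape() {
  thread_local Tape tape;
  return tape;
}

// Record n dummy operations on the calling thread's tape; return the tape size.
std::size_t record_ops(std::size_t n) {
  for (std::size_t i = 0; i < n; ++i) {
    local_tape().partials.push_back(static_cast<double>(i));
  }
  return local_tape().partials.size();
}

// Launch n_threads workers; worker t records (t + 1) * 100 operations.
// Returns the final tape size seen by each worker.
std::vector<std::size_t> run_threads(std::size_t n_threads) {
  assert(n_threads > 0);
  std::vector<std::size_t> sizes(n_threads, 0);
  std::vector<std::thread> workers;
  for (std::size_t t = 0; t < n_threads; ++t) {
    // Each worker writes only to its own slot of sizes, so no locking is needed.
    workers.emplace_back([t, &sizes] { sizes[t] = record_ops((t + 1) * 100); });
  }
  for (auto& w : workers) {
    w.join();
  }
  return sizes;
}
```

Calling `run_threads(4)` from a main thread that hasn’t recorded anything leaves the main thread’s own tape empty, and each worker sees only the operations it recorded itself.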

Yeah, but we want within-chain parallelization too.

If we go TBB for the cross-chain warmup, it’s not obvious to me how the rest of parallelization in Stan works. In the worst case we have threading/GPU (double math) inside MPI/threading (map_rect) inside threading (cross-chain).

I don’t know what implications that has for the autodiff stack, but it sounds complicated. I don’t know if each thread can have its own MPI zone to work in. This seems like it could limit our within-chain scalability.

I do know that if we figure out what the dataflow in a parallel autodiff tree looks like, and combine that with what Yi proposed for cross-chain warmup, then we don’t have to figure out how to thread autodiff and we know we have scalability.

Anyway I just wanna know how all the MPI stuff works now. There must be some sort of mechanism by which one computer tells other computers to run code. I don’t know how the entry point stuff runs either – when I’ve done MPI before everything runs the same binary. So where do the other processes sit? I assume this is embedded in the language somehow.