Actually @stevebronder did you work on this? Including you here too!
I know a little bit, need to review some other code in math right now then I’ll have a look. I think this is a very nice fit in the multi chain stuff we have been talking about lately
Mpi does not work at the moment on windows since we haven’t figured out the build stuff there yet.
A feature like parallel warmup should by all means work on all platforms in an easy to use setup and should not rely on a hard to setup thing like mpi. In short, I strongly suggest to do a thread based intel tbb implementation first…or mpi support needs to be improved substantially first (supporting windows and making mpi installs easy for all users on all platforms).
Oh this is the boost MPI thing, right?
I’m probably not properly appreciating the Cran + Windows combo, but this hasn’t been my experience with MPI in Linux. Though I suppose TBB is easier in that it’s just a shared library.
I think architecturally a simple, threaded reverse mode would probably look pretty similar in terms of where data needs to go and what interactions we need with the compiler. They would differ in how the autodiff stack is managed for sure.
I want to talk about each worker maintaining their own autodiff stacks which the autodiff stack on a manager node would reference. If the specific effort I make gets thrown away, that’s okay. The upside for MPI is higher, right?
I was handwaving mpi to just mean parallel here, but yeah we should use the TBB.
I skimmed over the R code for your stuff. I think if we write it up well it should be pretty easy to put parallel stuff in it
If you are planning for MPI cross-chain warmup I suggest you design from scratch(which I’ve done but new ideas always help), instead of looking at how it works in math and try to adapt to it. Because I believe it should be the math’s design follow what you want in chain level(and refactor the math if necessary), not the other way around. I’d be happy to discuss it.
One thing I was wondering that I think is semi-related, have we ever thought about doing something like model based optimization (like what mlrMBO does)? i.e. start N chains up for 100 steps with different settings, build a quick dirty little model to predict ess and rhat given the tuning parameters and then have the next 100 adaptation steps start with some variants of the settings we think will give the max ess and min rhat?
I believe that with my new knowledge about building shared libraries on Windows from the TBB ride, one could probably reattempt to get things working on windows…it’s just that I never did setup MPI on Windows.
It’s a lot more than the build convenience (which is already big). With the TBB you have a huge library available to use for parallelizing your code based on pipelines, paralllel for / reduce, graph flow stuff… it’s really a rich framework. With MPI you are very much down to super low level programming. Basically only the communication is handled for you in a C-like programming scheme.
Why? We will never run much more than 8 chains maybe… and we have 8 cores on many machines nowadays. Certainly, if you do more than just the chain-parallelization, then MPI is nice to have as the higher level mechanism to link together different machines (and on the machines you do threading).
I think what you wrote for the cross-chain warmup in MPI is a good enough abstraction there. It doesn’t need to be terribly complicated cause the warmup itself isn’t terribly complicated.
But there is a threading vs. MPI decision that’s happening here and that makes me want to think about the other bits of Stan cause they’re more complicated (the threading solution to cross-chain warmup would also probly be pretty simple).
Right now our warmup is automated in a way that I don’t think this is necessary. It might be though. You could do this with dense vs. diagonal, but I think we have a way to know which would be better before actually collecting the sample (that’s included here and it’s not super robust).
This makes sense in terms of parallelizing calculations in Stan Math at the double level. It’s not obvious to me these things are going to be hugely helpful specifically for cross-chain adaptation or the autodiff stack. Similar to the GPU stuff.
Doing threading with autodiff variables means the autodiff library needs to be threadsafe as well, which wouldn’t be trivial.
Yeah, but we want within chain parallelization too.
If we go TBB for the cross-chain warmup, it’s not obvious to me how the rest of parallelization in Stan works. In the worst case we have threading/GPU (double math) inside MPI/threading (map_rect) inside threading (cross-chain).
I don’t know what implications that has for the autodiff stack, but it sounds complicated. I don’t know if each thread can have its own MPI zone to work in. This seems like it could limit our within-chain scalability.
I do know that if we figure out the dataflow in what a parallel autodiff tree looks like + what Yi proposed for cross-chain warmup, then we don’t have to figure out how to thread autodiff and we know we have scalability.
Anyway I just wanna know how all the MPI stuff works now. There must be some sort of mechanism by which one computer tells other computers to run code. I don’t know how the entry point stuff runs either – when I’ve done MPI before everything runs the same binary. So where do the other processes sit? I assume this is embedded in the language somehow.
Yep…thats not trivial at all. This is why my proposal for parallelism included stan math functions which basically wrap parallel map and reduce from the tbb in a safe way for reverse mode autodiff.
Even if you do 8 chains with threads, then you will still have cores left for within chain parallelization on modern machines. My main point about going all threads is feasibility. We will get a much quicker implementation going if we focus on threads only…and it will be enormously useful for many users and there wont be any installation hiccups (even our cluster has from time to time mpi conf issues).
Mpi is a message passing interface…not more!
Anyway - what I have done for mpi in stan math is very straightforward and simple. You have one root node which distributes work to all workers. The workers are by default in a listening mode for command objects when they do not work on some command already. The root node sends a type id to the workers which are in listening mode. The type id must refer to a command struct which has a distributed_apply method. This method is then called on all workers. At this point the program execution flow is on all childs in the respective distributed_apply of the respective command object and the root knows this such that it can initiate the work which is passed to the workers. Everything must stay synchronized when doing this!
Have a look at the unit test test math prim arr mpi_cluster_test.cpp. There you have a simple example. The mpi_parallel_call_test.cpp is in the mat branch and is more complex as it uses the mpi_parallel_call facility.
I hope this helps you to get started. Be warned that I really designed this thing to do exactly what map rect does - chunk work in equal sizes from a root and send it to all workers … no scheduling, no nothing.
Excellent. Had a look at that and some stuff that Sean sent me. That is what I was looking for. Thanks
Here are some (unverified) thoughts I had when reading through the opening post:
Fixed size Windows are great, this probably solves the problem that sometimes the window size grows too fast (running large windows with a poorly adapted sampler takes very long and is unnecessary, it would be better to first adapt with short windows before running longer ones).
I think this adaptation routine will have problems with multimodal posteriors with strongly separated modes, as well as with phase changes (not sure about the terminology with respect to phase changes). It will probably tend to adapt to a local mode/phase and stop adaptation prematurely of forget important previous adaptation samples if the frequency of mode/phase switching is so low that it doesn’t always occur within each warmup window. Iif the posterior is suspected to exhibit this problem, then (very!) large windows must be chosen to ensure global adaptaion.
I think merging warmup between chains is probably best kept optional, as differences in warmup between chains can be useful for assessing the reliability of the inference.
About using effective sample size to decide about window length and about how much warmup is enougth, and its interaction with the mass matrix:
I think I read that nEff is mostly about the sample size with respect to calculating means or medians from the samples. What if the user is interested in the more extreme parts of the distribution, such as when computing credible intervals, or if he is interested in the correlations between parameters?
I suspect that the computation of nEff depends on the assumption that the correlation structure stays the same throught the sample. This is not the case if the sample spans several adaptation windows. Thus, I think a modified version of nEff is needed, which computes the ratio of nEff/n for each window separately and then uses the lowest of these ratios to estimate a “worst case nEff” for the entire multi-window period. Alternatively nEff could be calculated for each window separately, and they are summed when windows are merged.
The samples of different windows should be weighed (or thinned) according to the windows nEff/n ratio when calculating the metric (not forgetting that the computation of the metric employs regularisation), otherwise the “good” parts get overwhelmed by the bad ones and the metric ends up worse than the nEff calculation would suggest.
I’m not sure how the calculation of R^ within the warmup phase would be affected by this warmup routine.
Thanks @Raoul-Kima for your thoughts. We have had similar thoughts many parts and below are comments where I differ
We are aware of the potential problems and fixes, but just didn’t have them yet here. We have diagnostic for the multimodality and if detected we can warn the user and let the user decide what to do. Strongly separated modes are problematic beyond this adaptation so this adaptation conditional on multimodal diagnostic is not making anything worse. Right now we recommend to test only with posterior with not strongly separated modes. We know the challenges also in case of not strongly separated modes. We are happy to see examples of problematic cases.
I’m not aware of anyone regularly looking at the differences in warmup between chains. Usually people look only differences after warmup between chains. It is not clear that just haring the information about scale of the posterior or local curvature of the posterior would make the chains during the warmup to resemble each other too much so that in case of problems we would falsely assume mixing is good. We are happy to see examples of problematic cases.
We are using both Bulk-ESS and Tail-ESS https://arxiv.org/abs/1903.08008
Although currently we are examining only lp__ during the warmup, as computing Rhats and ESSs for thousands or millions of parameters can take a lot of time. We could add an option that the use can decide which variables are analysed.
It’s just a warmup where we hope to get some speed-up, it’s not yet the actual sampling and more esoteric diagnostics can be left for that part.
We are aware of this and could take it into account. The good thing is that this is still just for the warm-up and small reduction in efficiency in favor of simplicity might not be a problem.
The bad ones should not be included at all in the approach.
I’m not sure where “this” refers? To our original suggestion or what you write in the last paragraph?
@Raoul-Kima Oh crap, my brain has been nuked by the US elections and I forgot to respond.
@avehtari thanks for grabbing this.
Yeah, but we don’t really expect NUTS to work well with multimodal posteriors or things where curvature is changing anyway. In these cases I don’t really expect there to be a right Euclidean metric to pick!
I’m with Aki on this stuff. It’s not that the bad stuff never happened before in warmup. And it’s not like we were really checking before. Now we’re checking for it and trying to compensate.
It’s entirely possible that we hit models that are more fragile because of this, but that’s not what I’ve seen. In my small amount of experience, the accel_spline example in the original post adapts more reliably with this new stuff than the old stuff (I was gonna give details but it’s been so long since I’ve done the experiment I’m scared I’ll get them wrong – but I think this is true for the campfire implementation – still working on the MPI version).
I think a lot of what we’re doing here is automating what we recommend people do for robustness – checking Rhats and Neffs.
These are annoying to check especially with lots of parameters so I’m pretty stoked about that, at the very least.
Thanks for the Answers. Sounds great. Below I’ll only answer points where I have something to add.
Yeah, differences after warmup, caused by warmup. That’s what I meant. I don’t have a good example at the moment, which supports your view that’s it’s not very important.
I don’t think there is a clear line between good and bad, as every adaptation window will be slightly different. But maybe it’s clearer than I thought and/or not a problem in practice.
Not really anything specific. I just wanted to say that i didn’t spend any time thinking about how R^ might work with the multi-window thing, I only thought about nEff. I just mentioned it because I thought similar considerations probably apply there.
nEff (or ESS) is not independent of Rhat as the computation uses the subcomponents of Rhat
So how is everyone feeling about campfire these days? The recent post inquiring about general recommendations for setting warmup duration for very-long-compute-time models reminded me about campfire and it’s seeming suitability for that case if we haven’t encountered any major issues with it yet.
Still in evaluation. @yizhang did the legwork to get most of it implemented with MPI in cmdstan a couple months ago, but I don’t think either of us have messed with it in a while: Cross-chain warmup adaptation using MPI
My plan was to add it to torsten and try some PMX applications but now with school closed kids take priority over codes.