New adaptive warmup proposal (looking for feedback)!

The warmup isn’t valid MCMC because we’re changing the timestep every draw and the metric every window_size draws.

We’re assuming it’s doing well enough that we can trust the warmup samples to build the metric, but other than that we toss ’em. Sampling follows all the MCMC rules, so we trust those samples. I think that’s basically it.
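
To make that concrete, here is a rough R-style sketch of the windowed scheme as I understand it (run_window(), compute_metric(), converged(), and the other names are hypothetical placeholders, not the actual campfire API):

# Sketch of the adaptive warmup loop: the stepsize is re-tuned on every draw
# inside run_window(), and the metric is rebuilt after every window_size draws.
warmup_draws <- NULL
metric <- diag(n_params)            # start from a unit diagonal metric
stepsize <- initial_stepsize

repeat {
  win <- run_window(model, metric, stepsize, n_draws = window_size)
  warmup_draws <- rbind(warmup_draws, win$draws)

  # The accumulated (non-Markovian) warmup draws are only used to build the metric.
  metric   <- compute_metric(warmup_draws)
  stepsize <- win$stepsize

  # Stop once cross-chain diagnostics (Rhat / bulk ESS) over the kept windows look OK.
  if (converged(warmup_draws)) break
}

# The warmup draws are then thrown away; sampling proper runs with the final
# metric and stepsize and follows the usual MCMC rules, so those draws are trusted.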


This looks really cool! I finally got to install it and throw two models at it. The models are part of my package: https://cran.r-project.org/package=OncoBayes2. I observed that

  • the “combo2” model (you can run it with example_model(“combo2”)) runs better with vanilla cmdstan; I am getting a higher Neff for a standard run (1k warmup, 1k sampling)

  • the combo3 model is running somewhat better with the campfire warmup in terms of Neff

The runtime in both cases is of the same order… maybe campfire is a bit faster overall. The timing here includes the warmup and sampling time, but not the model setup.

I would need to set up a simulation study to make the above statements firmer. Before doing that I will probably try this thing out on an ODE model which I usually run with a short fixed warmup, where any savings would be huge… or what else would make sense?

Any other “data” you are looking for?

Can you provide effective sample size per log density evaluation (n_leapfrog)?
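
For instance, something along these lines (a rough R sketch using numbers straight off a CmdStan summary printout; mean n_leapfrog__ times the number of saved draws approximates the total number of gradient evaluations, ignoring warmup):

# ESS per log density gradient evaluation, from the summary output.
#   n_eff:         N_Eff for the quantity of interest (e.g. lp__)
#   mean_leapfrog: mean of n_leapfrog__
#   n_draws:       total saved post-warmup draws across chains
ess_per_grad <- function(n_eff, mean_leapfrog, n_draws) {
  n_eff / (mean_leapfrog * n_draws)
}

# Example with made-up numbers:
ess_per_grad(n_eff = 2000, mean_leapfrog = 30, n_draws = 4000)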

Campfire run, combo2:

Inference for Stan model: blrm_exnex_model
4 chains: each with iter=(1000,1000,1000,1000); warmup=(0,0,0,0); thin=(1,1,1,1); 4000 iterations saved.

Warmup took (0.085, 0.11, 0.12, 0.080) seconds, 0.39 seconds total
Sampling took (2.0, 1.5, 2.1, 1.9) seconds, 7.6 seconds total

                              Mean     MCSE   StdDev     5%       50%       95%    N_Eff  N_Eff/s    R_hat
lp__                      -5.1e+01  1.2e-01  5.0e+00    -60  -5.1e+01  -4.4e+01  1.8e+03  2.4e+02  1.0e+00
accept_stat__              9.0e-01  4.2e-03  1.7e-01   0.55   9.6e-01   1.0e+00  1.6e+03  2.1e+02  1.0e+00
stepsize__                 1.8e-01  1.2e-02  1.7e-02   0.16   1.9e-01   2.0e-01  2.0e+00  2.6e-01  2.0e+13
treedepth__                4.6e+00  2.2e-01  4.8e-01    4.0   5.0e+00   5.0e+00  4.7e+00  6.2e-01  1.3e+00
n_leapfrog__               2.7e+01  2.8e+00  6.9e+00     15   3.1e+01   3.1e+01  6.0e+00  7.9e-01  1.2e+00
divergent__                5.0e-04      nan  2.2e-02   0.00   0.0e+00   0.0e+00      nan      nan      nan
energy__                   7.7e+01  1.8e-01  7.1e+00     66   7.7e+01   8.9e+01  1.6e+03  2.1e+02  1.0e+00
log_beta_raw[1,1,1]       -5.3e-02  1.3e-02  9.8e-01   -1.6  -5.0e-02   1.6e+00  5.6e+03  7.4e+02  1.0e+00
log_beta_raw[1,1,2]        5.6e-02  1.5e-02  9.8e-01   -1.6   6.1e-02   1.7e+00  4.5e+03  5.9e+02  1.0e+00
log_beta_raw[1,2,1]        1.8e-02  1.3e-02  1.0e+00   -1.6   2.0e-02   1.7e+00  5.6e+03  7.4e+02  1.0e+00
log_beta_raw[1,2,2]        2.9e-04  1.3e-02  9.9e-01   -1.6   7.7e-04   1.6e+00  5.4e+03  7.2e+02  1.0e+00

Combo2, cmdstan run:

Inference for Stan model: blrm_exnex_model
4 chains: each with iter=(1000,1000,1000,1000); warmup=(0,0,0,0); thin=(1,1,1,1); 4000 iterations saved.

Warmup took (2.2, 2.1, 2.1, 2.2) seconds, 8.6 seconds total
Sampling took (2.1, 2.1, 1.8, 2.1) seconds, 8.1 seconds total

                              Mean     MCSE   StdDev     5%       50%       95%    N_Eff  N_Eff/s    R_hat
lp__                      -5.2e+01  1.4e-01  5.2e+00    -61  -5.2e+01  -4.4e+01  1.4e+03  1.7e+02  1.0e+00
accept_stat__              9.2e-01  7.1e-03  1.5e-01   0.60   9.7e-01   1.0e+00  4.6e+02  5.7e+01  1.0e+00
stepsize__                 1.5e-01  1.7e-02  2.4e-02   0.12   1.5e-01   1.9e-01  2.0e+00  2.5e-01  3.7e+13
treedepth__                4.9e+00  1.1e-01  2.9e-01    4.0   5.0e+00   5.0e+00  6.7e+00  8.4e-01  1.2e+00
n_leapfrog__               3.0e+01  1.2e+00  3.9e+00     15   3.1e+01   3.1e+01  9.8e+00  1.2e+00  1.1e+00
divergent__                0.0e+00      nan  0.0e+00   0.00   0.0e+00   0.0e+00      nan      nan      nan
energy__                   7.8e+01  2.0e-01  7.3e+00     67   7.7e+01   9.0e+01  1.3e+03  1.6e+02  1.0e+00
log_beta_raw[1,1,1]       -8.1e-02  1.2e-02  9.6e-01   -1.6  -1.0e-01   1.5e+00  6.6e+03  8.2e+02  1.0e+00
log_beta_raw[1,1,2]        9.8e-02  1.2e-02  9.9e-01   -1.5   1.1e-01   1.7e+00  6.6e+03  8.2e+02  1.0e+00

combo3, campfire:

Inference for Stan model: blrm_exnex_model
4 chains: each with iter=(1000,1000,1000,1000); warmup=(0,0,0,0); thin=(1,1,1,1); 4000 iterations saved.

Warmup took (0.21, 0.22, 0.22, 0.23) seconds, 0.88 seconds total
Sampling took (3.3, 3.3, 3.3, 3.3) seconds, 13 seconds total

                              Mean     MCSE   StdDev     5%       50%       95%    N_Eff  N_Eff/s    R_hat
lp__                      -7.2e+01  1.8e-01  7.1e+00    -84  -7.1e+01  -6.0e+01  1.6e+03  1.2e+02  1.0e+00
accept_stat__              9.0e-01  2.4e-03  1.3e-01   0.65   9.5e-01   1.0e+00  2.8e+03  2.2e+02  1.0e+00
stepsize__                 4.4e-01  1.6e-02  2.3e-02   0.40   4.5e-01   4.6e-01  2.0e+00  1.5e-01  1.4e+13
treedepth__                4.0e+00      nan  1.6e-02    4.0   4.0e+00   4.0e+00      nan      nan      nan
n_leapfrog__               1.5e+01      nan  1.6e-01     15   1.5e+01   1.5e+01      nan      nan      nan
divergent__                5.0e-04      nan  2.2e-02   0.00   0.0e+00   0.0e+00      nan      nan      nan
energy__                   1.2e+02  2.7e-01  1.0e+01    105   1.2e+02   1.4e+02  1.4e+03  1.1e+02  1.0e+00
log_beta_raw[1,1,1]       -5.1e-02  1.2e-02  1.0e+00   -1.7  -5.8e-02   1.6e+00  6.6e+03  5.0e+02  1.0e+00
log_beta_raw[1,1,2]       -8.5e-03  1.5e-02  9.8e-01   -1.6  -5.8e-03   1.6e+00  4.5e+03  3.4e+02  1.0e+00

combo3, cmdstan:

Inference for Stan model: blrm_exnex_model
4 chains: each with iter=(1000,1000,1000,1000); warmup=(0,0,0,0); thin=(1,1,1,1); 4000 iterations saved.

Warmup took (3.8, 3.9, 3.8, 3.8) seconds, 15 seconds total
Sampling took (3.3, 3.3, 3.3, 3.3) seconds, 13 seconds total

                              Mean     MCSE   StdDev     5%       50%       95%    N_Eff  N_Eff/s    R_hat
lp__                      -7.2e+01  1.8e-01  7.1e+00    -84  -7.2e+01  -6.1e+01  1.6e+03  1.2e+02  1.0e+00
accept_stat__              9.0e-01  2.5e-03  1.4e-01   0.61   9.5e-01   1.0e+00  3.3e+03  2.5e+02  1.0e+00
stepsize__                 2.4e-01  1.2e-02  1.7e-02   0.21   2.5e-01   2.6e-01  2.0e+00  1.5e-01  1.8e+13
treedepth__                4.0e+00      nan  5.7e-02    4.0   4.0e+00   4.0e+00      nan      nan      nan
n_leapfrog__               1.5e+01      nan  1.3e+00     15   1.5e+01   1.5e+01      nan      nan      nan
divergent__                1.3e-03      nan  3.5e-02   0.00   0.0e+00   0.0e+00      nan      nan      nan
energy__                   1.2e+02  2.6e-01  1.0e+01    106   1.2e+02   1.4e+02  1.5e+03  1.1e+02  1.0e+00
log_beta_raw[1,1,1]       -2.8e-02  1.3e-02  9.7e-01   -1.6  -4.6e-02   1.5e+00  5.5e+03  4.1e+02  1.0e+00
log_beta_raw[1,1,2]        1.1e-03  1.3e-02  1.0e+00   -1.7   7.4e-04   1.6e+00  6.1e+03  4.6e+02  1.0e+00

Does that help?


Yes. Campfire warmup is 20 times faster. The adaptation result is the same.


No no no, that warmup time only includes the final timestep adaptation, so it’s misleading. Looks like @wds15’s models are specified well enough that a diagonal metric is good enough.

The only gain to be had would be if the warmup terminates earlier than the default warmup. @wds15, how many windows did it use in the initial warmup periods? Basically, how many rows were in the tables of numbers it was printing out?


Combo3 runtimes:

> camp_run
   user  system elapsed 
 22.212   0.397   9.158 
> cmdstan_run
   user  system elapsed 
 28.619   0.109   7.286 

final campfire warmup message for combo3:

[1] "Adaptive warmup debug info (figure out how many initial windows to drop):"
# A tibble: 2 x 4
  drop_first_n_windows picked max_Rhat min_Bulk_ESS
                 <int> <chr>     <dbl>        <dbl>
1                    0 picked     1.04          381
2                    1 ""         1.07          182
[1] "Metric calculation info (lower c better, last is always picked, minimum is 1.0):"
# A tibble: 7 x 2
  name       c_hybrid
  <chr>         <dbl>
1 dense          7.82
2 hess1_wish     6.85
3 hess2_wish     6.71
4 lw2004         5.65
5 diag           5.47
6 hess1          5.36
7 hess2          5.27

combo2 runtime:

> camp_run
   user  system elapsed 
  9.553   0.322   4.518 
> cmdstan_run
   user  system elapsed 
 28.619   0.109   7.286 

Final campfire warmup message for combo2:

[1] "Adaptive warmup debug info (figure out how many initial windows to drop):"
# A tibble: 2 x 4
  drop_first_n_windows picked max_Rhat min_Bulk_ESS
                 <int> <chr>     <dbl>        <dbl>
1                    0 ""         1.03          179
2                    1 picked     1.04          234
[1] "Metric calculation info (lower c better, last is always picked, minimum is 1.0):"
# A tibble: 7 x 2
  name       c_hybrid
  <chr>         <dbl>
1 diag          13.7 
2 hess1         13.3 
3 lw2004        11.5 
4 dense         10.9 
5 hess1_wish    10.7 
6 hess2_wish     9.74
7 hess2          9.21

Oh okay, so your models are parameterized well enough that this mechanism decides they only need 200 iterations of warmup each (there are only two rows in the adaptive warmup debug info table, and window_size is 100 by default).

You might try window_size = 50 and see if you can go even lower, but we’re computing Rhats on these windows so you don’t want them too small (in the Rhat calculations we’ll split them in half again and compute marginal variance estimates).
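
For intuition, here is a minimal base-R sketch of the classic split-Rhat computation (not necessarily exactly what campfire does internally): each chain’s window gets split in half again, so with window_size = 50 the between/within variance estimates come from half-chains of only 25 draws each.

# Plain split-Rhat on one adaptation window (draws: iterations x chains matrix).
split_rhat <- function(draws) {
  half <- floor(nrow(draws) / 2)
  # Split every chain in half so each half counts as its own chain.
  halves <- cbind(draws[seq_len(half), ], draws[half + seq_len(half), ])
  n <- nrow(halves)
  chain_means <- colMeans(halves)
  chain_vars  <- apply(halves, 2, var)
  B <- n * var(chain_means)            # between-half-chain variance
  W <- mean(chain_vars)                # within-half-chain variance
  sqrt(((n - 1) / n * W + B / n) / W)
}

# With window_size = 50 and 4 chains, each half-chain is only 25 draws,
# so B and W (and hence Rhat) are fairly noisy:
split_rhat(matrix(rnorm(50 * 4), nrow = 50, ncol = 4))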

@seantalts, @wds15, @yizhang: anyone got time to walk me through how MPI currently works in Stan?

I was talking to Yi and trying to think through how MPI would work with this, but I’m distracted trying to figure out how MPI works with Math at all.

I want to know how MPI works because I would like to write down what it would take to do MPI in the chain() part of autodiff.

Actually, @Stevo15025, did you work on this? Including you here too!

I know a little bit. I need to review some other code in Math right now, then I’ll have a look. I think this is a very nice fit with the multi-chain stuff we have been talking about lately.

MPI does not work on Windows at the moment since we haven’t figured out the build stuff there yet.

A feature like parallel warmup should by all means work on all platforms in an easy-to-use setup and should not rely on something as hard to set up as MPI. In short, I strongly suggest doing a thread-based Intel TBB implementation first… or MPI support needs to be improved substantially first (supporting Windows and making MPI installs easy for all users on all platforms).

Oh, this is the Boost MPI thing, right?

I’m probably not properly appreciating the CRAN + Windows combo, but this hasn’t been my experience with MPI on Linux. Though I suppose TBB is easier in that it’s just a shared library.

I think architecturally a simple, threaded reverse mode would probably look pretty similar in terms of where data needs to go and what interactions we need with the compiler. They would differ in how the autodiff stack is managed for sure.

I want to talk about each worker maintaining its own autodiff stack, which the autodiff stack on a manager node would reference. If the specific effort I make gets thrown away, that’s okay. The upside for MPI is higher, right?

I was handwaving MPI to just mean parallel here, but yeah, we should use the TBB.

I skimmed over the R code for your stuff. I think if we write it up well it should be pretty easy to put parallel stuff in it.

If you are planning for MPI cross-chain warmup, I suggest you design it from scratch (which I’ve done, but new ideas always help) instead of looking at how it works in Math and trying to adapt to that. I believe Math’s design should follow what you want at the chain level (refactoring Math if necessary), not the other way around. I’d be happy to discuss it.

One thing I was wondering that I think is semi-related: have we ever thought about doing something like model-based optimization (like what mlrMBO does)? I.e., start N chains for 100 steps with different settings, build a quick-and-dirty little model to predict ESS and Rhat given the tuning parameters, and then have the next 100 adaptation steps start with some variants of the settings we think will give the maximum ESS and minimum Rhat?
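
Roughly something like this toy R sketch of the idea: fit a quick surrogate to the (setting, ESS) pairs seen so far and aim the next adaptation window at the surrogate’s optimum. (A real mlrMBO-style setup would use a proper surrogate plus an acquisition function balancing exploration and exploitation; run_short_adaptation() here is a fake stand-in for “run N chains for 100 steps and report bulk ESS”.)

# Toy model-based tuning of a single setting (here: a target acceptance rate).
run_short_adaptation <- function(delta) {
  1000 * exp(-20 * (delta - 0.85)^2) + rnorm(1, 0, 30)   # fake noisy ESS curve
}

history <- data.frame(delta = c(0.60, 0.80, 0.95))
history$ess <- sapply(history$delta, run_short_adaptation)

for (round in 1:5) {
  # Quick-and-dirty surrogate: quadratic regression of ESS on the setting.
  fit  <- lm(ess ~ poly(delta, 2, raw = TRUE), data = history)
  grid <- data.frame(delta = seq(0.50, 0.99, by = 0.01))
  grid$pred <- predict(fit, newdata = grid)

  # Run the next short adaptation window at the predicted optimum.
  next_delta <- grid$delta[which.max(grid$pred)]
  history <- rbind(history,
                   data.frame(delta = next_delta,
                              ess   = run_short_adaptation(next_delta)))
}
history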

I believe that with my new knowledge about building shared libraries on Windows from the TBB ride, one could probably reattempt getting things working on Windows… it’s just that I never did set up MPI on Windows.

It’s a lot more than the build convenience (which is already big). With the TBB you have a huge library available for parallelizing your code based on pipelines, parallel for/reduce, flow graph stuff… it’s really a rich framework. With MPI you are very much down to super-low-level programming. Basically only the communication is handled for you, in a C-like programming scheme.

Why? We will never run much more than 8 chains, maybe… and we have 8 cores on many machines nowadays. Certainly, if you do more than just the chain parallelization, then MPI is nice to have as the higher-level mechanism to link different machines together (and on each machine you do threading).

I think what you wrote for the cross-chain warmup in MPI is a good enough abstraction there. It doesn’t need to be terribly complicated because the warmup itself isn’t terribly complicated.

But there is a threading vs. MPI decision happening here, and that makes me want to think about the other bits of Stan because they’re more complicated (the threading solution to cross-chain warmup would also probably be pretty simple).

Right now our warmup is automated in a way that makes me think this isn’t necessary. It might be, though. You could do this with dense vs. diagonal, but I think we have a way to know which would be better before actually collecting the samples (that’s included here, and it’s not super robust).

This makes sense in terms of parallelizing calculations in Stan Math at the double level. It’s not obvious to me these things are going to be hugely helpful specifically for cross-chain adaptation or the autodiff stack. Similar to the GPU stuff.

Doing threading with autodiff variables means the autodiff library needs to be threadsafe as well, which wouldn’t be trivial.

Yeah, but we want within chain parallelization too.

If we go TBB for the cross-chain warmup, it’s not obvious to me how the rest of parallelization in Stan works. In the worst case we have threading/GPU (double math) inside MPI/threading (map_rect) inside threading (cross-chain).

I don’t know what implications that has for the autodiff stack, but it sounds complicated. I don’t know if each thread can have its own MPI zone to work in. This seems like it could limit our within-chain scalability.

I do know that if we figure out the dataflow in what a parallel autodiff tree looks like + what Yi proposed for cross-chain warmup, then we don’t have to figure out how to thread autodiff and we know we have scalability.

Anyway, I just want to know how all the MPI stuff works now. There must be some mechanism by which one computer tells other computers to run code. I don’t know how the entry point stuff runs either; when I’ve done MPI before, everything runs the same binary. So where do the other processes sit? I assume this is embedded in the language somehow.

Yep… that’s not trivial at all. This is why my proposal for parallelism included Stan Math functions which basically wrap parallel map and reduce from the TBB in a way that is safe for reverse-mode autodiff.

Even if you do 8 chains with threads, you will still have cores left for within-chain parallelization on modern machines. My main point about going all threads is feasibility. We will get a much quicker implementation going if we focus on threads only… and it will be enormously useful for many users, and there won’t be any installation hiccups (even our cluster has MPI configuration issues from time to time).

MPI is a message passing interface… nothing more!

Anyway, what I have done for MPI in Stan Math is very straightforward and simple. You have one root node which distributes work to all workers. By default the workers sit in a listening mode for command objects whenever they are not already working on some command. The root node sends a type id to the workers that are in listening mode. The type id must refer to a command struct which has a distributed_apply method. This method is then called on all workers. At that point the program execution flow on all children is inside the distributed_apply of the respective command object, and the root knows this, so it can initiate the work that is passed to the workers. Everything must stay synchronized when doing this!

Have a look at the unit test test/math/prim/arr/mpi_cluster_test.cpp; there you have a simple example. The mpi_parallel_call_test.cpp is in the mat branch and is more complex, as it uses the mpi_parallel_call facility.

I hope this helps you get started. Be warned that I really designed this thing to do exactly what map_rect does: chunk work into equal sizes from a root and send it to all workers… no scheduling, no nothing.

Excellent. Had a look at that and some stuff that Sean sent me. That is what I was looking for. Thanks