Cross-chain warmup adaptation using MPI

Sorry, I meant to write make clean-all. It has failed twice for me recently.

Yuck. I just modified src/cmdstan/arguments/mpi_cross_chain_set_output.hpp and the makefile failed to rebuild a model, so something may have gone awry in the makefile.

It’s more than just naming the output; it also affects the way adaptation works. If MPI_ADAPTED_WARMUP is on, the sampler will use cross-chain adaptation even with only one process. We can either error out when nproc < 4 or add more if-else handling for nproc == 1.

Good point. I guess it would be ideal if it would just fall back to the old behavior for now. Is that possible in the current code base? If so, what is the ‘if’ statement I should use?

Let me handle that. Currently I’m testing it by simply building the regular binary with MPI_ADAPTED_WARMUP removed, so that I have model_name_seq and model_name_mpi for benchmarking. It also lowers the chance that regular runs get messed up when I fuck up.

Cool beans, thanks

What’s the syntax for supplying non-default values to the sampler? Like adapt_delta and max_treedepth?

It’s a bit complicated. See the details in Section 9 of https://github.com/stan-dev/cmdstan/releases/download/v2.21.0/cmdstan-guide-2.21.0.pdf

I think @mitzimorris is working on making the syntax easier.


Thanks! I should have thought to look at the pdf (was lost googling instead).

For adapt_delta and max_treedepth specifically, I see now that one does:

```r
system(paste("mpiexec -n 4 --tag-output ./", modelname,
             " sample save_warmup=1 algorithm=hmc engine=nuts max_depth=15 adapt delta=.99 data file=",
             datapath, sep = ""))
```


Ha! I missed the name before, but I love it. And it provides all sorts of grist for naming warmup algorithms: mittens, pot-bellied stove, hot bath, pot of tea, etc.

Is it actually possible to run this new warmup in a non-adaptive mode (either directly or via a workaround)?

I mean, can I force the warmup to run for a fixed number of iterations? Would I still benefit in some way from the cross-chain communication?

I think doing so would be useful to understand what “cross-chain” brings in terms of efficiency on its own; or is this irrelevant?

Sharing information makes the mass matrix adaptation faster, which can help reduce the number of leapfrog steps during warmup and may also produce a better mass matrix within a fixed number of warmup iterations.
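This is not the cross-chain implementation itself, but a toy R sketch of the pooling idea: a diagonal metric estimated from the combined warmup draws of all chains is based on several times as many draws as any single chain’s own estimate.

```r
# Toy illustration only (not the actual cross-chain code): with draws from
# several chains pooled together, a diagonal metric (per-parameter variance)
# is estimated from more draws than any single chain has on its own.
set.seed(1)
n_iter   <- 100   # warmup iterations per chain
n_chains <- 4
n_param  <- 3
draws <- replicate(n_chains, matrix(rnorm(n_iter * n_param), n_iter, n_param),
                   simplify = FALSE)

per_chain_metric <- lapply(draws, function(d) apply(d, 2, var))  # one estimate per chain
pooled_metric    <- apply(do.call(rbind, draws), 2, var)         # one shared estimate
```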

Sure…that’s clear, but does it really matter? What problems benefit from it and how much do we gain in efficiency - that’s what I am wondering about. There is usually a price for added complexity.

I do not really know how best to quantify these gains. Maybe reduced computational cost on average, which is hard to estimate given that the set of problems to average over is not small. The metrics you proposed are certainly a start if we collect them for some examples.

What about the hurdle rate? (The number of warmup steps needed to get an ESS of size N.)
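For concreteness, a minimal sketch of that kind of metric (assuming the posterior R package and draws for a single parameter stored as an iterations x chains matrix; this is not a script from this thread):

```r
library(posterior)

# Smallest number of iterations (checked in increments of `step`) at which
# bulk-ESS for this parameter reaches `target_ess`; NA if it never does.
iters_to_target_ess <- function(draws, target_ess = 400, step = 50) {
  for (n in seq(step, nrow(draws), by = step)) {
    if (ess_bulk(draws[seq_len(n), , drop = FALSE]) >= target_ess) {
      return(n)
    }
  }
  NA_integer_
}
```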

We are all wondering the same thing, and that’s why we are running experiments and asking people to try it. We will know better in a few months.

When this thread started I erroneously thought Yi’s branch already included all of campfire, but it doesn’t yet, so the results are not yet that impressive, and you may want to wait until more of campfire is included before you start testing. The current branch has already been useful for testing and development.

See my R script, and report the values for your favorite slow-to-warmup problem. I can then interpret the results. See also below.

Before we make a decision on whether and when to add this to a Stan release, we will have results for 100+ posteriors with different levels of campfire enabled. This will take several months. We will also make it easier for others to test and to get more easily interpreted output.

If the results are not impressive we don’t add it.
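(Not the actual script referred to above, but a rough idea of how per-leapfrog values like those reported later in this thread could be computed from CmdStan output CSVs; the file names are placeholders and the CSVs are assumed to contain post-warmup draws only.)

```r
library(posterior)

files  <- sprintf("output_%d.csv", 1:4)                 # placeholder file names
chains <- lapply(files, read.csv, comment.char = "#")   # skip CmdStan "#" comment lines

# Total leapfrog steps across all chains (n_leapfrog__ is a CmdStan sampler column)
sum_leapfrogs <- sum(sapply(chains, function(d) sum(d$n_leapfrog__)))

# Bulk-ESS for one quantity (lp__ here) from an iterations x chains matrix
draws <- sapply(chains, `[[`, "lp__")
bulk_ess_per_leapfrog <- ess_bulk(draws) / sum_leapfrogs
```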

ESS (biased) for estimating the mass matrix or for actual sampling? This is difficult when we run this for hundreds of different posteriors.

The efficiency measures I’m looking for are listed in Cross-chain warmup adaptation using MPI - #39 by avehtari. When running over hundreds of posteriors, we would be interested in seeing 1) performance that is not much worse for any posterior, 2) improvement on average, and 3) certain types of posteriors with much improved performance.

In all honesty…Stan usually handles warmup well for my models…it’s just slow. So getting a warmup which is as robust and stable as the current one, but faster, would be great. Though sometimes warmup is suboptimal, as chains end up with differing warmup tuning parameters (but that is rarely the case).

I also did not intend to make judgements on whether we include it or not - we are not there yet. I do wish to understand what each bit you propose adds here. For one, I want to know that, and for two, one could potentially bring some features to our users earlier than others.

It is clear that what you do can have a big, positive impact, so I am basically looking for low-hanging fruit here.

Is there already a central place where we collect models and model results? A wiki page or something similar?

I tested campfire a while ago on a model I am interested in. Should I apply the MPI thing to this model? Would that add value? Probably yes, as I can keep this setup around and rerun it every now and then when you bump features.

We collect posteriors (model + data) at https://github.com/stan-dev/posteriordb (a database with posteriors of interest for Bayesian inference).
It’s specifically posteriors because the same model can have very different posteriors depending on the data. The classic examples are hierarchical models with centered or non-centered parameterization, where the posterior can have a funnel shape depending on the model and the data. Please feel free to add issues or make PRs to add interesting examples.

We don’t yet have a central place for collecting campfire results for different posteriors. I guess we are still in the phase of figuring out the compact summary of results we would like to collect. We would prefer that all the best posterior examples be included in posteriordb, so that reproducing results could be automated by pulling models and data from posteriordb; the central place would then be posteriordb + a script on GitHub + auto-generated result output, maybe in md format, also in git.


How about we code up some R scripts which can evaluate Stan model databases (like posteriordb and our example model repository) in an automated way? This should be implemented on top of a cluster-computing-aware backend like the batchtools package (see the sketch below). Then we could run large-scale, fully automated numerical studies on clusters (with all the bells and whistles one likes to have, like some form of replication with different seeds, etc.).

Is it worth at this point to invest this effort?

(This is probably useful regardless of the specifics of warmup studies.)
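To make that concrete, a rough sketch of what a batchtools-driven study could look like; run_one_posterior() and the posterior names here are placeholders, not existing code:

```r
library(batchtools)

# One registry per study; batchtools picks up the cluster backend
# (Slurm, SGE, ...) from its configuration file.
reg <- makeRegistry(file.dir = "campfire_benchmark_registry", seed = 1)

# Hypothetical worker: fit one posterior with one warmup variant and return metrics.
run_one_posterior <- function(posterior_name, warmup) {
  # ... fetch model + data (e.g. from posteriordb), run CmdStan, parse output ...
  list(posterior = posterior_name, warmup = warmup)   # placeholder result
}

grid <- expand.grid(posterior_name = c("eight_schools", "radon"),   # placeholder names
                    warmup         = c("standard", "cross_chain"),
                    stringsAsFactors = FALSE)

batchMap(run_one_posterior,
         posterior_name = grid$posterior_name,
         warmup         = grid$warmup,
         reg = reg)

submitJobs(reg = reg)
waitForJobs(reg = reg)
results <- reduceResultsList(reg = reg)
```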

This is our plan with posteriordb. We have been thinking about whether the testing part should be part of posteriordb or a separate package, bayesbench, but no final decision has been made yet. The posteriordb repo has instructions on how to access the list of posteriors, each a model+data combination with gold-standard estimates. @bbbales2 has used posteriordb to test warmup, but not yet on a cluster. If you have suggestions for cluster-aware scripts, please start a new thread, add to the existing posteriordb thread Beta-release Bayesian Posterior Database, or make an issue in the posteriordb repo.

Yes. The posteriordb project is independent of the warmup work. We don’t have that many models yet, but we have a student working part-time on filling the database with models. It would be great if we got more people involved. One missing part is cluster-computing-aware scripts that are easy to configure (e.g., what information is collected) and easy to use. See also the PR on use cases for posteriordb: https://github.com/MansMeg/posteriordb/pull/120

Got a model that seems to break this (or at least do far worse).

It’s a latent variable model which suffers from factor indeterminacy (so the loadings and factor scores are reflected around zero between chains). Because of this, the ‘raw’ Rhats and ESS for the sampled parameters are hot garbage, but the sign-corrected parameters are all very well-behaved.
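A minimal sketch of that kind of sign correction (assuming the posterior R package and one loading’s draws as an iterations x chains matrix; the draws here are simulated just to show the mechanics):

```r
library(posterior)

# Simulated draws for one loading, with two of four chains reflected around zero
set.seed(1)
load_draws <- cbind(rnorm(500, 0.8, 0.1), rnorm(500, -0.8, 0.1),
                    rnorm(500, 0.8, 0.1), rnorm(500, -0.8, 0.1))

# Flip each chain so its mean loading is positive, then compute diagnostics
flips <- sign(colMeans(load_draws))          # -1 for reflected chains, +1 otherwise
corrected <- sweep(load_draws, 2, flips, `*`)

rhat(load_draws); ess_bulk(load_draws)       # poor on the raw, reflected draws
rhat(corrected);  ess_bulk(corrected)        # well-behaved after sign correction
```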

But it doesn’t look like the adaptive warmup is well-suited for models where the chains intentionally don’t converge:

| Metric | Standard | Cross-Chain |
|---|---|---|
| n_warmup | 1000 | 1000 |
| sum_warmup_leapfrogs | 536708 | 2289660 |
| mean_warmup_leapfrogs | 536.708 | 2289.66 |
| sum_leapfrogs | 251994 | 2172000 |
| mean_leapfrogs | 251.994 | 2172 |
| bulk_ess_per_iter | 0.006 | 0.004 |
| tail_ess_per_iter | 0.027 | 0.005 |
| bulk_ess_per_leapfrog | 2.381e-05 | 1.841e-06 |
| tail_ess_per_leapfrog | 0.00011 | 2.302e-06 |

Standard stepsizes:
0.0529953 0.0480370 0.0699596 0.0610157

Cross-chain stepsizes:
0.000232466 0.068949300 0.000606891 0.044330200

I’ve attached the model and data if anyone wants to experiment further.

data.R (267.7 KB) F1_Base_GT.stan (2.1 KB)
