So far I have found the following observations to hold:
- Using a dense mass matrix appears to significantly improve every performance measure of interest. I guess that is because correlations between variables frequently arise in the posterior?
- Playing with the ODE solver tolerances may lead to catastrophe or to salvation (a configuration sketch for both points follows this list).
- The adjoint ODE prototype (from the "Adjoint ODE Prototype - RFC - Please Test" thread) appears to become competitive very quickly.
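For reference, here is a minimal sketch of what these settings look like in a cmdstanpy call; the model file name and the tolerance entries are hypothetical placeholders and assume the model reads its solver tolerances from the data block:

```python
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="my_ode_model.stan")  # hypothetical model

fit = model.sample(
    data={
        "rel_tol": 1e-6,  # assumes the model exposes its ODE solver
        "abs_tol": 1e-6,  # tolerances via the data block
        # ... remaining model data ...
    },
    chains=4,
    metric="dense_e",  # dense mass matrix instead of the default "diag_e"
    iter_warmup=1000,
    iter_sampling=1000,
)
```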
Furthermore, for the way I set up my toy models, the following worked out exceedingly well, significantly reducing wall time, reducing runtime differences between chains, and increasing ESS:
- Instead of a single run using all of the data, warm the model up in chunks: take 1, 2, 4, 8, 16, … timesteps, and for each run perform a short warmup phase with

  `adapt_init_phase=25, adapt_metric_window=100, adapt_step_size=0, iter_warmup=125`

  initialized with a metric computed from the (relevant) warmup samples of all chains from the previous run. For the final run, first do the above and compute the metric, then do the sampling preceded by the standard window for adapting the step size. (A sketch of the full scheme follows below.)
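Here is a minimal sketch of that scheme, assuming cmdstanpy; the model file, the `subset_data` helper, and the chunk schedule are hypothetical placeholders, and computing the metric directly from the draws assumes all parameters are unconstrained (the metric lives on the unconstrained scale, so constrained parameters would first have to be transformed):

```python
import numpy as np
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="my_ode_model.stan")  # hypothetical model
full_data = {}  # full data set for the model goes here


def subset_data(data, n_steps):
    """Hypothetical helper: restrict the data to the first n_steps timesteps."""
    raise NotImplementedError


def dense_inv_metric(fit):
    """Pool the metric-window warmup draws of all chains into a dense
    covariance estimate, used as the inverse metric of the next run.

    Assumes all parameters are unconstrained; constrained parameters
    would have to be mapped to the unconstrained scale first.
    """
    draws = fit.draws(inc_warmup=True)[25:]  # drop the init phase
    cols = [i for i, n in enumerate(fit.column_names) if not n.endswith("__")]
    pooled = draws[:, :, cols].reshape(-1, len(cols))
    return np.cov(pooled, rowvar=False)


inv_metric = None
for n_steps in (1, 2, 4, 8, 16, 32, 64, 128):  # 128 = all timesteps, hypothetical
    warm = model.sample(
        data=subset_data(full_data, n_steps),
        chains=4,
        save_warmup=True,
        iter_warmup=125,
        iter_sampling=0,  # warmup-only run
        adapt_init_phase=25,
        adapt_metric_window=100,
        adapt_step_size=0,  # skip the terminal step-size window
        metric="dense_e" if inv_metric is None
        else {"inv_metric": inv_metric.tolist()},
    )
    inv_metric = dense_inv_metric(warm)

# Final run on the full data: fix the metric computed above and run only
# the standard terminal window (50 iterations) to adapt the step size.
fit = model.sample(
    data=full_data,
    chains=4,
    metric={"inv_metric": inv_metric.tolist()},
    iter_warmup=50,
    adapt_init_phase=0,
    adapt_metric_window=0,
    adapt_step_size=50,
    iter_sampling=1000,
)
```

Note that passing the previous run's covariance as `inv_metric` only initializes the metric; the `adapt_metric_window=100` phase still refines it on the new data chunk.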
Is there anything subtly or not so subtly wrong with the above?