Divergence vs hitting max_treedepth

Context: I am trying to optimize the chain lengths (warmup and iter) and other tuning parameters (window, term_buffer, max_treedepth, metric=dense_e vs. diag_e, adapt_delta) to minimize run time while keeping N_Eff (~400+) and Rhat (<= 1.01) at acceptable levels. I have some hard-to-estimate parameters (e.g., w[1] and p[3] in these traceplots: dense54K,600,200,11,0.98,75,50,35.pdf (98.3 KB)) and need a large number of simulated data points (5,400 to 9,000) to get reasonable estimates of the true parameters. To keep the run time to a few hours, I am focusing on warmup=400 to 600 and iter=200 (x 4 chains = 2,400+ total iterations); a sketch of how such a configuration maps onto a sampler call follows the list below. I notice that

  • all runs in my comparison that produced only a small number of divergences or max_treedepth warnings give very similar mean estimates;
  • a specification of tuning parameters that works well (no warnings) for a 9K data sample does not necessarily work for a 5.4K sample (e.g., it may produce a couple of divergences, a few iterations that hit the max_treedepth, or both);
  • sometimes the run time is shorter for a 9K sample than for a 5.4K one, presumably because the larger sample provides more information and helps pin down the hard-to-estimate parameters faster.
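For reference, here is a minimal sketch of how I understand one such configuration would be passed to the sampler via CmdStanPy; the file names are placeholders, and the mapping of adapt_metric_window/adapt_step_size onto CmdStan's window/term_buffer is my reading of the CmdStanPy docs, not something confirmed in this thread:

```python
from cmdstanpy import CmdStanModel

# "model.stan" and "sim_data_5400.json" are placeholder file names.
model = CmdStanModel(stan_file="model.stan")

fit = model.sample(
    data="sim_data_5400.json",
    chains=4,
    iter_warmup=600,          # warmup length being tuned (400-600)
    iter_sampling=200,        # post-warmup draws per chain
    max_treedepth=11,
    adapt_delta=0.98,
    metric="dense_e",         # vs. the default "diag_e"
    adapt_metric_window=75,   # corresponds to CmdStan's `window` (my assumption)
    adapt_step_size=50,       # corresponds to CmdStan's `term_buffer` (my assumption)
)

# CmdStan's built-in checks: divergences, treedepth saturation,
# E-BFMI, R-hat, and effective sample size.
print(fit.diagnose())
print(fit.summary())
```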

Questions:

  • I know there have been related discussions (on divergences and max_treedepth), but I would appreciate a bit more guidance on whether 1 or 2 divergences or max_treedepth warnings are acceptable under certain conditions (I count these as in the sketch after the questions).
    • even this post by @betanalpha mentions that "1 of 10000 iterations ended with a divergence (0.01%)", which is indicative of the divergences being false positives.
  • If two specifications give 1 divergence in the former and 1 instance of hitting the max_treedepth in the latter, is it fair to say that the latter is likely to be less biased than the former?
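The counts above come from the per-draw sampler diagnostics; a small sketch of how I tally them, assuming the CmdStanPy fit object from the earlier sketch and its method_variables() accessor:

```python
import numpy as np

# Per-draw sampler diagnostics, arrays of shape (iter_sampling, chains).
diag = fit.method_variables()

max_depth = 11  # the max_treedepth used for this run
n_divergent = int(np.sum(diag["divergent__"]))
n_hit_max = int(np.sum(diag["treedepth__"] >= max_depth))

print(f"divergent transitions: {n_divergent}")
print(f"iterations that saturated max_treedepth: {n_hit_max}")
```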

I would very much appreciate your feedback. Thanks.

A few divergences are much worse than hitting the maximum treedepth a few times: divergences indicate the sampler is failing to explore part of the posterior and can bias the estimates, whereas hitting max_treedepth only means trajectories were truncated, which costs efficiency rather than validity.
