Context: I am trying to optimize the chain length (`iter`) and other tuning parameters (`adapt_delta`) to minimize run time while keeping the N_Eff (~400+) and Rhat (<= 1.01) at acceptable levels. I have some hard-to-estimate parameters (e.g., `p` in these traceplots: dense54K,600,200,11,0.98,75,50,35.pdf (98.3 KB)) and need a large number of simulated data points (5,400 to 9,000) to get reasonable estimates of the true parameters. To keep the run time to a couple to several hours, I am focusing on
`iter = 200` (× 4 chains = 800 total iterations). I notice that:
- In my comparison, all the runs with a small number of divergences/`max_treedepth` hits gave very similar mean estimates.
- A specification of tuning parameters that works well (no warnings) for a 9K data sample does not necessarily work for a 5.4K sample (e.g., a couple of divergences, a couple of `max_treedepth` hits, or even both).
- Sometimes the run time can be shorter for a 9K sample than for a 5.4K one, presumably because a larger sample carries more information that helps pin down the hard-to-estimate parameters faster.
- I know there have been related discussions (on divergences and `max_treedepth`), but I would appreciate a bit more guidance on whether 1 or 2 divergences/`max_treedepth` hits are acceptable under certain conditions.
- If two specifications give 1 divergence in the former and 1 instance of hitting `max_treedepth` in the latter, is it fair to say that the latter is likely to be less biased than the former (the 1 divergence)?
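For concreteness, this is roughly how I tally the post-warmup problems I am describing. It is just a sketch: the `divergent` and `treedepth` arrays stand for Stan's `divergent__` and `treedepth__` sampler diagnostics (draws × chains), and `max_treedepth = 11` is an assumption matching my settings.

```python
import numpy as np

def count_sampler_problems(divergent, treedepth, max_treedepth=11):
    """Count post-warmup divergences and max_treedepth saturations.

    divergent: 0/1 flags (n_draws x n_chains), like Stan's divergent__.
    treedepth: tree depths (n_draws x n_chains), like Stan's treedepth__.
    """
    divergent = np.asarray(divergent)
    treedepth = np.asarray(treedepth)
    total = divergent.size
    n_div = int(divergent.sum())
    # a transition "hits" max_treedepth when it reaches the cap
    n_hit = int((treedepth >= max_treedepth).sum())
    return {
        "n_divergent": n_div,
        "pct_divergent": 100.0 * n_div / total,
        "n_max_treedepth": n_hit,
        "pct_max_treedepth": 100.0 * n_hit / total,
    }
```

So "1 or 2 divergences" in my runs means 1 or 2 flags out of all post-warmup transitions across the 4 chains, i.e., a fraction well under 1%.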
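And by Rhat <= 1.01 I mean the usual split-R-hat. A minimal sketch of that computation (the classic split-R-hat formula, without rank normalization; `draws` is a hypothetical n_draws × n_chains array for one parameter):

```python
import numpy as np

def split_rhat(draws):
    """Classic split R-hat for one parameter; draws is (n_draws, n_chains)."""
    draws = np.asarray(draws, dtype=float)
    n = draws.shape[0] // 2
    # split each chain in half, doubling the number of chains
    split = np.concatenate([draws[:n], draws[n:2 * n]], axis=1)
    chain_means = split.mean(axis=0)
    chain_vars = split.var(axis=0, ddof=1)
    W = chain_vars.mean()               # within-chain variance
    B = n * chain_means.var(ddof=1)     # between-chain variance
    var_plus = (n - 1) / n * W + B / n  # pooled posterior variance estimate
    return float(np.sqrt(var_plus / W))
```

Well-mixed chains give values near 1.0, while chains stuck in different regions push it well above 1.01.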
I would very much appreciate your feedback. Thanks.