Context: I am trying to optimize the chain length (`iter`) and other tuning parameters (`adapt_delta`) to minimize run time while keeping the N_Eff (~400+) and Rhat (<= 1.01) at acceptable levels. I have some hard-to-estimate parameters (e.g., `p` in these traceplots: dense54K,600,200,11,0.98,75,50,35.pdf (98.3 KB)) and need a large number of simulated data points (5,400 to 9,000) to get reasonable estimates of the true parameters. To keep the run time to a couple to several hours, I am focusing on
`iter = 200` (× 4 chains = 800 total iterations). I notice that:
- In my comparison, all the runs with a small number of divergences/`max_treedepth` hits gave very similar mean estimates.
- A specification of tuning parameters that works well (no warnings) for a 9K data sample does not necessarily work for a 5.4K sample (e.g., a couple of divergences, a couple of `max_treedepth` hits, or even both).
- Sometimes the run time can be shorter for a 9K sample than for a 5.4K one, presumably because a larger sample carries more information that helps pin down the hard-to-estimate parameters faster.
- I know there have been related discussions (on divergences and `max_treedepth`), but I would appreciate a bit more guidance on whether 1 or 2 divergences/`max_treedepth` hits are acceptable under certain conditions.
- If two specifications give 1 divergence in the former and 1 instance of hitting `max_treedepth` in the latter, is it fair to say that the latter is likely to be less biased than the former (the 1 divergence)?
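For concreteness, this is roughly how I tally the post-warmup problems I am describing. It is just a sketch: the `divergent` and `treedepth` arrays stand for Stan's `divergent__` and `treedepth__` sampler diagnostics (draws × chains), and `max_treedepth = 11` is an assumption matching my settings.

```python
import numpy as np

def count_sampler_problems(divergent, treedepth, max_treedepth=11):
    """Count post-warmup divergences and max_treedepth saturations.

    divergent: 0/1 flags (n_draws x n_chains), like Stan's divergent__.
    treedepth: tree depths (n_draws x n_chains), like Stan's treedepth__.
    """
    divergent = np.asarray(divergent)
    treedepth = np.asarray(treedepth)
    total = divergent.size
    n_div = int(divergent.sum())
    # a transition "hits" max_treedepth when it reaches the cap
    n_hit = int((treedepth >= max_treedepth).sum())
    return {
        "n_divergent": n_div,
        "pct_divergent": 100.0 * n_div / total,
        "n_max_treedepth": n_hit,
        "pct_max_treedepth": 100.0 * n_hit / total,
    }
```

So "1 or 2 divergences" in my runs means 1 or 2 flags out of all post-warmup transitions across the 4 chains, i.e., a fraction well under 1%.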
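And by Rhat <= 1.01 I mean the usual split-R-hat. A minimal sketch of that computation (the classic split-R-hat formula, without rank normalization; `draws` is a hypothetical n_draws × n_chains array for one parameter):

```python
import numpy as np

def split_rhat(draws):
    """Classic split R-hat for one parameter; draws is (n_draws, n_chains)."""
    draws = np.asarray(draws, dtype=float)
    n = draws.shape[0] // 2
    # split each chain in half, doubling the number of chains
    split = np.concatenate([draws[:n], draws[n:2 * n]], axis=1)
    chain_means = split.mean(axis=0)
    chain_vars = split.var(axis=0, ddof=1)
    W = chain_vars.mean()               # within-chain variance
    B = n * chain_means.var(ddof=1)     # between-chain variance
    var_plus = (n - 1) / n * W + B / n  # pooled posterior variance estimate
    return float(np.sqrt(var_plus / W))
```

Well-mixed chains give values near 1.0, while chains stuck in different regions push it well above 1.01.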
I would very much appreciate your feedback. Thanks.