Problems remain after non-centered parametrization of correlated parameters

After comparing many combinations of the tuning parameters (warmup, iter, window, term_buffer, max_treedepth, metric = dense_e vs. diag_e, adapt_delta), the following combinations seem to give the shortest run time while keeping the sampling results at an acceptable level:

diag5.4K[35,150,0.82,600]_5.8 1.4hrs.txt (4.3 KB)
dense5.4K[35,150,0.982,600]_6.7 1.4hrs.txt (4.3 KB)
diag9K[35,150,0.82,600]_12 4.1hrs.txt (4.3 KB)
dense9K[25,50,0.982,600]_13.7 3.7hrs.txt (4.3 KB)

The takeaways seem to be:

  • For a given number of simulated data points (9K or 5.4K), diag_e runs a bit faster than dense_e, mainly because the former can take a lower level of adapt_delta (0.82 instead of 0.982) without leading to divergences.
    • But as far as the parameters of interest are concerned, the N_Effs from diag_e seem to vary more, with the lowest being acceptable but substantially lower than the lowest of those from dense_e.
  • The default choices of (window, init_buffer, term_buffer) = (25, 75, 50) appear to be quite good: alternative choices do not lead to dramatic improvements, especially when the sample size is smaller (i.e., 5.4K rather than 9K).
    • (window, term_buffer) = (35, 150) performs slightly better. I am not sure why, but my guess is that a larger window (35 rather than 25) reduces the iterations spent on the last and slowest stage of the adaptation process, where further searching yields little incremental gain. A larger term_buffer (150 rather than 50), on the other hand, allocates more iterations to adjusting the initial typical-set estimates according to the findings from the adaptation process.
    • Despite the rationalization above, increasing (window, term_buffer) beyond (35, 150) does not necessarily improve things further, because the adaptation process is complicated. There is no obvious relation between the run time and these tuning parameters; they apparently interact with warmup and iter in a complicated way to determine the run time.
  • During my comparison, I got the impression that choosing an unnecessarily high adapt_delta often results in hitting max_treedepth (many such instances can lead to a rather long run time). So avoiding divergences is a matter of finding a suitable level of adapt_delta, not of setting it as high as possible.
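
To make the window/term_buffer discussion more concrete, here is a small Python sketch of how I understand Stan's warmup to be split: an initial fast phase (init_buffer), a slow phase made of windows that double in length, and a final fast phase (term_buffer). It is a simplified illustration and not Stan's actual code; the function name slow_windows and the warmup length of 1000 in the usage lines are my own assumptions, not values from the runs above.

```python
def slow_windows(num_warmup, init_buffer=75, window=25, term_buffer=50):
    """Return the lengths of the slow-adaptation windows (simplified sketch).

    Assumption: windows start at `window` iterations and double each time;
    when the window after the current one would no longer fit into the slow
    phase, the current window is stretched to the end of the slow phase.
    """
    slow_len = num_warmup - init_buffer - term_buffer
    if slow_len < window:
        raise ValueError("warmup is too short for the requested buffers and window")

    sizes = []
    remaining = slow_len
    size = window
    while remaining > 0:
        # The following window would be twice as long as this one.
        # If it does not fit, absorb all remaining iterations into this window.
        if remaining - size < 2 * size:
            size = remaining
        sizes.append(size)
        remaining -= size
        size *= 2
    return sizes


# Hypothetical warmup of 1000 iterations, for illustration only.
default_schedule = slow_windows(1000)                            # defaults (25, 75, 50)
alt_schedule = slow_windows(1000, window=35, term_buffer=150)    # (35, 150) as above
print(default_schedule, alt_schedule)
```

Under these assumptions, a larger initial window means fewer, longer windows in the slow phase, and a larger term_buffer takes iterations away from the slow phase, which is at least consistent with the guess above. The actual split for my runs depends on the warmup length used, so the numbers printed here should not be read as my settings.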