Problems remain after non-centered parametrization of correlated parameters

tlyim · June 8, 2019, 2:36pm

Would appreciate your clarification of the following issues that seem confusing.

In the documentation (eg, p.29 of cmdstan-guide-2.19.1.pdf), the reported divergent__ row has the NEff equal to the 1000 iternations saved (but in my case below, I have -nan) and also Rhat = nan (my case: 9.99e-01). Are these NEff and Rhat, in either case, meaningful at all?

The Mean in my case is non-zero but very small (5.00e-03). Is this merely due to trying to report an integer value in floating point? (In other words, that reported value should be taken as 0.)

Inference for Stan model: SSM0_5_2dev_model
4 chains: each with iter=(400,400,400,400); warmup=(0,0,0,0); thin=(1,1,1,1); 1600 iterations saved.
Warmup took (7133, 7125, 7429, 6727) seconds, 7.89 hours total
Sampling took (3443, 3584, 3725, 3661) seconds, 4.00 hours total
                         Mean      MCSE    StdDev         5%        50%        95%     N_Eff   N_Eff/s     R_hat
lp__                 2.07e+04  5.76e-01  1.39e+01   2.07e+04   2.07e+04   2.07e+04  5.82e+02  4.04e-02  1.00e+00
accept_stat__        9.62e-01  2.75e-03  8.83e-02   8.48e-01   9.88e-01   9.99e-01  1.03e+03  7.16e-02  1.01e+00
stepsize__           1.79e-02  6.50e-04  9.20e-04   1.69e-02   1.79e-02   1.94e-02  2.00e+00  1.39e-04  5.60e+13
treedepth__          9.00e+00  5.04e-03  1.98e-01   9.00e+00   9.00e+00   9.00e+00  1.55e+03  1.08e-01  1.00e+00
n_leapfrog__         5.46e+02  3.59e+00  1.34e+02   5.11e+02   5.11e+02   1.02e+03  1.39e+03  9.63e-02  1.02e+00
divergent__          5.00e-03      -nan  7.06e-02   0.00e+00   0.00e+00   0.00e+00      -nan      -nan  9.99e-01
energy__            -2.05e+04  8.28e-01  1.91e+01  -2.06e+04  -2.05e+04  -2.05e+04  5.30e+02  3.68e-02  1.01e+00
sd_y                 8.01e-02  1.35e-05  5.74e-04   7.91e-02   8.01e-02   8.10e-02  1.80e+03  1.25e-01  9.98e-01
mu_u1                1.02e-01  1.99e-04  8.99e-03   8.65e-02   1.02e-01   1.16e-01  2.03e+03  1.41e-01  1.00e+00
mu_alpha             4.17e-02  4.61e-05  1.86e-03   3.87e-02   4.18e-02   4.47e-02  1.63e+03  1.13e-01  9.99e-01
beta                 5.93e-01  1.87e-04  7.52e-03   5.80e-01   5.93e-01   6.05e-01  1.62e+03  1.12e-01  1.00e+00
theta                1.53e-01  7.81e-05  3.24e-03   1.48e-01   1.53e-01   1.59e-01  1.71e+03  1.19e-01  1.00e+00
sd_season            1.02e-01  1.03e-04  4.25e-03   9.46e-02   1.01e-01   1.09e-01  1.70e+03  1.18e-01  1.00e+00
mu_season[1]        -1.26e-01  2.23e-04  1.05e-02  -1.44e-01  -1.26e-01  -1.09e-01  2.20e+03  1.53e-01  1.00e+00
mu_season[2]        -7.06e-02  2.23e-04  1.02e-02  -8.78e-02  -7.04e-02  -5.45e-02  2.07e+03  1.44e-01  1.00e+00
mu_season[3]         1.46e-01  2.42e-04  1.03e-02   1.28e-01   1.46e-01   1.63e-01  1.82e+03  1.27e-01  1.00e+00
p[1]                 6.45e-01  1.07e-03  3.83e-02   5.98e-01   6.38e-01   7.19e-01  1.27e+03  8.81e-02  1.00e+00
p[2]                 6.27e-01  1.20e-04  5.35e-03   6.18e-01   6.27e-01   6.36e-01  1.98e+03  1.37e-01  9.99e-01
p[3]                 5.97e-01  1.54e-03  5.48e-02   5.30e-01   5.87e-01   7.04e-01  1.27e+03  8.79e-02  1.00e+00
g[1]                 8.36e-01  1.09e-03  4.36e-02   7.65e-01   8.36e-01   9.10e-01  1.62e+03  1.12e-01  1.00e+00
g[2]                 3.37e-01  5.20e-04  2.02e-02   3.05e-01   3.37e-01   3.71e-01  1.51e+03  1.05e-01  1.00e+00
w[1]                 7.33e-01  1.53e-03  6.31e-02   6.24e-01   7.34e-01   8.35e-01  1.70e+03  1.18e-01  9.99e-01
w[2]                 1.41e-01  3.11e-04  1.26e-02   1.20e-01   1.41e-01   1.62e-01  1.66e+03  1.15e-01  9.99e-01
w[3]                 5.94e-01  4.87e-04  1.95e-02   5.61e-01   5.94e-01   6.25e-01  1.60e+03  1.11e-01  9.99e-01
d[1]                 6.68e-02  2.74e-04  1.07e-02   4.91e-02   6.69e-02   8.42e-02  1.54e+03  1.07e-01  1.00e+00
d[2]                 6.98e-01  2.17e-04  9.15e-03   6.83e-01   6.98e-01   7.13e-01  1.78e+03  1.23e-01  9.99e-01
d[3]                 2.35e-01  2.62e-04  1.10e-02   2.17e-01   2.35e-01   2.53e-01  1.78e+03  1.23e-01  1.00e+00

bbbales2 · June 8, 2019, 3:14pm

Don’t worry about Rhat for the divergent__ parameter, that’s for the parameters mostly.

5e-3 is the fraction of post-warmup trajectories that were divergent in all chains, so there were 5e-3 * 1600 = 8 divergences across all four chains.

This model looks like it took way longer than the previous one to run? Nearly 12 hours four 1600 draws vs. 7 hours for 4000 draws? And the treedepth is way larger and stepsize way smaller. Was this with more data or something?

Was the adapt_delta higher in this run? I guess if it has to be higher to keep divergences down then there hasn’t been any real advantage to switching to the dense adaptation.

This is the clearest document on what divergences can mean: https://mc-stan.org/users/documentation/case-studies/divergences_and_bias.html . In that case the divergences are happening in 2% of trajectories and they do not go away with an adapt_delta closer to 1.

The warning divergences give is that you might not be sampling the distribution you think you are. I guess since you have results from the diagonal adaptation which isn’t producing divergences, you could make predictions with each of them with example data and see if there are differences.

If you can check your results against the model run with diagonal adaptation. Rhat occasionally ends up less than 1.0. I don’t think what you’re seeing in your results here it’s a thing to worry about.

tlyim · June 8, 2019, 11:24pm

I have been assuming that the 1.01 threshold for Rhat you mentioned before is equivalent to a 0.99 threshold. So only the deviation from 1 matters, right?

Thanks for the clarification and additional information. I need to look into the issues involved more carefully and systematically. Let me simply reply to your question about the slow run time: I suspect that my changing the p[3]'s true parameter value resulted in divergences, which typically would lead to a much slower run time.

The p[3] of my model has been hard to estimate reliably – usually overestimating the true value (0.35) to 0.43 (but other than that, the run time then seemed all right and I didn’t have divergences). I wanted to understand to what extent the hard-to-estimate p[3] is driven by my prior (with a mean at 0.5). So in the recent runs, I set the true value of p[3] to 0.68 (above 0.5) to see whether the estimate would fall somehwere in between, which I believe would suggest the p[3] cannot be well identified by the data and thus the estimate is always pulled toward the prior’s mean (0.5). p[1:3] are the parameters of a sigmoid function modeled with inv_logit(). I guess the nonlinearity might make it hard to be accurately estimated.

bbbales2 · June 9, 2019, 8:21pm

I think Rhat is supposed to be greater than or equal to 1.0. I think it being less than one just happens randomly, but isn’t a big deal. So less than or greater than one are interpreted differently.

Only suggestion I have is maybe generate a dataset that doesn’t take 12 hours to run haha. It’s gonna be hard to make much headway with calculations that slow.

tlyim · June 10, 2019, 2:36pm

Thanks for your clarification and suggestion. Make a lot of sense. Owing to some other considerations, for now I still use 10,000 simulated data points.

I am returning to more conservative choices of warmup (1000) and delta (0.99) to first ensure no divergence. Just finished a comparison of the dense vs diag runs on the slower cluster I have access to. I believe the results confirm that it’s fine (ie, similar results) and benefitial (ie, shorter run time) to use the dense metric.

Would appreciate your clarification of one thing though. In the diag results below, the rows from stepsize to n_leapfrog have MCSE and N_Eff reported as -nan, which is different from those reported in the dense results. Is this difference something to worry about by me as a model builder?

diag results:

4 chains: each with iter=(350,350,350,350); warmup=(0,0,0,0); thin=(1,1,1,1); 1400 iterations saved.
Warmup took (30658, 31333, 31104, 31588) seconds, 34.6 hours total
Sampling took (11519, 11515, 11542, 11483) seconds, 12.8 hours total
                         Mean      MCSE    StdDev         5%        50%        95%     N_Eff   N_Eff/s     R_hat
lp__                 2.07e+04  6.17e-01  1.41e+01   2.07e+04   2.07e+04   2.07e+04  5.22e+02  1.13e-02  1.01e+00
accept_stat__        9.92e-01  1.43e-03  1.31e-02   9.66e-01   9.98e-01   1.00e+00  8.43e+01  1.83e-03  1.03e+00
stepsize__           2.05e-03      -nan  5.46e-04   1.41e-03   2.11e-03   2.90e-03      -nan      -nan  4.32e+14
treedepth__          1.00e+01      -nan  3.02e-14   1.00e+01   1.00e+01   1.00e+01      -nan      -nan  9.97e-01
n_leapfrog__         1.02e+03      -nan  8.98e-12   1.02e+03   1.02e+03   1.02e+03      -nan      -nan  9.97e-01
divergent__          0.00e+00      -nan  0.00e+00   0.00e+00   0.00e+00   0.00e+00      -nan      -nan      -nan
energy__            -2.05e+04  8.54e-01  1.95e+01  -2.06e+04  -2.05e+04  -2.05e+04  5.20e+02  1.13e-02  1.01e+00
sd_y                 8.01e-02  2.65e-05  5.77e-04   7.92e-02   8.01e-02   8.10e-02  4.74e+02  1.03e-02  1.00e+00
mu_u1                1.01e-01  5.45e-04  8.94e-03   8.58e-02   1.01e-01   1.15e-01  2.69e+02  5.85e-03  1.01e+00
mu_alpha             4.16e-02  1.18e-04  1.85e-03   3.87e-02   4.16e-02   4.47e-02  2.47e+02  5.36e-03  1.01e+00
beta                 5.93e-01  7.16e-04  7.70e-03   5.80e-01   5.93e-01   6.06e-01  1.16e+02  2.51e-03  1.03e+00
theta                1.53e-01  3.12e-04  3.36e-03   1.48e-01   1.53e-01   1.58e-01  1.16e+02  2.53e-03  1.03e+00
sd_season            1.02e-01  2.36e-04  4.21e-03   9.44e-02   1.02e-01   1.09e-01  3.19e+02  6.92e-03  1.02e+00
mu_season[1]        -1.27e-01  3.80e-04  9.74e-03  -1.43e-01  -1.27e-01  -1.10e-01  6.56e+02  1.42e-02  1.00e+00
mu_season[2]        -6.99e-02  5.43e-04  1.01e-02  -8.60e-02  -6.97e-02  -5.31e-02  3.47e+02  7.54e-03  1.01e+00
mu_season[3]         1.45e-01  5.61e-04  1.00e-02   1.29e-01   1.45e-01   1.62e-01  3.20e+02  6.95e-03  1.00e+00
p[1]                 6.43e-01  3.36e-03  3.69e-02   5.98e-01   6.36e-01   7.09e-01  1.21e+02  2.62e-03  1.03e+00
p[2]                 6.27e-01  2.00e-04  5.64e-03   6.18e-01   6.27e-01   6.36e-01  7.94e+02  1.72e-02  1.00e+00
p[3]                 5.95e-01  4.78e-03  5.26e-02   5.31e-01   5.85e-01   6.89e-01  1.21e+02  2.64e-03  1.03e+00
g[1]                 8.39e-01  2.58e-03  4.22e-02   7.72e-01   8.39e-01   9.09e-01  2.68e+02  5.82e-03  1.02e+00
g[2]                 3.37e-01  1.26e-03  1.95e-02   3.04e-01   3.37e-01   3.68e-01  2.39e+02  5.18e-03  1.02e+00
w[1]                 7.30e-01  4.82e-03  6.03e-02   6.25e-01   7.31e-01   8.23e-01  1.57e+02  3.40e-03  1.04e+00
w[2]                 1.41e-01  7.64e-04  1.24e-02   1.22e-01   1.41e-01   1.62e-01  2.62e+02  5.69e-03  1.02e+00
w[3]                 5.92e-01  9.68e-04  1.88e-02   5.61e-01   5.92e-01   6.22e-01  3.77e+02  8.18e-03  1.00e+00
d[1]                 6.71e-02  6.41e-04  1.04e-02   4.97e-02   6.72e-02   8.40e-02  2.65e+02  5.74e-03  1.00e+00
d[2]                 6.98e-01  4.35e-04  9.29e-03   6.82e-01   6.98e-01   7.14e-01  4.57e+02  9.91e-03  1.00e+00
d[3]                 2.35e-01  7.15e-04  1.10e-02   2.16e-01   2.36e-01   2.53e-01  2.38e+02  5.18e-03  1.01e+00

dense results:

4 chains: each with iter=(350,350,350,350); warmup=(0,0,0,0); thin=(1,1,1,1); 1400 iterations saved.
Warmup took (21000, 20885, 20857, 22902) seconds, 23.8 hours total
Sampling took (7961, 4248, 7934, 8000) seconds, 7.82 hours total
                         Mean      MCSE    StdDev         5%        50%        95%     N_Eff   N_Eff/s     R_hat
lp__                 2.07e+04  6.02e-01  1.34e+01   2.07e+04   2.07e+04   2.07e+04  4.94e+02  1.75e-02  1.01e+00
accept_stat__        9.90e-01  5.44e-04  2.08e-02   9.57e-01   9.96e-01   1.00e+00  1.47e+03  5.21e-02  1.01e+00
stepsize__           1.05e-02  1.52e-03  2.15e-03   8.80e-03   9.84e-03   1.42e-02  2.00e+00  7.12e-05  5.08e+14
treedepth__          9.73e+00  2.98e-01  4.46e-01   9.00e+00   1.00e+01   1.00e+01  2.23e+00  7.92e-05  3.00e+00
n_leapfrog__         8.98e+02  1.46e+02  2.20e+02   5.11e+02   1.02e+03   1.02e+03  2.27e+00  8.07e-05  2.79e+00
divergent__          0.00e+00      -nan  0.00e+00   0.00e+00   0.00e+00   0.00e+00      -nan      -nan      -nan
energy__            -2.05e+04  8.71e-01  1.91e+01  -2.06e+04  -2.05e+04  -2.05e+04  4.81e+02  1.71e-02  1.01e+00
sd_y                 8.01e-02  1.41e-05  5.67e-04   7.91e-02   8.01e-02   8.10e-02  1.61e+03  5.73e-02  9.99e-01
mu_u1                1.02e-01  2.08e-04  8.64e-03   8.75e-02   1.02e-01   1.16e-01  1.73e+03  6.13e-02  9.99e-01
mu_alpha             4.17e-02  4.73e-05  1.87e-03   3.86e-02   4.16e-02   4.48e-02  1.57e+03  5.57e-02  1.00e+00
beta                 5.92e-01  1.79e-04  7.24e-03   5.80e-01   5.92e-01   6.04e-01  1.63e+03  5.79e-02  9.98e-01
theta                1.53e-01  7.53e-05  3.14e-03   1.48e-01   1.53e-01   1.59e-01  1.73e+03  6.15e-02  9.99e-01
sd_season            1.01e-01  9.44e-05  4.18e-03   9.49e-02   1.01e-01   1.09e-01  1.96e+03  6.97e-02  9.99e-01
mu_season[1]        -1.26e-01  2.23e-04  1.01e-02  -1.43e-01  -1.26e-01  -1.09e-01  2.05e+03  7.30e-02  9.99e-01
mu_season[2]        -7.02e-02  2.50e-04  1.02e-02  -8.74e-02  -7.03e-02  -5.30e-02  1.66e+03  5.90e-02  9.99e-01
mu_season[3]         1.46e-01  2.48e-04  1.01e-02   1.29e-01   1.46e-01   1.63e-01  1.66e+03  5.92e-02  1.00e+00
p[1]                 6.45e-01  1.11e-03  3.86e-02   5.98e-01   6.38e-01   7.19e-01  1.21e+03  4.31e-02  9.99e-01
p[2]                 6.27e-01  1.23e-04  5.46e-03   6.18e-01   6.27e-01   6.36e-01  1.97e+03  7.01e-02  1.00e+00
p[3]                 5.97e-01  1.58e-03  5.52e-02   5.30e-01   5.88e-01   7.05e-01  1.22e+03  4.32e-02  9.98e-01
g[1]                 8.36e-01  1.12e-03  4.61e-02   7.60e-01   8.36e-01   9.10e-01  1.71e+03  6.08e-02  9.99e-01
g[2]                 3.38e-01  5.04e-04  1.99e-02   3.05e-01   3.37e-01   3.71e-01  1.56e+03  5.54e-02  1.00e+00
w[1]                 7.34e-01  1.63e-03  6.55e-02   6.27e-01   7.36e-01   8.37e-01  1.61e+03  5.72e-02  1.00e+00
w[2]                 1.41e-01  2.97e-04  1.31e-02   1.20e-01   1.41e-01   1.62e-01  1.94e+03  6.89e-02  1.00e+00
w[3]                 5.94e-01  4.72e-04  1.98e-02   5.63e-01   5.93e-01   6.28e-01  1.76e+03  6.26e-02  9.98e-01
d[1]                 6.73e-02  2.60e-04  1.03e-02   5.01e-02   6.75e-02   8.37e-02  1.57e+03  5.57e-02  1.00e+00
d[2]                 6.98e-01  2.36e-04  9.39e-03   6.83e-01   6.98e-01   7.14e-01  1.58e+03  5.61e-02  1.00e+00
d[3]                 2.35e-01  2.69e-04  1.09e-02   2.17e-01   2.35e-01   2.53e-01  1.64e+03  5.82e-02  1.00e+00

bbbales2 · June 10, 2019, 3:42pm

Nah, don’t worry about the stats for stepsize or n_leapfrog. It’s just the distribution of stepsizes and number of leapfrog steps used in each trajectory. NaNs just indicate every HMC trajectory used the same stepsize and n_leapfrog in the diagonal case.

I dunno why the MCSE for dense isn’t NaNs too for stepsize, cause that should be constant. I’d assume if you did the stansummary over 1 chain it still would be.

Warmup took (21000, 20885, 20857, 22902) seconds, 23.8 hours total
Sampling took (7961, 4248, 7934, 8000) seconds, 7.82 hours total

Ouch these warmup times are rough. You might benefit from doing a calculation with save_warmup=1 and examining what’s happening at warmup. I’m not sure what you might find but maybe there’s something there.

tlyim · June 19, 2019, 4:28pm

Thanks for the suggestions in your reply to my another post. I am working on the plotting as you suggest.

Meanwhile, I have a look at the effect of changing the max_depth and realize that setting it to the 9 or 10 that I had been using leads to 600 or 121 transitions hitting the limit, resp. (when running 3 x 200 = 600 transitions). Increasing to max_depth=11 or above avoids hitting the limit completely.

The max_depth=9 case also has the issue of split R-hat greater than 1.1: gw_sd[3], gw_sd[5]. But all the parameters of interest (reported below) have R-hat ~ 1.0 and the run time is significantly shorter than in the max_depth=10 case (where no R-hat > 1.1 at all).
- Should R-hat > 1.1 be avoided at all cost even when (i) the problem does not affect the good Rhat of the parameters of interest; (ii) their mean estimates remain very similar; (iii) ignoring the issue gives a much shorter run time ?
I have looked at max_depth=11 to 13 as well. The warmup time increases quite substantially even though the final sampling time is very similar and so are the mean estimates, N_Eff, and Rhat of the parameters of interest.
- Is it still worth trying much bigger max_depth, like 20, 30, …?
Had a look at the CmdStan manual concerning init_buffer, term_buffer, and window, and the adaptation process. I guess I have a basic understanding. Tried to look around (like the Stan user guide, forum, …) and found only very limited related info (such as this post). Another discussion by @aaronjg sounds very relevant but most of the responses in that post are at a level too technical for me to comprehend.
- My take from that post is that I can reduce the overall warmup time by forcing more iterations into the fast init_buffer and term_buffer intervals (eg, 200, instead of the default 75). Am I getting this right? Is it useful to adjust the stepsize at the same time and if so, how?

Would appreciate very much your feedback. Many thanks.

====================================================

max_depth=9:

600 of 600 (1e+02%) transitions hit the maximum treedepth limit of 9, or 2^9 leapfrog steps. 
Trajectories that are prematurely terminated due to this limit will result in slow exploration 
and you should increase the limit to ensure optimal performance.

The following parameters had split R-hat greater than 1.1:
  gw_sd[3], gw_sd[5]
Such high values indicate incomplete mixing and biasedestimation.  
You should consider regularization your model with additional prior information 
or looking for a more effective parameterization.
 
3 chains: each with iter=(200,200,200); warmup=(0,0,0); thin=(1,1,1); 600 iterations saved.
Warmup took (11398, 11641, 10726) seconds, 9.4 hours total
Sampling took (2992, 2994, 3009) seconds, 2.5 hours total
                       Mean     MCSE   StdDev        5%       50%       95%    N_Eff  N_Eff/s    R_hat
lp__                1.9e+04  1.2e+00  1.3e+01   1.9e+04   1.9e+04   1.9e+04  1.2e+02  1.3e-02  1.0e+00
accept_stat__       9.9e-01  1.6e-03  2.4e-02   9.5e-01   1.0e+00   1.0e+00  2.3e+02  2.5e-02  1.0e+00
stepsize__          1.4e-02  1.7e-03  2.1e-03   1.1e-02   1.4e-02   1.6e-02  1.5e+00  1.7e-04  8.0e+14
treedepth__         9.0e+00     -nan  1.4e-14   9.0e+00   9.0e+00   9.0e+00     -nan     -nan  9.9e-01
n_leapfrog__        5.1e+02     -nan  2.1e-12   5.1e+02   5.1e+02   5.1e+02     -nan     -nan     -nan
divergent__         0.0e+00     -nan  0.0e+00   0.0e+00   0.0e+00   0.0e+00     -nan     -nan     -nan
energy__           -1.8e+04  1.5e+00  1.9e+01  -1.8e+04  -1.8e+04  -1.8e+04  1.5e+02  1.7e-02  1.0e+00
sd_y                8.1e-02  2.0e-05  5.9e-04   8.0e-02   8.1e-02   8.2e-02  8.9e+02  9.9e-02  1.0e+00
mu_u1               8.1e-02  3.2e-04  9.2e-03   6.5e-02   8.1e-02   9.5e-02  8.4e+02  9.4e-02  1.0e+00
mu_alpha            4.3e-02  7.6e-05  1.9e-03   4.0e-02   4.3e-02   4.6e-02  6.5e+02  7.2e-02  1.0e+00
beta                5.8e-01  2.8e-04  8.1e-03   5.7e-01   5.8e-01   6.0e-01  8.0e+02  8.9e-02  1.0e+00
theta               1.6e-01  1.1e-04  3.4e-03   1.5e-01   1.6e-01   1.6e-01  9.6e+02  1.1e-01  1.0e+00
sd_season           9.9e-02  1.7e-04  4.4e-03   9.1e-02   9.8e-02   1.1e-01  6.8e+02  7.5e-02  1.0e+00
mu_season[1]       -1.2e-01  3.6e-04  1.1e-02  -1.4e-01  -1.2e-01  -1.0e-01  8.5e+02  9.4e-02  1.0e+00
mu_season[2]       -6.9e-02  4.1e-04  1.0e-02  -8.6e-02  -7.0e-02  -5.2e-02  6.1e+02  6.8e-02  1.0e+00
mu_season[3]        1.4e-01  4.0e-04  1.0e-02   1.2e-01   1.4e-01   1.6e-01  6.3e+02  7.0e-02  1.0e+00
p[1]                7.0e-01  2.4e-03  5.2e-02   6.3e-01   6.9e-01   7.9e-01  4.8e+02  5.3e-02  1.0e+00
p[2]                6.2e-01  2.5e-04  6.0e-03   6.1e-01   6.2e-01   6.3e-01  5.8e+02  6.5e-02  1.0e+00
p[3]                6.8e-01  3.4e-03  7.4e-02   5.8e-01   6.6e-01   8.2e-01  4.7e+02  5.3e-02  1.0e+00
g[1]                8.5e-01  1.8e-03  4.8e-02   7.8e-01   8.6e-01   9.3e-01  6.8e+02  7.6e-02  1.0e+00
g[2]                3.3e-01  8.1e-04  2.0e-02   2.9e-01   3.3e-01   3.6e-01  6.4e+02  7.1e-02  1.0e+00
w[1]                6.4e-01  3.6e-03  6.9e-02   5.3e-01   6.3e-01   7.5e-01  3.7e+02  4.1e-02  1.0e+00
w[2]                1.5e-01  6.0e-04  1.3e-02   1.3e-01   1.5e-01   1.8e-01  4.7e+02  5.2e-02  1.0e+00
w[3]                5.8e-01  8.5e-04  2.0e-02   5.5e-01   5.8e-01   6.2e-01  5.8e+02  6.4e-02  1.0e+00
d[1]                4.3e-02  5.5e-04  1.1e-02   2.6e-02   4.3e-02   6.0e-02  3.8e+02  4.2e-02  1.0e+00
d[2]                7.1e-01  3.7e-04  9.5e-03   6.9e-01   7.1e-01   7.2e-01  6.8e+02  7.5e-02  1.0e+00
d[3]                2.5e-01  4.6e-04  1.1e-02   2.3e-01   2.5e-01   2.7e-01  5.9e+02  6.6e-02  1.0e+00

max_depth=10:

121 of 600 (20%) transitions hit the maximum treedepth limit of 10, or 2^10 leapfrog steps. 
Trajectories that are prematurely terminated due to this limit will result in slow exploration 
and you should increase the limit to ensure optimal performance.

3 chains: each with iter=(200,200,200); warmup=(0,0,0); thin=(1,1,1); 600 iterations saved.
Warmup took (21050, 20714, 20333) seconds, 17 hours total
Sampling took (3058, 5142, 4021) seconds, 3.4 hours total
                       Mean     MCSE   StdDev        5%       50%       95%    N_Eff  N_Eff/s    R_hat
lp__                1.9e+04  9.5e-01  1.4e+01   1.9e+04   1.9e+04   1.9e+04  2.1e+02  1.7e-02  1.0e+00
accept_stat__       9.8e-01  1.5e-03  3.7e-02   9.4e-01   1.0e+00   1.0e+00  5.9e+02  4.9e-02  1.0e+00
stepsize__          1.4e-02  1.4e-03  1.7e-03   1.1e-02   1.4e-02   1.5e-02  1.5e+00  1.2e-04  2.6e+14
treedepth__         9.2e+00     -nan  4.0e-01   9.0e+00   9.0e+00   1.0e+01     -nan     -nan  1.2e+00
n_leapfrog__        6.9e+02  1.2e+02  2.4e+02   5.1e+02   5.1e+02   1.0e+03  3.9e+00  3.2e-04  1.3e+00
divergent__         0.0e+00     -nan  0.0e+00   0.0e+00   0.0e+00   0.0e+00     -nan     -nan     -nan
energy__           -1.8e+04  1.3e+00  1.9e+01  -1.8e+04  -1.8e+04  -1.8e+04  2.0e+02  1.6e-02  1.0e+00
sd_y                8.1e-02  2.2e-05  6.2e-04   8.0e-02   8.1e-02   8.2e-02  8.1e+02  6.6e-02  1.0e+00
mu_u1               8.0e-02  3.6e-04  9.8e-03   6.4e-02   8.1e-02   9.6e-02  7.6e+02  6.2e-02  1.0e+00
mu_alpha            4.3e-02  7.5e-05  1.9e-03   4.0e-02   4.3e-02   4.6e-02  6.3e+02  5.1e-02  1.0e+00
beta                5.9e-01  2.9e-04  7.8e-03   5.7e-01   5.8e-01   6.0e-01  7.0e+02  5.7e-02  1.0e+00
theta               1.6e-01  1.3e-04  3.3e-03   1.5e-01   1.6e-01   1.6e-01  6.7e+02  5.5e-02  1.0e+00
sd_season           9.9e-02  1.5e-04  4.5e-03   9.1e-02   9.9e-02   1.1e-01  8.5e+02  7.0e-02  1.0e+00
mu_season[1]       -1.2e-01  3.2e-04  1.0e-02  -1.4e-01  -1.2e-01  -1.0e-01  1.0e+03  8.3e-02  1.0e+00
mu_season[2]       -6.9e-02  3.9e-04  1.1e-02  -8.7e-02  -6.9e-02  -5.1e-02  7.4e+02  6.0e-02  1.0e+00
mu_season[3]        1.4e-01  3.8e-04  1.0e-02   1.3e-01   1.4e-01   1.6e-01  7.2e+02  5.9e-02  1.0e+00
p[1]                7.1e-01  2.5e-03  6.0e-02   6.3e-01   7.0e-01   8.2e-01  5.8e+02  4.8e-02  1.0e+00
p[2]                6.2e-01  2.2e-04  5.7e-03   6.1e-01   6.2e-01   6.3e-01  6.6e+02  5.4e-02  1.0e+00
p[3]                6.9e-01  3.5e-03  8.6e-02   5.9e-01   6.8e-01   8.5e-01  5.9e+02  4.9e-02  1.0e+00
g[1]                8.5e-01  1.6e-03  4.9e-02   7.7e-01   8.5e-01   9.3e-01  9.4e+02  7.7e-02  1.0e+00
g[2]                3.3e-01  8.1e-04  2.0e-02   3.0e-01   3.3e-01   3.6e-01  6.4e+02  5.3e-02  1.0e+00
w[1]                6.4e-01  2.8e-03  6.4e-02   5.3e-01   6.4e-01   7.4e-01  5.3e+02  4.4e-02  1.0e+00
w[2]                1.5e-01  4.7e-04  1.3e-02   1.3e-01   1.5e-01   1.7e-01  7.0e+02  5.7e-02  1.0e+00
w[3]                5.8e-01  7.5e-04  2.1e-02   5.5e-01   5.8e-01   6.2e-01  8.2e+02  6.7e-02  1.0e+00
d[1]                4.4e-02  4.1e-04  1.1e-02   2.6e-02   4.4e-02   6.2e-02  6.9e+02  5.7e-02  1.0e+00
d[2]                7.1e-01  3.6e-04  9.9e-03   6.9e-01   7.1e-01   7.2e-01  7.4e+02  6.1e-02  1.0e+00
d[3]                2.5e-01  5.1e-04  1.1e-02   2.3e-01   2.5e-01   2.7e-01  4.8e+02  4.0e-02  1.0e+00

max_depth=11:

3 chains: each with iter=(200,200,200); warmup=(0,0,0); thin=(1,1,1); 600 iterations saved.
Warmup took (28354, 29356, 33217) seconds, 25 hours total
Sampling took (3291, 5990, 5615) seconds, 4.1 hours total
                       Mean     MCSE   StdDev        5%       50%       95%    N_Eff  N_Eff/s    R_hat
lp__                1.9e+04  9.4e-01  1.4e+01   1.9e+04   1.9e+04   1.9e+04  2.1e+02  1.4e-02  1.0e+00
accept_stat__       9.9e-01  7.4e-04  1.8e-02   9.6e-01   1.0e+00   1.0e+00  6.1e+02  4.1e-02  1.0e+00
stepsize__          1.1e-02  1.5e-03  1.9e-03   9.0e-03   1.0e-02   1.3e-02  1.5e+00  1.0e-04  3.5e+14
treedepth__         9.6e+00  3.5e-01  4.9e-01   9.0e+00   1.0e+01   1.0e+01  2.0e+00  1.4e-04  1.9e+00
n_leapfrog__        8.4e+02  1.7e+02  2.4e+02   5.1e+02   1.0e+03   1.0e+03  2.0e+00  1.3e-04  2.0e+00
divergent__         0.0e+00     -nan  0.0e+00   0.0e+00   0.0e+00   0.0e+00     -nan     -nan     -nan
energy__           -1.8e+04  1.4e+00  1.9e+01  -1.8e+04  -1.8e+04  -1.8e+04  1.9e+02  1.3e-02  1.0e+00
sd_y                8.1e-02  2.3e-05  6.2e-04   8.0e-02   8.1e-02   8.2e-02  7.1e+02  4.8e-02  1.0e+00
mu_u1               8.1e-02  3.8e-04  9.5e-03   6.5e-02   8.1e-02   9.6e-02  6.1e+02  4.1e-02  1.0e+00
mu_alpha            4.3e-02  8.0e-05  2.0e-03   4.0e-02   4.3e-02   4.6e-02  6.3e+02  4.2e-02  1.0e+00
beta                5.8e-01  3.1e-04  7.6e-03   5.7e-01   5.8e-01   6.0e-01  6.2e+02  4.2e-02  1.0e+00
theta               1.6e-01  1.3e-04  3.3e-03   1.5e-01   1.6e-01   1.6e-01  6.6e+02  4.4e-02  1.0e+00
sd_season           9.9e-02  1.6e-04  4.5e-03   9.2e-02   9.8e-02   1.1e-01  7.5e+02  5.0e-02  1.0e+00
mu_season[1]       -1.2e-01  3.6e-04  1.1e-02  -1.4e-01  -1.2e-01  -1.0e-01  8.5e+02  5.7e-02  1.0e+00
mu_season[2]       -6.9e-02  4.0e-04  1.0e-02  -8.6e-02  -6.9e-02  -5.2e-02  6.9e+02  4.6e-02  1.0e+00
mu_season[3]        1.4e-01  3.8e-04  1.1e-02   1.3e-01   1.4e-01   1.6e-01  8.1e+02  5.4e-02  1.0e+00
p[1]                7.0e-01  2.5e-03  5.6e-02   6.3e-01   7.0e-01   8.1e-01  5.1e+02  3.4e-02  1.0e+00
p[2]                6.2e-01  2.1e-04  5.5e-03   6.1e-01   6.2e-01   6.2e-01  6.8e+02  4.6e-02  1.0e+00
p[3]                6.9e-01  3.6e-03  8.0e-02   5.8e-01   6.8e-01   8.4e-01  5.0e+02  3.4e-02  1.0e+00
g[1]                8.6e-01  1.7e-03  4.6e-02   7.8e-01   8.5e-01   9.3e-01  7.4e+02  5.0e-02  1.0e+00
g[2]                3.3e-01  7.0e-04  2.1e-02   2.9e-01   3.3e-01   3.6e-01  9.4e+02  6.3e-02  1.0e+00
w[1]                6.4e-01  2.5e-03  6.6e-02   5.4e-01   6.4e-01   7.5e-01  7.1e+02  4.8e-02  1.0e+00
w[2]                1.5e-01  4.5e-04  1.2e-02   1.3e-01   1.5e-01   1.7e-01  7.3e+02  4.9e-02  1.0e+00
w[3]                5.8e-01  7.5e-04  2.0e-02   5.5e-01   5.8e-01   6.2e-01  7.1e+02  4.7e-02  1.0e+00
d[1]                4.4e-02  4.4e-04  1.1e-02   2.5e-02   4.4e-02   6.3e-02  6.6e+02  4.4e-02  1.0e+00
d[2]                7.1e-01  3.7e-04  9.9e-03   6.9e-01   7.1e-01   7.3e-01  7.3e+02  4.9e-02  1.0e+00
d[3]                2.5e-01  5.0e-04  1.2e-02   2.3e-01   2.5e-01   2.7e-01  5.8e+02  3.9e-02  1.0e+00

bbbales2 · June 19, 2019, 5:50pm

Yes, unfortunately. You might look at what the different chains are doing, or what your estimates of these parameters are. You might be able to get around this with tighter priors if you can justify that in your modeling.

No, if your Rhats are good at 11 and you’re not getting treedepth warnings, no need to make the max treedepth deeper. Also this is a log scale sorta thing. Max treedepth = 10 means that NUTS will use at most 2^10 leapfrog steps to generate a sample. 11 means NUTS will use at most 2^11 leapfrog steps, etc.

It’s only a guess. It may or may not get you anywhere, but if adaptation starts before you’re near the solution then the adaptation itself might screw things up. I would just leave the stepsize adaptation at default and first try making the init_buffer larger.

You can get the stepsize from the output csv files. If you make a plot of that you can search to see if anything weird happens (for instance, the stepsize plummets to 1e-10 in part of warmup). It looks like for the max_treedepth = 11 chain, the final stepsize you take is 1e-2, so that’s what to compare things against. I haven’t really looked into this much, but I’d guess you should stay within a couple orders of magnitude of this through the rest of adaptation. I could easily be wrong there though – just a guess.

tlyim · June 21, 2019, 12:52am

Would appreciate feedback on whether the following weird experience is a bug or my mistake.

I cannot get the warmups into the csv files (eg, samples1.csv (1.4 MB) ) even when I used the save_warmup=1 option and the stansummary report cmdStan0.5.2a_bkgd.sh.o6080738.txt (132.9 KB) suggests that this was successfully done:

3 chains: each with iter=(6,6,6); warmup=(8,8,8); thin=(1,1,1); 42 iterations saved.
...
...
Warning: non-fatal error reading adapation data
Warning: non-fatal error reading adapation data
Warning: non-fatal error reading adapation data
2 of 42 (4.8%) transitions ended with a divergence.

In the shell script (cmdStan0.5.2a_bkgd.sh.txt (3.2 KB) ), I previously used an environment variable but now hard-coded the save_warmup=1 option. This still cannot put the warmups into the csv files:

for i in {1..3}
do
nohup ../SSM0.5.2a sample algorithm=hmc engine=nuts max_depth=$MAXDEPTH metric=$METRIC num_samples=$N_ITER num_warmup=$N_WARMUP save_warmup=1 adapt delta=$DELTA init_buffer=$INIT_BUFFER term_buffer=$TERM_BUFFER window=$WINDOW random seed=$SEED id=$i data file=../SSM_data_J90d.dat output file=/scratch/city/${NOW}_${JOB_ID}/samples$i.csv > $HOME/$SHELLSCRIPT.${NOW}_${JOB_ID}.$i.out 2> $HOME/$SHELLSCRIPT.${NOW}_${JOB_ID}.$i.err &
done

jobs
wait

tlyim · June 21, 2019, 1:02am

The above was for cmdstan 2.19.1. For rstan 2.18.2, I also tried save_warmup = TRUE (or omitting save_warmup, whose default should be TRUE) to obtain the model fit. The attempt to plot with mcmc_trace() also shows no warmup at all. (Although here, I don’t have the csv files and cannot look into them to confirm that the option did not really save the warmups.)

bbbales2 · June 21, 2019, 9:57pm

Lines 40-47 in samples.csv are warmup samples. Everything before the “# Adaptation terminated”. You can parse this input using read_stan_csv in R or do it manually with read_csv or whatnot.

You have an incredibly large output. If you have things in the transformed parameters block you might move them to the model block so they don’t print.

tlyim · June 25, 2019, 3:23pm

Thanks for the tips. I summarized the current status of my exploration in this post.

tlyim · June 30, 2019, 4:23pm

After comparing many combinations of the tuning parameters (warmup, iter, window , term_buffer , max_treedepth , metric=dense_e vs. diag_e , adapt_delta), the following combinations seem to give the shortest run time while keeping the sampling results at an acceptable level:

diag5.4K[35,150,0.82,600]_5.8 1.4hrs.txt (4.3 KB)
dense5.4K[35,150,0.982,600]_6.7 1.4hrs.txt (4.3 KB)
diag9K[35,150,0.82,600]_12 4.1hrs.txt (4.3 KB)
dense9K[25,50,0.982,600]_13.7 3.7hrs.txt (4.3 KB)

The takeways seem to be

For a given amount of simulated data points (9K or 5.4K), diag_e runs a bit faster than dense_e mainly because the former can take a lower level of adapt_delta (0.82, instead of 0.982) without leading to divergences.
- But as far as the parameters of interest are concerned, the N_Effs from diag_e seem to be more diverse, with the lowest being acceptable but substantially lower than the lowest of those from dense_e.
The default choices of window, init_buffer, and term_buffer (=25, 75, 50) appear to be quite good: alternative choices do not lead to dramatic improvements, especially when the sample size is smaller (ie, 5.4K, rather than 90K).
- (window, term_buffer) = (35, 150) performs slightly better. Not sure of the reason but my guess is that a larger window (35, rather than 25) reduces the iterations spent on the last and slowiest stage of the adaptation process, where further search for improvement has little incremental gains. On the other hand, a larger term_buffer (150, rather than 50) allocates more iternations to optimally adjust the initial typical set of estimates according to the findings from the adaptation process.
- Despite the rationalization given above, increasing (window, term_buffer) beyond (35, 150) would not necessarily improve further owing to the complicated adaptation process. There is simply no obvious relation between the run time and these tuning parameters. Apparently, they interact in a complicated way with the warmup and iter to determine the run time.
During the course of my comparison, I have the impression that choosing an unnecessarily high level of adapt_delta often results in hitting the max_treedepth (many such instances can lead to a rather long run time). So to avoid divergences, it’s about finding the suitable level of adapt_delta rather than the higher the better.

bbbales2 · July 1, 2019, 9:10pm

That all sounds reasonable.

Yes

This makes me think dense would still be worth doing. But it sounds like it’s been pretty annoying to get the dense metric to run without divergences, which is no good.

Thanks for taking the time to report this stuff back!

betanalpha · July 3, 2019, 1:40am

Divergences imply that your model features strong, possibly spatially-varying, curvature that compromises the symplectic integration. Increasing adapt_delta might help, but at the cost of more expensive transitions. Really the ideal is to identify exactly which features of your model are inducing the strong curvature (often strong correlations from weak or non identifiabilities between parameters in the likelihood function) and remedy them with modification to your model implementation (reparameterization, rescaling of data, stronger priors, additional data, etc.

In other words increasing adapt_delta is a bandaid – it might help for paper cuts but slapping on more bandaid when you’re hemorrhaging isn’t going to be of much use to anyone.

Topic		Replies	Views
Adaptation and non-centered parameterisations Developers	10	1243	August 2, 2021
Divergences in a non-centered computational model Modeling fitting-issues	21	1505	October 30, 2019
How to speed up sampling in rstan? Modeling rstan , performance , hierarchical-model	6	6167	July 24, 2020
How to determine the combination of adapt_delta / max_treedepth / metric=dense_e vs. diag_e / term_buffer / window / warmup / iter? Modeling	2	1664	May 8, 2019
Non-centered parameterization for non-hierarchical parameters? Modeling reparametrization	15	1488	May 6, 2021

Problems remain after non-centered parametrization of correlated parameters

Related topics