Max_treedepth warnings in rstan 2.15.x

Operating System: Fedora Linux 25

Have warning messages changed in the current version of rstan/StanHeaders, or has the actual sampling changed in some significant manner?

A model that ran successfully in 2.14.x with at worst a divergent transition or two generated max_treedepth exceeded warnings for all or all but a few transitions after updating rstan and dependencies to 2.15.1 (installed from CRAN).

I rolled back my installation to 2.14.0, and the same model again produces no warnings except for the odd divergent transition (which appears not to signal any particular pathology). However, virtually every transition does reach the (default) maximum treedepth. All other diagnostics indicate satisfactory performance.

I can offer more information if needed, including the Stan model and data. Right now I’m just wondering if Stan is being more aggressive with warning messages.

I’m pretty sure that I’ve yet to see a model that hits maximum treedepth but actually samples correctly. You’re saying that with 2.14 you can start ten chains from different initial values and then you get a good multi-chain split R-hat despite hitting max tree depth? There were some issues a while ago with rstan swallowing warning messages but I’m not sure if these changed from 2.14 to 2.15… @bgoodri?
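For readers unfamiliar with the diagnostic mentioned above, here is a minimal sketch of split R-hat for a single scalar parameter. It uses the textbook formula in plain Python rather than rstan's own implementation, and the function name `split_rhat` and the simulated chains are purely illustrative assumptions.

```python
import math
import random
import statistics

def split_rhat(chains):
    """Split R-hat for one scalar parameter.

    `chains` is a list of equal-length lists of post-warmup draws.
    """
    half = len(chains[0]) // 2
    # Split every chain in two so that drift within a chain
    # shows up as disagreement between its halves.
    halves = []
    for chain in chains:
        halves.append(chain[:half])
        halves.append(chain[half:2 * half])
    n = half
    means = [statistics.fmean(h) for h in halves]
    within = statistics.fmean([statistics.variance(h) for h in halves])  # W
    between_over_n = statistics.variance(means)                          # B / n
    var_plus = (n - 1) / n * within + between_over_n
    return math.sqrt(var_plus / within)

# Illustrative fake draws (not output from any real model).
rng = random.Random(1)
mixed = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(4)]
stuck = [[rng.gauss(mu, 1.0) for _ in range(1000)] for mu in (0.0, 0.0, 5.0, 5.0)]

rhat_mixed = split_rhat(mixed)   # chains agree, so this should be close to 1
rhat_stuck = split_rhat(stuck)   # chains disagree, so this should be well above 1
```

Values near 1 indicate that the chains agree with each other; chains started from dispersed initial values that settle in different places push the statistic well above 1.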

It’s not that it won’t sample correctly asymptotically, but that it’ll devolve to a random walk if you don’t take the Hamiltonian simulation all the way to a U-turn. This can sometimes still work, but it’s dangerous as random walks can wind up being very biased in the finite regime.

No, 2.15 is not being more aggressive with warnings, but it might have had an off-by-one error fixed, so that the reported tree depth now differs by one from what 2.14 reported. What are the tree depths you get in 2.14?

The sampling hasn’t been changed per se, but the adaptation is now not counting the initial state. So you may be adapting to smaller step sizes. What do those look like in the two situations?

Thanks to both of you.

I’m actually running 4 chains in parallel on a Core i7. A maximum likelihood estimate is easily calculated for this model, and with that in hand I am rescaling the data so that the ML estimates for all parameters lie between 0 and 1. Nonnegativity is a hard constraint in the model, and most ML parameter estimates will be zero. The ML estimates are fed to Stan as inits with some jitter, and with the zero-bound parameters perturbed away from 0. I’ve run variations of the model with random inits that converge to the same posterior, but those take longer.

Right now I’m looking at simulated data. The model and “predictors” I’m using to generate the fake data are the same ones I’m fitting, so there’s no question of model misspecification. I don’t know what the posteriors should look like in detail, but I do know what the parameter means should be, provided the prior isn’t biasing things too much, which is basically what I’m investigating right now.

The run I originally queried about hit a treedepth of 10 and used the maximum number of leapfrog steps for every transition. I re-ran the model with the maximum treedepth set to 12, and this time it hit a treedepth of 11 for every transition. So except for the fact that it took over twice the wall time, everything looks good.
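The roughly doubled wall time is what one would expect from the tree size alone. The sketch below assumes that a fully built trajectory tree of depth d takes 2^d - 1 leapfrog steps, which is the commonly reported relationship; the exact counting convention may differ between Stan versions, and the helper name `leapfrog_steps` is just for illustration.

```python
def leapfrog_steps(treedepth):
    # Assumption: a fully built binary tree of this depth
    # costs 2**treedepth - 1 leapfrog steps per transition.
    return 2 ** treedepth - 1

for depth in (10, 11, 12):
    print(depth, leapfrog_steps(depth))

# Relative cost per transition of depth 11 versus depth 10.
ratio = leapfrog_steps(11) / leapfrog_steps(10)
print(ratio)
```

Under that assumption each extra level of depth slightly more than doubles the gradient evaluations per transition, so a run that sits at depth 11 instead of 10 should cost a bit over twice as much.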

A couple of trace plots of model parameters first: This first one is the parameter that had the maximum ML estimate. The max_treedepth=10 run is on top. By the way this is why I got excited about Stan: no other sampler I’ve tried gets to a stationary distribution in a remotely tractable number of iterations.

OK, that’s not helpful. Discourse will only allow me to include one image per message as a new user, and I had 5 with various diagnostic plots. I guess I’ll have to break them up.

Another trace plot:



And apologies for having to split one post into 4.

I guess next I will reinstall ver. 2.15.x and see if there are any real differences in performance. And think about whether I’m worried about the sampler behavior with default control parameters.

I think I have used treedepth as high as 14 or 16 before with success, so it might be worth pushing it higher. You might come out ahead in terms of n_eff/s even if the wall time per iteration is slower.
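To make the n_eff/s comparison concrete, here is a tiny sketch. All of the numbers are made up purely for illustration and do not come from the runs in this thread; `ess_per_second` is a hypothetical helper, not an rstan function.

```python
def ess_per_second(n_eff, wall_seconds):
    # Effective samples obtained per second of wall time.
    return n_eff / wall_seconds

# Hypothetical runs (invented numbers, for illustration only).
shallow = ess_per_second(n_eff=200, wall_seconds=600)    # capped at a low treedepth
deep = ess_per_second(n_eff=1800, wall_seconds=1300)     # allowed a deeper tree

deeper_is_better = deep > shallow
print(shallow, deep, deeper_is_better)
```

The point is only that a run taking more than twice as long can still win if its effective sample size grows by a larger factor.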

Thanks for being patient and posting. You’ll need an admin to reconfigure Discourse. @betanalpha? The admins asked us not to hijack threads discussing Discourse, so I’ll reply to the content in another post.

Great. Thanks much for following up.

You’re going to run into terrible numerical issues with a variable that is constrained to be positive but wants to be zero (that is, the mass is concentrated near zero). What happens is that the positive parameter is log-transformed to the unconstrained scale, which is where sampling is done. The problem is returning to the constrained scale, which we do with an exp() transform that is prone to rounding to zero (underflow).
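The underflow is easy to reproduce in ordinary double precision. This is only a sketch of the arithmetic, not Stan's actual transform code; `constrain_positive` is an illustrative name.

```python
import math

def constrain_positive(u):
    """Map an unconstrained value u back to the positive scale via exp(u)."""
    return math.exp(u)

ok = constrain_positive(-700.0)         # tiny, but still a positive double
underflow = constrain_positive(-800.0)  # too small to represent as a double

print(ok > 0.0)
print(underflow == 0.0)

# Once the value has rounded to zero it can no longer be mapped
# back to the unconstrained scale, because log(0) is undefined.
try:
    math.log(underflow)
    round_trip_ok = True
except ValueError:
    round_trip_ok = False
print(round_trip_ok)
```

So when the posterior mass piles up against zero, the sampler can wander far enough into the negative unconstrained scale that the constrained value becomes exactly 0, violating the strict positivity the model assumes.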

Those plots on n_eff/N were really illustrative of what happens with difficult geometry. Glad Stan’s working, even if slowly.

Maybe someone has a better-behaved unconstraining transform or a reparameterization for positive-constrained values with substantial mass near zero.