Mysterious treedepth issues

I have a complicated model I’m trying to debug. Sparing you the details of the model for the time being, this project has raised a more general question about model diagnostics and posterior geometry.

This question is:
What can cause treedepth issues, given that a pairplot of the parameters looks almost like a set of independent normal distributions?

I mean, usually if there’s a treedepth issue it’s either because the curvature varies strongly, or because of some kind of non-identifiability. Both of these should show up in a pairplot. Of course, if the problematic structure is rotated in parameter space in such a way that it can’t be seen by looking at just two posterior dimensions at a time (as a pairplot does), then such a thing might be present in the posterior yet undetectable with the usual tools.
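(For concreteness: one check I can imagine is to rotate the draws into their principal-component axes and pairplot those instead of the raw parameters, though PCA would only surface linear structure, not curved funnels. A minimal sketch, assuming fit is the stanfit object:)

# Rotate the posterior draws into their principal-component axes and
# pairplot the rotated coordinates instead of the raw parameters.
draws <- as.matrix(fit)                      # iterations x parameters
draws <- draws[, colnames(draws) != "lp__"]  # drop the log-density column
pca <- prcomp(draws, center = TRUE, scale. = TRUE)
pairs(pca$x[, 1:5], pch = ".")  # pairplot of the leading rotated coordinates
plot(pca$sdev, log = "y")       # scree plot; extreme scale ratios hint at trouble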

Maybe I have such a case?

So my question extends to the following:
How can I check for such rotated problematic geometries?
Is there anything else that could cause treedepth issues that isn’t necessarily visible in a pairplot?

I’d LOVE to be able to see the actual paths of the leapfrog steps. I wonder if that would help?

Just as an example, here’s the pairplot of my model:

The pairplot includes all scalar parameters and one dimension of each vector parameter. The other dimensions look similar, although some are not quite as nicely behaved (but still fine, I think).

And since Stan doesn’t output the internal representation, here’s a transformation of the samples to what I think the internal representation looks like. For some parameters I used non-centered parameterizations, so that’s also shown here.
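(For illustration: for a parameter with a centered parameterization theta ~ normal(mu, sigma) in the model, the transformation I mean is just the standardization below; theta, mu, and sigma stand for the corresponding columns of the draws.)

# Back-transform centered draws to the standardized scale that a
# non-centered parameterization would sample on.
theta_raw <- (theta - mu) / sigma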

For some of the parameters I put in temporary, very tight Gaussian priors, expecting that this would solve the treedepth issue and allow me to narrow in on its cause. This didn’t work, however, as you can see in the plot: it’s now all nice and Gaussian but still has treedepth issues.
I capped treedepth at 9, so you can see in the plot that every iteration hits the maximum treedepth. Given how Gaussian and mostly uncorrelated the pairplot looks, treedepth should be way lower. There is also no variable that is correlated with lp__, so I think this probably rules out a Neal’s-funnel-type problem.
The pairplot includes the iteration number as a variable, so you can see the traceplots of the parameters. For some parameters the trace looks very ugly.
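(For reference, the per-iteration tree depths can be pulled out of the sampler parameters in RStan like this; fit is the stanfit object.)

# Tabulate post-warmup tree depths across all chains.
sampler_params <- get_sampler_params(fit, inc_warmup = FALSE)
treedepths <- do.call(rbind, sampler_params)[, "treedepth__"]
table(treedepths)  # everything at 9 means every iteration saturates the cap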

My question is in principle independent of any particular model, but just as background, here’s some info about this particular example model, to give you an idea of what type of model I’m having these problems with:
The model has a log link function and a Poisson distribution. It has an observation-level error term (called “datapointEffect”) whose dispersion parameter is “datapointSd”. It has two Gaussian process terms, one used as a nonparametric predictor of something that varies over time (over two non-connected timeframes; parameters with the word “season” in them), the other to account for temporal autocorrelation. It also has a single parametric predictor (“yearEffect”).

I already tried all sorts of reparameterizations (e.g. combining the observation-level error term with the autocorrelation Gaussian field), but so far nothing has solved the issue or even let me find its cause.

Which interface version are you using? Before Stan 2.21 (last October’s release, so far only in CmdStan, I think) there was a bug that caused higher-than-necessary treedepth in Gaussian models. If it’s related to that, increasing adapt_delta should help.
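In RStan that would be something like the following (model and data_list are placeholders for your own objects):

# Raise the target acceptance rate (default 0.8), forcing a smaller step size.
fit <- sampling(model, data = data_list,
                control = list(adapt_delta = 0.95))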

If you are using the poisson_log_glm function, there is a bug in it that caused tree depth problems in my model.


I’m using rstan.
Do you have a link to that bug?
But it’s not just that NUTS chooses too high a treedepth; the mixing is also really terrible. That indicates to me that there must be some issue with either the posterior geometry or the adaptation.

Hm, thanks for the link!

My model uses y ~ poisson(exp(linearPredictor)). After reading your link I tried y ~ poisson_log(linearPredictor) and target += poisson_lpmf(y | exp(linearPredictor)). Neither worked, so I don’t think the bug described in your post affects my model.

Well, debugging models is hard :-( I generally find big models almost non-debuggable. The strategy that has worked best for me so far is to start with a simple model, debug it really well (e.g. via simulation-based calibration), and then add structure piece by piece, always doing heavy checks. This way you notice when the model starts to misbehave.

If you haven’t read it yet, some additional visualisation options (trivariate plots, parallel coordinates plots) are discussed in the bayesplot vignette https://mc-stan.org/bayesplot/articles/visual-mcmc-diagnostics.html and in my post at https://www.martinmodrak.cz/2018/05/14/identifying-non-identifiability/ . Often, playing with ShinyStan is helpful.
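For example, the parallel coordinates plot from that vignette can be produced like this (a sketch; fit is your stanfit object):

library(bayesplot)
np <- nuts_params(fit)                 # NUTS diagnostics, for highlighting divergences
mcmc_parcoord(as.array(fit), np = np)  # one line per draw, across all parameters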

Also, some of your variables might be slightly bimodal (e.g. AutoCorrelationEffectOfCenter): look at the individual chains; do they mix slowly, or do they stay separate?
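A quick way to check (a sketch, using the parameter named above):

library(bayesplot)
# Separated per-chain densities suggest multimodality; overlapping but
# slowly wandering traces suggest slow mixing.
mcmc_trace(as.array(fit), pars = "AutoCorrelationEffectOfCenter")
mcmc_dens_overlay(as.array(fit), pars = "AutoCorrelationEffectOfCenter")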

I did tests with simpler versions of the model, and also used trivariate and tetravariate (using colour) and parallel-coordinate plots, but didn’t find the cause. That’s why I started to wonder whether there’s something else at work than the usually discussed problems. But of course you might still be entirely right. Maybe it’s just the usual stuff, well hidden by the model’s complexity.

So, I guess your answers to my two questions (after expansion of the main question) are basically:

  1. “Difficult, unless you have an idea of the angle of rotation.”
  2. “I don’t know of anything else.”

Maybe @betanalpha has some ideas?

Running with diagnostic_file=<something> will create a CSV with the sample values in the unconstrained space on which Stan actually samples. If you have lots of constrained types then you’ll want to be investigating this unconstrained space.
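In RStan this can be passed through sampling (a sketch; model and data_list are placeholders, and the file name is arbitrary):

# Write the unconstrained draws (plus, I believe, momenta and gradients) to CSV.
fit <- sampling(model, data = data_list,
                diagnostic_file = "diagnostics.csv")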

Saturated tree depths just mean long numerical trajectories. Those can be caused either by long integration times, consistent with an unfortunate geometry, or by small step sizes/poor inverse metrics, consistent with poor adaptation.

To distinguish the two possibilities it will help to investigate the adaptation information. In RStan you can do

# Post-warmup sampler diagnostics, one matrix per chain.
sampler_params <- get_sampler_params(fit, inc_warmup=FALSE)
# Adapted step size for each of the (here four) chains.
stepsizes <- sapply(1:4, function(n) sampler_params[[n]][,'stepsize__'][1])

or

get_adaptation_info(fit)

to get the step size and inverse metric components in an unfortunate string format that you’ll have to convert into something more manageable yourself. I recommend histogramming the inverse metric components to get a feel for their distribution.
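A rough parsing sketch (the exact string layout may differ across versions, so treat this as a starting point only):

# Extract the diagonal inverse metric elements for chain 1 and histogram them.
info <- get_adaptation_info(fit)[[1]]
info_lines <- strsplit(info, "\n")[[1]]
diag_line <- info_lines[length(info_lines)]  # last line holds the diagonal
inv_metric <- as.numeric(strsplit(gsub("#", "", diag_line), ",")[[1]])
hist(log10(inv_metric), main = "log10 inverse metric diagonal")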

It’s also hard to speculate without knowing all of the diagnostics. What is the output of

util <- new.env()
source('stan_utility.R', local=util)
util$check_all_diagnostics(fit)

The stan_utility.R file is available at https://github.com/betanalpha/knitr_case_studies/blob/master/ordinal_regression/stan_utility.R.

Trying to pin down parameters by adding narrow Gaussian densities will only make the posterior geometry worse, as the sampler will struggle to find the small region of parameter space consistent with the narrow densities. That slow relaxation is also likely to stress adaptation even further.


Ok, thanks so far. It’ll take some time to ingest these suggestions and decide on a way forward.

Hm, not sure what it’s supposed to look like, but I’ll think about it.

Hm, that’s unfortunate. Although I had the impression that in this case the sampler easily finds the mode within the first few hundred iterations, at least for the variables treated this way. Shouldn’t that mostly eliminate the issue, since there are enough subsequent warmup stages for the sampler to recover?

Samplers don’t find modes, they find and then explore typical sets (which in high dimensions are typically not anywhere near the mode in a given parameterization).

When you introduce strong priors to “pin” a parameter down you run into a few issues. First, if you can’t initialize close to the desired values then the sampler will have to struggle in the tails of that pinning prior. In general the gradients there will be huge, and the step size will have to drop to ensure accurate integration of the Hamiltonian trajectories used in Hamiltonian Monte Carlo.

This behavior is only exacerbated when you have some pinned parameters, which have large gradients out in the tail that require a small step size, and some unpinned parameters, which require a much larger step size to see any exploration. With a default maximum tree depth the resulting trajectories might also be cut off way too early, leading to inefficient random-walk behavior that prevents the sampler from going very far. Once you factor in the dynamic adaptation of the sampler configuration as well, things can get very bad.
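One way to see this conflict empirically is to compare the adapted step sizes of a fit with the pinning priors against one without (a sketch; fit_pinned and fit_unpinned are hypothetical stanfit objects):

# A much smaller adapted step size in the pinned fit points at the
# tail-gradient problem described above.
eps_pinned <- sapply(get_sampler_params(fit_pinned, inc_warmup = FALSE),
                     function(m) m[1, "stepsize__"])
eps_unpinned <- sapply(get_sampler_params(fit_unpinned, inc_warmup = FALSE),
                       function(m) m[1, "stepsize__"])
rbind(pinned = eps_pinned, unpinned = eps_unpinned)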


Oh, yeah. Bad wording on my part.

In the rest of your answer you bring up some interesting points which I think I haven’t heard before. Could be something useful to add to the documentation / tutorials / case studies / etc.
I’ll experiment with this aspect of the adaptation behaviour. Thanks for the explanation!

See https://betanalpha.github.io/assets/case_studies/probabilistic_computation.html and https://betanalpha.github.io/assets/case_studies/markov_chain_monte_carlo.html.