I’ve been fitting a large model using NUTS. I find that the maximum treedepth is often exceeded during the warmup stage, for example, reaching treedepths of 17 and 18. Obviously, this makes running the model exceedingly slow.

Surprisingly, I notice the model doesn’t need a large treedepth when sampling - the treedepth never exceeds 11 during sampling. This suggests to me that the posterior geometry is ‘well behaved’, but it seems that NUTS is having some trouble adapting.

What would be best practice in this case? Perhaps limiting the treedepth to a reasonable value during warmup, but not during sampling? Though this could lead to worse adaptation. I’m not sure which reparameterisations would be helpful here.

Hi and welcome. A couple of things that will help folks troubleshoot your problem here:
Can you post the model? And the model call to run it?
Can you share a snippet of the data (for running)? Or fake data for folks to play with?
Is this in R? Python? And if so, which versions?
Does the model run with fake data and known parameters? And can you recover those parameters?

If the sampling behavior after the warmup phase is fine then there are two possible problems. The first is that the geometry outside of the target typical set is nasty, and Stan’s dynamic Hamiltonian Monte Carlo sampler has to work hard, with long trajectories, to get through that nastiness and find the target typical set. The other is that the default tuning of the step size and inverse metric elements is poor, so the early exploration of the target typical set that informs the adaptation is necessarily slow.

What does the distribution of inverse metric elements look like? If there are strong variations then it would help to rescale your parameters so that the posterior lengthscales are more uniform and the initial tuning is less bad.

In my case, when I looked at the elements of the (diagonal) inverse mass matrix, I found that the values spanned several orders of magnitude. Rescaling the parameters whose inverse mass matrix values were far from the typical values (e.g., multiplying or dividing them by constants) significantly improved the adaptation and reduced the treedepths required during warmup.
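For anyone who wants to run the same check, here is a minimal sketch in plain Python. It assumes you have already pulled the adapted diagonal inverse metric out of your fit as a list of positive numbers, however your interface reports it. The function name `summarize_inverse_metric`, the `threshold` of one order of magnitude, and the example values are all made up for illustration and are not from the model in this thread.

```python
import math
import statistics


def summarize_inverse_metric(inv_metric_diag, threshold=1.0):
    """Summarise how unevenly scaled a diagonal inverse metric is.

    inv_metric_diag: positive floats, one per (unconstrained) parameter.
    threshold: how many orders of magnitude away from the median counts
               as "far" (an arbitrary choice for this sketch).
    """
    if not inv_metric_diag or any(v <= 0 for v in inv_metric_diag):
        raise ValueError("expected a non-empty sequence of positive values")

    log10_vals = [math.log10(v) for v in inv_metric_diag]
    median_log10 = statistics.median(log10_vals)

    return {
        # Total spread of the elements, in orders of magnitude.
        "orders_of_magnitude": max(log10_vals) - min(log10_vals),
        # The diagonal elements are variance estimates, so their square
        # roots are rough posterior standard deviations.
        "scales": [math.sqrt(v) for v in inv_metric_diag],
        # Indices of parameters whose element is far from the typical one.
        "far_from_typical": [
            i for i, lv in enumerate(log10_vals)
            if abs(lv - median_log10) > threshold
        ],
    }


# Hypothetical values, not taken from any real fit.
example = [1e-4, 0.5, 1.0, 2.0, 1e2]
summary = summarize_inverse_metric(example)
print(summary["orders_of_magnitude"])
print(summary["far_from_typical"])
```

The flagged indices are the candidates for rescaling. One option is to declare a unit-scale version of the parameter in the model and multiply it by a constant close to the corresponding entry of `scales`, so that the sampler sees posterior scales of roughly 1 for every parameter.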