Setting Max Treedepth in difficult high-dimensional models

I’ve been operating on the principle that @betanalpha’s new virial-based sampling algorithm would do a reasonable job of choosing a sampling length, and that if I hit the treedepth limit, adding additional depth would be a good thing.

On the other hand, I’ve got an 8000-dimensional model using thousands of simplexes, and it was happy to simply go on and on at a treedepth of 14 or more; getting even 100 samples was taking a day or so.

This afternoon I decided to actually reduce the treedepth to 7, and that seemed to get me into the typical set in less wall time; increasing to treedepth 9 seemed to get me into the typical set even faster. The step size is about 4 times as large at treedepth 9 as at treedepth 14. All in all, it seems I’ve improved the efficiency of this model, at least, by reducing treedepth. We’ll see how good the Rhats and traceplots are after I finally get my samples.

Is there general advice that’s applicable here? Under what circumstances is it better to increase or decrease treedepth in order to force longer or shorter trajectories?

thanks

Note: Rhat was generally better after pulling 200 samples at treedepth 9 than at treedepth 14, and it took about half an hour rather than half a day.

Of course, it also showed that the model needed work, because it wasn’t matching a global estimate to a globally observed value very well, so I’m now adjusting it and rerunning, again with treedepth 9, and I expect it to take less than an hour. Watching the CSV files already shows it reached the typical set, with stable lp__, within about 20 samples, taking about 3 to 5 minutes, compared with upwards of tens of minutes for just one sample at treedepth 14.

I’ve had some similar experiences, seemingly because until the sampler has some ‘semi-reasonable’ adaptation, the treedepth can blow out to high values. Some kind of treedepth adaptation during warmup might be helpful? I reduced the issue in my case somewhat by adjusting the adapt_init_buffer and adapt_window parameters to much lower values, like 10 and 5, so adaptation happens more quickly. Apparently such changes can pose problems, but I have not experienced any, so far as I’m aware!
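For concreteness, a minimal sketch of that kind of configuration, assuming CmdStanPy (not necessarily the interface used here); the file names are placeholders, and the adapt_init_phase / adapt_metric_window arguments are my best mapping of CmdStan’s init_buffer and window settings, so double-check them against your interface:

```python
# Minimal sketch, assuming CmdStanPy; "mymodel.stan" and "mydata.json" are
# placeholders, and the adapt_* argument names are my mapping of CmdStan's
# adapt init_buffer / window settings.
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="mymodel.stan")
fit = model.sample(
    data="mydata.json",
    chains=4,
    iter_warmup=300,
    iter_sampling=200,
    max_treedepth=9,        # cap trajectory length, as discussed in this thread
    adapt_init_phase=10,    # CmdStan's adapt init_buffer
    adapt_metric_window=5,  # CmdStan's adapt window (initial metric window)
    save_warmup=True,       # keep warmup draws so adaptation can be inspected
)
print(fit.summary())
```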

Increasing the tree depth should always improve effective sample size per iteration, and hence artificially decreasing it is not recommended. Often what happens is that increasing the tree depth threshold allows the sampler to explore more completely during adaptation (you can see whether your adaptation was being limited by looking at the tree depth of the warmup iterations), which then affects the variance/metric element estimates and then the final adapted step size. In other words, you were missing some nasty valley in your posterior initially, adapting to only what you were seeing, and getting fast but biased exploration (often you’ll see divergences, but you may need to run longer for the chain to get close enough to the entrance of the valley for divergences to manifest). By increasing the tree depth threshold, adaptation is more accurate and you get the smaller step size that you need to explore accurately.
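As a quick way to do that warmup check (a sketch, not from the reply above; it assumes the run used save_warmup so the first rows of the CmdStan output CSV are warmup draws, and the file name and counts are placeholders to adjust for your run):

```python
# Sketch: count how many warmup iterations saturated the treedepth limit.
# Assumes save_warmup was on, so the first num_warmup rows of the CmdStan
# output CSV are warmup draws; "output_1.csv" is a placeholder.
import pandas as pd

num_warmup = 1000       # whatever was requested for this run
max_treedepth = 14      # the configured limit

draws = pd.read_csv("output_1.csv", comment="#")
warmup = draws.iloc[:num_warmup]
frac = (warmup["treedepth__"] >= max_treedepth).mean()
print(f"{100 * frac:.1f}% of warmup iterations hit the treedepth limit")
```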

The fact that you’re seeing misfit is a good indication that your model is much harder to fit than you had initially thought and that the adaptation was indeed doing the right thing. It’s really, really hard for the adaptation to end up super conservative (low step size, long trajectories) when that’s not needed.

Almost all of these models wind up with long tails somewhere. Like this one: it’s looking at alcohol consumption, and it seems that on the order of 99% of all people consume 6 or more alcoholic beverages in a day at rates approaching 0 days per year (there is a ginormous spike at basically 0 + epsilon days/yr), and then 1% of people consume between 7 and 25 alcoholic beverages every other day (!!!).

The “misfit” I meant was that the estimate of total per capita consumption wasn’t matching what comes out of the tax records for sales, so it wasn’t a “misfit” in the sense of the model, but a mismatch between my estimation method and someone else’s based on different data.

The thing is, the short treedepth=9 exploration showed essentially identical distributions of beverage consumption to the treedepth=14 exploration. It gave me qualitatively identical graphs and essentially the same global estimates of consumption and the like, except it took 20 minutes instead of 20 hours. Now, one thing is that, in some sense, almost all 8000 parameters are nuisance parameters (they describe individual behavior). The thing I really care about is a series of 7 kernel-density plots for the population distribution of these parameters.

If I get 7950 of them to be well mixed and 50 of them are kinda stuck… it doesn’t really change the population-level estimates of interest, and these things have long tails, so it would be surprising if none out of 8000 got stuck…

A similar thing occurred with the last analysis, which was a survey analysis of financial expenses for 2500 microdata regions. Again, a bunch of nuisance parameters that only matter because they lead to a population estimate within each region.

Typically in these problems what happens is that my step size crashes down, the initial 10 or 20 iterations finally make it to the typical set after 160,000 gradient evals and an hour of computing, and then I max out my treedepth of 14 for every iteration after that until it hits my requested iterations and stops. The mass matrix NEVER winds up different from 1 on the diagonal, so adaptation isn’t even really occurring.

I find the whole thing confusing. But especially for model building… where VB is not good enough to test the model, limiting the treedepth may get me to the point where I can assess the quality of the model an order of magnitude or two faster than long-treedepth exploration. For models I don’t necessarily believe yet, it makes no sense to me to do a whole day of computing, only to find out that I should have realized I needed to add a parameter describing effect foo… or collapse individual-level parameters down to population-average ones, and run the whole thing again.

Once I get a model I believe, I can certainly see that re-running with a very long treedepth might make sense to avoid anything screwy occurring. But I still don’t understand why my treedepth so often blows out and my step size crashes… maybe I just like to fit wacky-hard models (actually, yes, I know this is true).

It seems like there’s a definite tension between the needs of the adaptation phase and the needs of the estimation phase.

Suppose that adaptation is really important because there are large differences in scale between various parameters. For example, some simplex might have an entry that is 0.01 ± 0.00001, and yet some parameter in terms of, say, dollars as a fraction of some important value might be normal(1, 3).

Now, if I set treedepth very high, then within a fixed wall time we will use up thousands of leapfrog steps just to get into the typical set, I won’t have the budget for many samples, and at best we do one stage of adaptation after hundreds of thousands of function/gradient evals. I basically don’t get any benefit from NUTS here.

Now suppose I set treedepth = 6 and a small adapt_window. I can now budget for, say, 400 or 1000 samples, each sample will require on the order of 100 times fewer function evals, and I can do several rounds of adaptation… but once I finally get into sampling mode with decent adaptation… I might well prefer to run a longer treedepth.

As it is, so far I’ve been using a long treedepth but only budgeting 100 or 200 samples, and the adaptation never does anything for me. After hours of computing, I get terrible results because things didn’t adapt right.

OK, further information of interest:

By setting the adaptation windows and things smaller, I get some adaptation, and Stan spits out an inverse mass matrix that is not all 1s.

But what is it? The entire vector is positive numbers, typically ranging from around 0.2 to 3.3, but with an occasional few dimensions having values like 100 or 300.

If I’m reading this correctly, the inverse mass matrix is 1/var(x), which means something like 300 implies an sd of 1/sqrt(300) ≈ 0.06, which could well be why it winds up with a step size of 0.002 or the like.

One entry was 4.6e-7, meaning the sd of this variable was estimated at about 1500?

In the case where no adaptation was occurring, because the windows weren’t big enough etc., I could easily see that if we’re taking step sizes of, say, 0.002 at velocities O(1) and need to travel a distance of 1500 in one dimension… well, obviously that would take on the order of a million steps. Getting some adaptation here is obviously critical. Cutting the treedepth down and reducing the adaptation windows etc. helps to get adaptation, but it doesn’t necessarily do what I want in the main sampling portion.

I think, ideally, I’d do a run that was all warmup, read off the adapted metric and the step sizes, use the adaptation from all chains to get a single estimate of the metric and step size, then plug those in, take the last warmup draw as the starting point, and continue from there. Obviously this won’t be easily doable until we can reuse mass matrices.
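For what it’s worth, a sketch of what that two-stage workflow could look like in an interface that does let you pass the metric and step size back in (CmdStanPy shown purely as an assumption; the argument names and the pooling across chains are mine, and passing the last warmup draws back in as inits is omitted):

```python
# Sketch of warmup-then-reuse, assuming CmdStanPy's step_size / metric /
# adapt_engaged arguments; file names are placeholders.
import numpy as np
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="mymodel.stan")

# Stage 1: a run that is essentially all warmup, just to learn the adaptation.
warm = model.sample(data="mydata.json", chains=4,
                    iter_warmup=1000, iter_sampling=10)

# Pool the per-chain adapted quantities into single estimates.
inv_metric = warm.metric.mean(axis=0)        # diagonal inverse metric
step_size = float(np.mean(warm.step_size))

# Stage 2: sample with adaptation off, reusing the pooled metric and step size.
fit = model.sample(data="mydata.json", chains=4,
                   iter_warmup=0, iter_sampling=1000,
                   adapt_engaged=False,
                   step_size=step_size,
                   metric={"inv_metric": inv_metric.tolist()})
```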

How can I figure out which mass matrix entry corresponds to which variable? I have 8000 of them, so it’s not convenient to just look at the CSV files by eye. Does the CSV file report an entry for transformed parameters? For sampler parameters like the step size?

Actually, I may have been wrong about the 4.6e-7; that might have been the value of one of the parameters. It’s hard to read the file via less when each line has 8000 numbers in it.

They’re in the same order as the parameter dump in the header. They are the unconstrained parameters—no transforms or generated quantities, as those don’t come into play in the algorithm. This is a very important point—Stan is doing its sampling just over the unconstrained parameters—everything else is derived, including the constrained parameters. So the inverse mass matrix is giving you an estimate of the global posterior covariance.
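One way to dig that diagonal out of the CSV without squinting at 8000-number lines — a rough sketch, not from the reply above, assuming the usual CmdStan comment-line layout where the values sit on the line right after “Diagonal elements of inverse mass matrix:”. Note the caveat that follows from the point above: the metric lives on the unconstrained scale, so with lots of simplexes (each K-simplex contributes K−1 unconstrained dimensions) it has fewer entries than there are parameter columns, and a one-to-one match by position only works for unconstrained parameter types:

```python
# Sketch: pull the adapted diagonal inverse metric out of a CmdStan output
# CSV and flag extreme entries; "output_1.csv" is a placeholder.
def read_inverse_metric(csv_path):
    header, inv_metric = None, None
    with open(csv_path) as f:
        lines = f.readlines()
    for i, line in enumerate(lines):
        if line.startswith("#"):
            if "inverse mass matrix" in line:
                inv_metric = [float(x) for x in lines[i + 1].lstrip("# ").split(",")]
        elif header is None:
            header = line.strip().split(",")   # first non-comment line is the header
    params = [name for name in header if not name.endswith("__")]
    return params, inv_metric

params, inv_metric = read_inverse_metric("output_1.csv")
print(len(params), "parameter columns,", len(inv_metric), "metric entries")

# Rank entries (by unconstrained-parameter index) to spot extreme scales.
ranked = sorted(enumerate(inv_metric), key=lambda kv: kv[1])
print("smallest:", ranked[:5])
print("largest:", ranked[-5:])
```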

This is a good point. We should probably roll something like this into the algorithm itself. I find starting with a small step size helps (the stepsize control parameter to the sampling algorithms is just the initial step size if there’s adaptation).

Bob, currently we pre-specify particular iteration counts, particular step sizes, particular adaptation windows, and all that. That’s better than nothing in terms of tunability, but I do think there’s room for specifying things that are more directly our goal. Maybe even a decision-theory-based specification of what to do:

There’s always a trade-off in play between time taken and effective sample size. When I’m debugging models, I might be very happy with a mildly biased effective sample size of 5, but I want it immediately so I can figure out whether I did something stupid. Then, when I have a debugged model, I want a completely unbiased effective sample size that’s as large as I can get, but the value of more samples probably increases only like sqrt(N), while the cost I’d put on wall-clock time increases nonlinearly… for example, if I’m in a situation where I want to start my simulation, go home overnight, and come back in the morning to see what I got, the cost of running until 8 am tomorrow might be very little, but the cost per hour after that would be very high…

Ideally we could have some simple syntax to specify these things, and then you could tell Stan to “go and do what I want,” and it would adapt its behavior to give you the optimal tradeoff between wall-clock time and effective sample size…

Obviously, for people who are using Stan on small 8-schools-type models, this is all irrelevant: compiling the Stan model takes 8 times as long as sampling, which finishes before you can get up to get coffee… but if you’re working on some high-dimensional model with hundreds of thousands of survey data points, or something based on an ODE solver… then taking the time to express the tradeoffs and having Stan do something smart can make a lot of sense.

@betanalpha: increasing the treedepth obviously lets the trajectory go farther and decreases serial correlation, increasing effective sample size per iteration. On the other hand, in the kinds of models where I’m seeing this behavior, the number of gradient evals per iteration is essentially 2^D for treedepth D (I haven’t tried treedepth beyond about 15, but it will saturate the treedepth all the way up to 15), so unless the effective sample size per iteration also increases exponentially, the effective sample size per unit of wall-clock time decreases.
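To put rough numbers on that (consistent with the n_leapfrog__ column in the trace further down), a saturated tree of depth D costs 2^D − 1 leapfrog steps, each of which is one gradient evaluation:

```python
# Back-of-the-envelope: a saturated tree of depth D takes 2**D - 1 leapfrog
# steps, i.e. 2**D - 1 gradient evaluations per iteration.
for depth in (9, 14):
    print(f"treedepth {depth}: {2**depth - 1} gradient evals per iteration")
# treedepth 9:    511
# treedepth 14: 16383   (roughly 32x the cost per iteration)
```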

Obviously, with divergences it’s no good getting more bad samples, but that’s not necessarily the issue. More rounds of adaptation with a shorter treedepth actually seem to improve the dynamics, because the metric converges more quickly. Though this model does seem to have some issues and still produces divergences at lower adapt_delta.

This model is really kind of weird. It seems to be very highly identified in some sense, and this seems to make it hard to fit: it takes a long time to get out of the tails into the typical set, and then, once in the typical set, lots of the parameters are tightly identified and so have very small variances.

Here’s a history of the initial part of the latest run on one chain:

lp__ accept_stat__ stepsize__ treedepth__ n_leapfrog__ divergent__ energy__
-970133 1 0.0003125 4 15 0 1.65568e+06
-970133 0 0.219034 0 1 1 977046
-970133 0 0.0239747 0 1 1 977181
-832784 1 0.00144576 2 3 0 958642
-815724 1 0.0011682 4 15 0 839511
-798189 1 0.00108418 6 63 0 822672
-380986 1 0.00109763 6 63 0 805143
-267794 1 0.00117639 6 63 0 388006
-122322 1 0.00130997 6 63 0 274830
-74943.8 1 0.00149718 6 63 0 129316
-11485.1 1 0.00174177 7 127 0 81819.8
-1107.39 1 0.00205077 7 127 0 18464.5
5957.74 0.999999 0.00243382 7 127 0 8048.97
35838.4 1 0.0029029 7 127 0 1052.49
41161.2 1 0.00347242 7 127 0 -28897.1
51943.2 1 0.00415923 7 127 0 -34086.3
54819.5 0.99359 0.00498287 7 127 0 -44982.5
55331.6 0.999656 0.00585012 7 127 0 -47698.5

As you can see, lp__ starts at -970133 and heads up until it gets to 55000 or so, where it’s in the typical set. It’s kind of inconceivable to me that the probability density in the typical set is something like exp(1e6) times higher than at the starting point… yikes! If I tell it to use treedepth 10 or 15, it will literally take a couple of days to get from the starting point into the typical set, though it will happen in 2 or 3 fewer iterations.

Andrew’s been lobbying for the “run for 8 hours” mode for many years now. We keep saying we’ll add it with Stan 3. One thing we now have the infrastructure in place for is saving warmup and reusing it. That’ll let you take a warmup and then keep drawing from the posterior at what should be a decently well-estimated rate of effective sample size per unit time.

I’ve often wondered if that would be the case. For adaptation, we mainly need to learn about the (co)variance of the parameters.