Setting a minimum effective sample size

Is there currently a way in Stan to set a minimum effective sample size for the parameters being estimated? If not, do you think this could be a nice option to add when fitting with Stan?

Thanks,
Thomas

I assume you mean that you want to be able to tell Stan to keep running until every parameter has at least the minimum specified ESS. Is that right? We’ve had some discussions about similar ideas but it’s not currently possible unfortunately.

One problem with that would be that for some models (poorly written models, models that are hard to estimate, etc.) there would be no guarantee of reaching the minimum specified ESS in a finite amount of time, so it would have to be more complicated (e.g. also setting a maximum amount of time or number of iterations). There may be other issues too that I’m not remembering at the moment.

We’ve also had discussions about possibly telling Stan to run for a certain amount of time (instead of number of iterations) but that has some difficulties too.

Regarding getting enough effective samples, this should become somewhat easier when the interfaces (e.g. RStan, PyStan) allow passing in the mass matrix from a previous run. Then if you run a model and everything looks ok but you want bigger effective sample sizes you would be able to run the same model for additional iterations without having to start over. Last I heard we were pretty close to being able to offer this but it’s not in any of the interfaces yet.
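To make that concrete, here is a rough sketch of what continuing a run with a previously adapted mass matrix might look like from Python. The argument and property names (a CmdStanPy-style `metric`, `step_size`, `adapt_engaged`) are guesses at what such an interface could expose, not something available today:

```python
# Hypothetical sketch only: reuse adaptation info from a previous fit to get
# more draws without re-running warmup. The argument and property names
# (metric, step_size, adapt_engaged) are assumptions about a future interface.
from cmdstanpy import CmdStanModel

model = CmdStanModel(stan_file="model.stan")

# First run: full warmup + sampling.
fit1 = model.sample(data="data.json", chains=4,
                    iter_warmup=1000, iter_sampling=1000)

# ESS too small? Run again without adaptation, reusing the adapted
# step size and inverse mass matrix from the first run.
fit2 = model.sample(data="data.json", chains=4,
                    iter_warmup=0, iter_sampling=2000,
                    adapt_engaged=False,
                    step_size=fit1.step_size.tolist(),
                    metric={"inv_metric": fit1.metric[0].tolist()})
```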


I really like the idea of “run until all parameters have ESS > 100 or for n minutes, whichever comes first”. The mass matrix stuff makes it possible to do this at the interface level, though, and that might be a better place to do it anyway.


Thanks, Jonah, for the detailed response. Yes, there are some caveats that I have not put much thought into myself, but it’s interesting to see your point of view.

In terms of the mass matrix, I think this is a great addition to Stan. I am hoping that when it is implemented in, say, RStan, it can load the mass matrix from the sample CSV files rather than loading the samples into memory.

One thing I forgot to mention is that since each Markov chain takes a different amount of time to get to the same number of iterations, any rule based on an amount of time (e.g. run until ESS of 1000 or X minutes, whichever comes first) could result in a different number of draws per chain. That’s not necessarily a problem, but most of the code in the interfaces (e.g. rstan) and post-processing packages (e.g. shinystan, bayesplot, loo) assumes an equal number of iterations per chain. So we would have to make everything compatible with ragged structures. That wouldn’t necessarily be so difficult, it would just take time.

A few decision rules like “run until min ESS, then equalise chains, while time x is not exceeded” would be very helpful and, on the face of it, not too complex…

Yes, of course, being compatible with other packages is important.

One could hack it together with subprocess (Python) + CmdStan + following the CSV outputs (Python), then just kill the process when it’s “ready”. Essentially you’d need multiprocessing (Python) to run the CSV-monitoring algorithm and the subprocess in parallel.
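Something like this rough sketch, assuming a compiled CmdStan model binary and standard CmdStan command-line arguments; the single-chain setup and the very crude ESS estimate are placeholders just to show the shape of the hack:

```python
# Rough sketch of the hack: launch CmdStan via subprocess, poll its CSV
# output, and terminate once a crude ESS estimate clears the target.
# Paths, the single chain, and the simplistic ESS estimate are illustrative only.
import subprocess
import time

import numpy as np
import pandas as pd

TARGET_ESS = 100
CSV_FILE = "output.csv"

proc = subprocess.Popen(
    ["./my_model", "sample", "num_warmup=1000", "num_samples=100000",
     "data", "file=data.json", "output", f"file={CSV_FILE}"])

def crude_ess(x):
    """Very rough single-chain ESS from the lag-1 autocorrelation."""
    x = x - x.mean()
    if x.size < 10 or x.std() == 0:
        return 0.0
    rho1 = np.corrcoef(x[:-1], x[1:])[0, 1]
    return x.size * (1 - rho1) / (1 + rho1)

while proc.poll() is None:
    time.sleep(30)
    try:
        draws = pd.read_csv(CSV_FILE, comment="#")
    except Exception:
        continue  # file not ready or mid-write
    # Drop sampler diagnostics (lp__, accept_stat__, ...) and keep parameters.
    params = draws.drop(columns=[c for c in draws.columns if c.endswith("__")],
                        errors="ignore")
    if len(params) == 0:
        continue
    if all(crude_ess(params[c].to_numpy()) >= TARGET_ESS for c in params.columns):
        proc.terminate()  # "ready": stop sampling
        break
```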

We’ve put a lot of thought into this, but haven’t come up with any good ideas for “run until ESS > N”. We’re not really even sure how to most effectively run for time t.

Do you have an implementation in mind?

The ESS calculations are expensive as they stand now. Stan uses split R-hat to calculate the variance, then combines that with an FFT over each chain. The split R-hat makes it impossible to compute R-hat online, and I don’t know if the FFT can be done online.
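To show what I mean, here’s a minimal sketch of just the split-R-hat part (leaving out the FFT autocorrelation step used for ESS); this is the standard between/within-chain variance formula, not our actual implementation:

```python
import numpy as np

def split_rhat(draws):
    """draws: array of shape (num_chains, num_iterations) for one parameter.
    Split each chain in half, then compare between- and within-chain variance."""
    chains, n = draws.shape
    half = n // 2
    # Splitting doubles the chain count; the split point moves as more draws
    # arrive, which is what gets in the way of a purely online update.
    split = draws[:, :2 * half].reshape(2 * chains, half)
    chain_means = split.mean(axis=1)
    chain_vars = split.var(axis=1, ddof=1)
    w = chain_vars.mean()                  # within-chain variance
    b = half * chain_means.var(ddof=1)     # between-chain variance
    var_plus = (half - 1) / half * w + b / half
    return np.sqrt(var_plus / w)
```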

We also don’t know how long to run warmup; what we’d want to do is terminate warmup as soon as the post-warmup chains mix well and all chains have converged to roughly the same adaptation parameters. With lots of parallelism, we’d have many more options. But what if we’re on a notebook with two cores? We now have to interleave multiple chains over fewer cores.

See my response to @dlakelan above.

If you have a concrete algorithm to suggest, we’d be very eager to hear about it.

How expensive are you talking, Bob? Something substantially more expensive than the summary function in RStan? For something that takes hours or days, collecting the chains and running that calculation every x iterations doesn’t strike me as significant. But yes, I’m just a naive outsider! ;) Once the ability to pre-specify the mass matrix is finalised / exposed, I’d be happy to do some experimenting with such things in R…

Yes, we’d need to do the expensive computations of the summary function every iteration (just R-hat and n_eff—we wouldn’t need to compute means, MCMC standard errors, or quantiles). This will take much longer than the iterations themselves in most cases, though I could see it being useful for jobs that take days and are really slow per iteration for reasons other than having gazillions of parameters.

If we only measured R-hat, say, and we were willing to use the non-split version, that we could calculate online relatively efficiently. But I don’t know how we could do effective sample size; maybe there’s an approximate online algorithm for that, too.
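For example, an online non-split R-hat could keep only running per-chain means and variances (Welford-style accumulators) rather than the draws themselves; something like this sketch, which is just an illustration of the idea, not existing code:

```python
import math

class OnlineRhat:
    """Non-split R-hat from running per-chain means/variances (Welford updates).
    Only O(num_chains) state is kept; the draws themselves are discarded."""
    def __init__(self, num_chains):
        self.n = [0] * num_chains
        self.mean = [0.0] * num_chains
        self.m2 = [0.0] * num_chains   # running sum of squared deviations

    def update(self, chain, x):
        self.n[chain] += 1
        delta = x - self.mean[chain]
        self.mean[chain] += delta / self.n[chain]
        self.m2[chain] += delta * (x - self.mean[chain])

    def rhat(self):
        m = len(self.n)
        n = min(self.n)                # assume roughly equal chain lengths
        if n < 2:
            return float("nan")
        w = sum(m2 / (ni - 1) for m2, ni in zip(self.m2, self.n)) / m
        grand = sum(self.mean) / m
        b = n * sum((mu - grand) ** 2 for mu in self.mean) / (m - 1)
        var_plus = (n - 1) / n * w + b / n
        return math.sqrt(var_plus / w)
```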

None of this has to be exact for it to be useful.

I think this is the key. From a practical perspective, the reason to do an ESS / time limit is because you expect things to take a while, say more than 5 minutes. So from this perspective you could probably make something like 3 minutes be the minimum time that this algorithm would accept as input. This makes sense because in many models I have, it takes 30 seconds to 2 minutes to compile the thing anyway. The people who really want this feature are the ones who are looking at tens of minutes to tens of hours or more.

Second, how bad is it really if you overrun the ESS requested by a bit? Not really that bad. I mean, if I ask for ESS = 100 and you give me 131, I’m not going to complain unless that 31 took 20 minutes to several hours to generate. Even then, if without the ESS/time feature I’d have asked for, say, 1000 iterations because I have no idea what my ESS is going to be, and you stop at ESS 131 after 500 iterations, then even though I had to wait for those 31 extra, the truth is I just saved a day of computing.

So I think one of the things to do if you try to do this is to be very explicit about the approximate heuristic nature of this calculation so that you manage the expectations of the user.

I’m not sure how best to tune the warmup time, so let’s assume that problem is solved somehow. Next we need to figure out how long to run the iterations. I suggest first looking at the time taken per iteration and figuring out how much budget there is for iterations. So, as each iteration completes you keep track of how much time it took, then after a few iterations (post warmup) have completed you extrapolate linearly to figure out how many samples you can get by the end of your time budget. If this is within a factor of 1.5 or 2 of the requested ESS, then just run to the time budget.

If the time budget exceeds 2 or 3x the iteration count, you now need to estimate how many ESS you’re getting per iteration. Run until you get iter = ESS requested. You can’t get ESS more than the iteration count, so you certainly never need to do the ESS calculation for the first batch of samples. Then after you hit that ESS count, do one estimate of the ESS with your Fourier transform etc. Now use iter/ESS(iter) * ESS_requested to estimate the iterations you need. If this exceeds the time budget, run to the time budget. If this doesn’t exceed the time budget, run to that iteration and re-do the ESS calculation. If I ask for ESS=100 you’re basically doing one ESS calculation every O(100) iterations at most.

In most cases you’d probably do the ESS calculation 2 or 3 times per run at most until you zeroed in on the actual iter you need to get the ESS requested.
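Here’s roughly what I have in mind as a control loop, in sketch form; `run_one_iteration`, `current_draws`, and `ess_fft` are placeholders for whatever hooks the sampler and the existing ESS routine would expose, not real Stan/CmdStan APIs:

```python
import time

def run_until_ess(ess_target, time_budget_s,
                  run_one_iteration, current_draws, ess_fft):
    """Heuristic controller: run until estimated ESS >= ess_target or the
    time budget runs out, re-estimating ESS only occasionally.
    run_one_iteration(), current_draws(), and ess_fft(draws) are placeholder
    sampler hooks, not real Stan/CmdStan APIs."""
    start = time.monotonic()
    iters = 0
    iter_times = []

    def time_left():
        return time_budget_s - (time.monotonic() - start)

    # Phase 1: ESS can never exceed the iteration count, so no ESS
    # calculation is needed until we have at least ess_target iterations.
    next_check = ess_target
    while time_left() > 0:
        t0 = time.monotonic()
        run_one_iteration()
        iter_times.append(time.monotonic() - t0)
        iters += 1

        if iters < next_check:
            continue

        # Phase 2: one (expensive) ESS estimate, then extrapolate.
        ess = ess_fft(current_draws())
        if ess >= ess_target:
            return "ess reached", iters
        # iters / ess draws are needed per effective draw, so scale up.
        needed = int(iters / max(ess, 1.0) * ess_target)
        avg_time = sum(iter_times) / len(iter_times)
        affordable = iters + int(time_left() / avg_time)
        next_check = min(needed, affordable)

    return "time budget reached", iters
```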


That’s exactly why we’ve been thinking about it.

Tuning

That’s the harder problem. I’m not sure how to do it either. With parallelism, wall time can be greatly reduced. But that may defeat the purpose here.

This is tricky as there can be a huge amount of variation in time per iteration based on where you’re at in the parameter space. But we can certainly estimate means and variances and get some statistical handle on what to expect.

Keep in mind we are going to want to do this with multiple chains. That gives you some handle on monitoring convergence of warmup if you’re willing to think about cross-chain communication (not something Stan currently supports).

As I tell everyone, if you can figure this out, we take pull requests!