I am using Stan for estimation of some models on my university’s supercomputer. The supercomputer only supports rstan as an interface (cmdstanr is not installed on the server).
The problem is that the maximum run time I can request on the supercomputer is 10 days. My current model timed out, and judging from the sampling speed I have seen so far, I think the estimation would need around 15 days.
I was wondering whether it is at all possible to break the run into several pieces to fit within the supercomputer's time limit. Can I first run the estimation with fewer iterations in the first 10 days, and then somehow pass the resulting posterior object back to Stan and run a new job on the supercomputer for the remaining iterations?
I hope I explained the matter clearly, but if not, please feel free to ask questions so that I can explain it more thoroughly. Any help would be appreciated.
I’ll preface this with the caveat that it might well be possible to find sufficient speedup by optimizing your Stan code to avoid your problem entirely. I know you’ve recently developed some fairly well-optimized code (e.g. in the thread "Help with vectorizing for loops"), but also that you’ve asked several more recent questions. If you can avoid this problem by optimizing your code, do that!
Additionally, for what it’s worth, I’ve always managed to convince university clusters to install cmdstan, and in your position I would pursue that until it became obvious that it wasn’t going to happen.
With that said, you can in general break down the estimation into several shorter runs, but only after warmup is complete. Once warmup is complete, you just need to extract the step-size, the inverse metric, and the last iteration, and you can re-start sampling with warmup turned off, explicitly passing the last iteration as inits, and the step-size and inverse metric as algorithmic parameters.
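To make the "extract the step-size and the inverse metric" step a bit more concrete, here is a minimal sketch in Python that pulls both out of the adaptation comment block Stan writes at the end of warmup. It assumes the default diagonal metric and the usual comment layout found in Stan CSV output (as far as I know, rstan's get_adaptation_info() returns the same text). The function name parse_adaptation and the sample text are hypothetical, for illustration only.

```python
def parse_adaptation(text):
    """Return (step_size, inv_metric_diag) parsed from Stan's adaptation comments.

    Assumes a diagonal metric and the layout:
        # Step size = <value>
        # Diagonal elements of inverse mass matrix:
        # <v1>, <v2>, ...
    Either element is None if the corresponding line is missing.
    """
    # Drop the leading '#' and surrounding whitespace from every line.
    lines = [ln.lstrip("#").strip() for ln in text.splitlines()]
    step_size = None
    inv_metric = None
    for i, ln in enumerate(lines):
        if ln.startswith("Step size"):
            step_size = float(ln.split("=", 1)[1])
        elif ln.startswith("Diagonal elements of inverse mass matrix"):
            # The values sit on the line right after the header.
            if i + 1 < len(lines) and lines[i + 1]:
                inv_metric = [float(x) for x in lines[i + 1].split(",")]
    return step_size, inv_metric


# Hypothetical adaptation block, in the layout assumed above.
sample = (
    "# Adaptation terminated\n"
    "# Step size = 0.809818\n"
    "# Diagonal elements of inverse mass matrix:\n"
    "# 0.961989, 1.5\n"
)
step_size, inv_metric = parse_adaptation(sample)
```

The two values, together with the last draw of each chain, are what you would hand to the follow-up run with adaptation switched off. Each chain has its own step size and metric, so parse them per chain.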
Note that because treedepths are typically much deeper early in warmup than later during sampling, it is not impossible that a chain that takes 15 days to complete spends 10 days in warmup. If you cannot fit warmup into your time limit, then things get trickier, but can still be made to work. The trick is to run a shorter warmup that does fit into the time limit, followed by one sampling iteration. Then, extract the inverse metric, the step-size, and the posterior sample, and pass these back to a new run with warmup still turned on, but with a longer adapt_window. To get this right, you’ll need to understand how the windowed phase of adaptation works and choose a reasonable adapt_window based on how many warmup iterations you’ve been able to run so far.
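For anyone who wants to see how the windowed phase is laid out, here is a rough Python sketch of the schedule as I understand it: an initial fast buffer, then a series of slow windows that double in size (with the last one stretched to reach the final buffer), then a terminal fast buffer. The defaults used below (init_buffer 75, term_buffer 50, window 25) are Stan's documented defaults. The function name slow_windows is made up for this example, and this is an approximation of the logic, not Stan's actual code.

```python
def slow_windows(num_warmup, init_buffer=75, term_buffer=50, base_window=25):
    """Return inclusive (start, end) warmup-iteration indices of the slow windows.

    The metric is re-estimated at the end of each returned window.
    """
    if num_warmup < 20:
        return []  # too short for windowed adaptation
    if init_buffer + base_window + term_buffer > num_warmup:
        # Buffers do not fit: fall back to a 15% / 75% / 10% split.
        init_buffer = int(0.15 * num_warmup)
        term_buffer = int(0.10 * num_warmup)
        base_window = num_warmup - (init_buffer + term_buffer)

    last = num_warmup - term_buffer - 1  # final iteration of the slow phase
    windows = []
    start = init_buffer
    size = base_window
    end = init_buffer + base_window - 1
    while True:
        windows.append((start, end))
        if end == last:
            break
        start = end + 1
        size *= 2                 # each window is twice as long as the previous one
        end = start - 1 + size
        # If the window after this one would not fit before the terminal
        # buffer, stretch this window to the end of the slow phase.
        if end != last and end + 2 * size >= num_warmup - term_buffer:
            end = last
    return windows


schedule = slow_windows(1000)
# schedule == [(75, 99), (100, 149), (150, 249), (250, 449), (450, 949)]
```

With 1000 warmup iterations and the defaults, the slow windows have lengths 25, 50, 100, 200 and 500. So if the first job only got through, say, 250 warmup iterations, the most recent metric estimate came from a window of 100 iterations, which gives a sense of how large a window is sensible when resuming.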
Thank you very much for your detailed response, and sorry for my delay in responding; I got caught up in some things. Your answer is very helpful. Fortunately, my warmup fits into the 10 days, so I can just go with the solution you provided in your third paragraph. However, I have additional problems that I will ask a separate question about.