Saving state and restarting sampling on a cluster computer with strict walltimes

I’m running lots of rstan jobs on a Linux cluster. As in most HPC environments, the administrators prefer short jobs, i.e. with walltimes < 12hr, and rarely allow longer walltimes. If you have a potentially long job, the recommended strategy is often to break it into a series of smaller jobs, with each continuing from the previous endpoint. Is there any way in rstan to save the state of the sampler intermittently, and to restart sampling from a previously saved job? Sorry if I missed this somewhere in the manual & docs – I couldn’t find anything.

BTW - thanks for wonderful work with stan and rstan – it’s great!

Operating System: CentOS release 6.3 (Final)
Interface Version: rstan 2.16.2
Output of writeLines(readLines(file.path(Sys.getenv("HOME"), ".R/Makevars"))):
CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function
CXX=clang++ -ftemplate-depth-256
CC=clang

Output of devtools::session_info("rstan"):

devtools::session_info("rstan")
Session info ------------------------------------------------------------------
setting value
version R version 3.3.1 (2016-06-21)
system x86_64, linux-gnu
ui X11
language (EN)
collate en_US.UTF-8
tz
date 2017-09-20

Packages ----------------------------------------------------------------------
package * version date source
BH 1.62.0-1 2016-11-19 CRAN (R 3.3.1)
colorspace 1.2-7 2016-10-11 CRAN (R 3.3.1)
dichromat 2.0-0 2013-01-24 CRAN (R 3.3.1)
digest 0.6.12 2017-01-27 CRAN (R 3.3.1)
ggplot2 2.1.0 2016-03-01 CRAN (R 3.3.1)
graphics * 3.3.1 2016-10-24 local
grDevices * 3.3.1 2016-10-24 local
grid 3.3.1 2016-10-24 local
gridExtra 2.2.1 2016-02-29 CRAN (R 3.3.1)
gtable 0.2.0 2016-02-26 CRAN (R 3.3.1)
inline 0.3.14 2015-04-13 CRAN (R 3.3.1)
labeling 0.3 2014-08-23 CRAN (R 3.3.1)
lattice 0.20-33 2015-07-14 CRAN (R 3.3.1)
magrittr 1.5 2014-11-22 CRAN (R 3.3.1)
MASS 7.3-45 2016-04-21 CRAN (R 3.3.1)
Matrix 1.2-6 2016-05-02 CRAN (R 3.3.1)
methods * 3.3.1 2016-10-24 local
munsell 0.4.3 2016-02-13 CRAN (R 3.3.1)
plyr 1.8.4 2016-06-08 CRAN (R 3.3.1)
RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.3.1)
Rcpp 0.12.7 2016-09-05 CRAN (R 3.3.1)
RcppEigen 0.3.3.3.0 2017-05-01 CRAN (R 3.3.1)
reshape2 1.4.2 2016-10-22 CRAN (R 3.3.1)
rstan 2.16.2 2017-07-03 CRAN (R 3.3.1)
scales 0.4.0 2016-02-26 CRAN (R 3.3.1)
StanHeaders 2.16.0-1 2017-07-03 CRAN (R 3.3.1)
stats * 3.3.1 2016-10-24 local
stats4 3.3.1 2016-10-24 local
stringi 1.1.2 2016-10-01 CRAN (R 3.3.1)
stringr 1.2.0 2017-02-18 CRAN (R 3.3.1)
tools 3.3.1 2016-10-24 local
utils * 3.3.1 2016-10-24 local

It isn’t possible in Stan yet, but I know the developers are working on it. In the meantime, you might look into a more generic process checkpointing solution like:

No

Thanks, aaronjig, interesting idea. Have you tried this with rstan?

Not yet, but it should be available soon.

Thanks, everyone, for the responses. Glad to hear that this feature is in development!

Just checking in on this. Has saving state been implemented in Stan yet? Has it been implemented in brms as well? If so, could someone please point me to the reference documentation for it? I have not found it.

Thank you.

Not in rstan.

I forget, what are we waiting on for this? @Bob_Carpenter’s comment above from 2017 suggested that we were close to having this a few years ago. Did something derail this?

Didn’t @bbbales2 do something like this with campfire?

@bbbales2, does campfire do this? If so, how did you get that working?

Hmm, campfire doesn’t do stopping based on runtime, and it’s also a super experimental developmental thing that’ll probably break at awkward times if anyone used it regularly.

I think this was the last thread on checkpointing: "Current state of checkpointing in Stan"

I agree this would be valuable, but I don’t know of anything that does it now.

So is this everything we need?

  • inverse metric
  • adapted step size
  • the last draw before it stopped, to use as initial values

If so then this should already be doable with CmdStanR, right?
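As a rough illustration of where those three pieces live, here is a minimal Python sketch that pulls them out of the text of one chain's CmdStan output CSV. It assumes the default diagonal metric (`diag_e`), where the adapted values appear in the comment lines `# Step size = ...` and `# Diagonal elements of inverse mass matrix:`. The function name `read_resume_state` and the tiny sample text are made up for illustration and are not real sampler output.

```python
import csv
import io


def read_resume_state(csv_text):
    """Return step size, diagonal inverse metric, and last complete draw of one chain."""
    step_size = None
    inv_metric = None
    expect_metric = False  # True right after the "Diagonal elements" comment line
    data_lines = []
    for line in csv_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            body = stripped.lstrip("#").strip()
            if expect_metric and body:
                inv_metric = [float(x) for x in body.split(",")]
                expect_metric = False
            elif body.startswith("Step size"):
                step_size = float(body.split("=")[1])
            elif body.startswith("Diagonal elements of inverse mass matrix"):
                expect_metric = True
            continue
        data_lines.append(stripped)
    if step_size is None or inv_metric is None:
        raise ValueError("no adaptation block found; the chain may have stopped during warmup")
    rows = list(csv.reader(io.StringIO("\n".join(data_lines))))
    if len(rows) < 2:
        raise ValueError("no draws found")
    header = rows[0]
    # A killed job can leave a half-written final row, so keep only rows with
    # the full number of columns. A row cut inside its last value would still
    # pass this check; a stricter test is left out of this sketch.
    complete = [row for row in rows[1:] if len(row) == len(header)]
    if not complete:
        raise ValueError("no complete draws found")
    # Columns ending in "__" (lp__, stepsize__, ...) are sampler diagnostics.
    # Transformed parameters and generated quantities also appear as columns;
    # only the model's parameters are needed as initial values.
    last_draw = {
        name: float(value)
        for name, value in zip(header, complete[-1])
        if not name.endswith("__")
    }
    return {"step_size": step_size, "inv_metric": inv_metric, "last_draw": last_draw}


# Hand-written illustrative input; the last row is cut off on purpose.
sample = """lp__,accept_stat__,stepsize__,treedepth__,n_leapfrog__,divergent__,energy__,mu,sigma
# Adaptation terminated
# Step size = 0.5
# Diagonal elements of inverse mass matrix:
# 1.5, 0.25
-7.1,0.9,0.5,2,3,0,8.0,0.1,1.2
-6.9,0.95,0.5,2,3,0,7.5,0.3,1.1
-6.8,0.9
"""
state = read_resume_state(sample)
print(state)
```

The recovered values would then be handed to a follow-up run with warmup and adaptation switched off. How they are passed depends on the interface in use.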

If your calculation is killed after warmup, then yeah, ostensibly, if you have those things you can rebuild everything.

If the calculation got killed in warmup, then there’s more state you need to recover to restart it. We could probably reverse engineer this around cmdstan, but my instinct would be to put it in cmdstan.
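One way to tell the two cases apart after the fact is to look at the chain's output CSV. The sketch below assumes the first run used adaptation, in which case CmdStan writes a `# Adaptation terminated` comment once warmup ends, so its absence suggests the job died during warmup. The function name `chain_progress` and the sample strings are hypothetical.

```python
def chain_progress(csv_text):
    """Return (warmup_finished, n_post_warmup_draws) for one chain's CSV text."""
    n_columns = None
    warmup_finished = False
    n_draws = 0
    for line in csv_text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            if "Adaptation terminated" in stripped:
                warmup_finished = True
            continue
        fields = stripped.split(",")
        if n_columns is None:
            n_columns = len(fields)  # first non-comment line is the header
        elif warmup_finished and len(fields) == n_columns:
            n_draws += 1  # count only complete rows written after warmup
    return warmup_finished, n_draws


# Hand-written illustrative inputs, not real sampler output.
killed_in_warmup = "lp__,mu\n-3.2,0.4\n-3.0,0.5\n"
killed_in_sampling = (
    "lp__,mu\n"
    "# Adaptation terminated\n"
    "# Step size = 0.9\n"
    "# Diagonal elements of inverse mass matrix:\n"
    "# 1.1\n"
    "-2.9,0.6\n"
    "-2.8,0.7\n"
    "-2.7\n"
)
print(chain_progress(killed_in_warmup))
print(chain_progress(killed_in_sampling))
```

If warmup did not finish, the adapted step size and metric are not in the file, so the simple restart from saved values does not apply and warmup would have to be rerun.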

Yeah, you’re right about warmup. I had just been thinking about the post-warmup case. I agree it would be better to put this in CmdStan, although if that seems like it’s not going to be implemented for a long time, we could make a tutorial about how to do it manually with CmdStanR.

Yeah, that would be appropriate. If you want help, hit me up and we can pair program it out. Shouldn’t take too long.

I think if I did it on my own I’d just get bored and never finish it though.

Haha, I was thinking the same thing. Let’s tackle it together sometime.

I am curious, as @KAtkin was, because I am working in an environment where the power supply may be shut down and no UPS is available (though I am trying to get one).
Just wondering: has it been implemented in brms, rstan, or cmdstanr?

Thanks

Yes, see here for cmdstanr code.
