Saving state and restarting sampling on cluster computer with strict walltimes

dfalster · September 19, 2017, 11:57pm

I’m running lots of rstan jobs on a Linux cluster. As with most HPC environments, the administrators prefer short jobs, i.e. with walltimes < 12hr, and rarely allow longer walltimes. If you have a potentially long job, the recommended strategy is often to break it into a series of smaller jobs, with each continuing from the previous endpoint. Is there any way in rstan to save the state of the sampler intermittently, and to restart sampling from a previously saved state? Sorry if I missed this somewhere in the manual and docs; I couldn’t find anything.

BTW - thanks for wonderful work with stan and rstan – it’s great!

Operating System: CentOS release 6.3 (Final)
Interface Version: rstan 2.16.2
Output of writeLines(readLines(file.path(Sys.getenv("HOME"), ".R/Makevars"))):
CXXFLAGS=-O3 -mtune=native -march=native -Wno-unused-variable -Wno-unused-function
CXX=clang++ -ftemplate-depth-256
CC=clang

Output of devtools::session_info("rstan"):

devtools::session_info("rstan")
Session info ------------------------------------------------------------------
setting value
version R version 3.3.1 (2016-06-21)
system x86_64, linux-gnu
ui X11
language (EN)
collate en_US.UTF-8
tz
date 2017-09-20

Packages ----------------------------------------------------------------------
package * version date source
BH 1.62.0-1 2016-11-19 CRAN (R 3.3.1)
colorspace 1.2-7 2016-10-11 CRAN (R 3.3.1)
dichromat 2.0-0 2013-01-24 CRAN (R 3.3.1)
digest 0.6.12 2017-01-27 CRAN (R 3.3.1)
ggplot2 2.1.0 2016-03-01 CRAN (R 3.3.1)
graphics * 3.3.1 2016-10-24 local
grDevices * 3.3.1 2016-10-24 local
grid 3.3.1 2016-10-24 local
gridExtra 2.2.1 2016-02-29 CRAN (R 3.3.1)
gtable 0.2.0 2016-02-26 CRAN (R 3.3.1)
inline 0.3.14 2015-04-13 CRAN (R 3.3.1)
labeling 0.3 2014-08-23 CRAN (R 3.3.1)
lattice 0.20-33 2015-07-14 CRAN (R 3.3.1)
magrittr 1.5 2014-11-22 CRAN (R 3.3.1)
MASS 7.3-45 2016-04-21 CRAN (R 3.3.1)
Matrix 1.2-6 2016-05-02 CRAN (R 3.3.1)
methods * 3.3.1 2016-10-24 local
munsell 0.4.3 2016-02-13 CRAN (R 3.3.1)
plyr 1.8.4 2016-06-08 CRAN (R 3.3.1)
RColorBrewer 1.1-2 2014-12-07 CRAN (R 3.3.1)
Rcpp 0.12.7 2016-09-05 CRAN (R 3.3.1)
RcppEigen 0.3.3.3.0 2017-05-01 CRAN (R 3.3.1)
reshape2 1.4.2 2016-10-22 CRAN (R 3.3.1)
rstan 2.16.2 2017-07-03 CRAN (R 3.3.1)
scales 0.4.0 2016-02-26 CRAN (R 3.3.1)
StanHeaders 2.16.0-1 2017-07-03 CRAN (R 3.3.1)
stats * 3.3.1 2016-10-24 local
stats4 3.3.1 2016-10-24 local
stringi 1.1.2 2016-10-01 CRAN (R 3.3.1)
stringr 1.2.0 2017-02-18 CRAN (R 3.3.1)
tools 3.3.1 2016-10-24 local
utils * 3.3.1 2016-10-24 local

aaronjg · September 20, 2017, 12:28am

It isn’t possible in Stan yet, but I know the developers are working on it. In the meantime, you might look into a more generic process checkpointing solution like:

bgoodri · September 20, 2017, 1:00am

No

dfalster · September 20, 2017, 10:16pm

Thanks aaronjg, interesting idea. Have you tried this with rstan?

Bob_Carpenter · October 2, 2017, 6:47pm

Not yet, but it should be available soon.

dfalster · October 8, 2017, 10:03pm

Thanks everyone for the response. Glad to hear that this feature is in dev!

KAtkin · September 4, 2020, 9:51pm

Just checking in on this. Has saving state been implemented in Stan yet? Has it been implemented in brms as well? If so, could someone please point me to the reference documentation? I have not found it.

Thank you.

bgoodri · September 5, 2020, 5:37pm

Not in rstan.

jonah · September 5, 2020, 5:45pm

I forget, what are we waiting on for this? @Bob_Carpenter’s comment above from 2017 suggested that we were close to having this a few years ago. Did something derail this?

stevebronder · September 7, 2020, 8:53pm

Didn’t @bbbales2 do something like this with campfire?

jonah · September 10, 2020, 5:57pm

@bbbales2 does campfire do this? If so how did you get that working?

bbbales2 · September 10, 2020, 6:00pm

Hmm, campfire doesn’t do stopping based on runtime, and it’s also a super experimental developmental thing that’ll probably break at awkward times if anyone used it regularly.

I think this was the last thread on checkpointing: Current state of checkpointing in Stan

I agree this would be valuable, but I don’t know of anything that does it now.

jonah · September 10, 2020, 6:14pm

So is this everything we need?

inverse metric
adapted step size
last draw before it stopped, to use as initial values

If so then this should already be doable with CmdStanR, right?
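
For anyone reading along, the three items listed above could be wired together roughly as follows. This is only a minimal sketch, not an official checkpointing feature: `last_draw_as_init()`, `make_checkpoint()`, and `resume_chain()` are hypothetical helper names. It assumes one chain per job, the default diagonal metric, that warmup finished in the first job, and that the CmdStanR accessors `$inv_metric()` and `$metadata()$step_size_adaptation` are available in your installed version. It also assumes extra (non-parameter) variables in the initial values are ignored.

```r
# Base-R helper: turn one draw (a named numeric vector with CmdStan-style
# names such as "mu", "beta[2]", "Sigma[1,2]") into a named list of inits.
last_draw_as_init <- function(draw) {
  draw <- draw[!grepl("__$", names(draw))]        # drop lp__ and similar
  base <- sub("\\[.*$", "", names(draw))          # "beta[2]" -> "beta"
  groups <- split(seq_along(draw), factor(base, levels = unique(base)))
  lapply(groups, function(idx) {
    nms  <- names(draw)[idx]
    vals <- unname(draw[idx])
    if (!grepl("\\[", nms[1])) return(vals)       # scalar parameter
    index_strings <- sub("^.*\\[(.*)\\]$", "\\1", nms)
    index_matrix  <- do.call(rbind, lapply(strsplit(index_strings, ","), as.integer))
    out <- array(NA_real_, dim = apply(index_matrix, 2, max))
    out[index_matrix] <- vals                     # fill by (i, j, ...) index
    out
  })
}

# Hypothetical helper: collect what a later job needs to continue one chain.
# `fit` is assumed to be a single-chain CmdStanMCMC object. On later restarts
# pass the earlier checkpoint as `previous` so the adapted values from the
# first job are carried forward unchanged.
make_checkpoint <- function(fit, previous = NULL) {
  draws <- unclass(fit$draws(format = "draws_matrix"))
  list(
    inv_metric = if (is.null(previous)) fit$inv_metric(matrix = FALSE)[[1]]
                 else previous$inv_metric,
    step_size  = if (is.null(previous)) fit$metadata()$step_size_adaptation[1]
                 else previous$step_size,
    init       = last_draw_as_init(draws[nrow(draws), ])
  )
}

# Hypothetical helper: continue sampling with adaptation switched off.
resume_chain <- function(mod, data, checkpoint, iter_sampling = 1000) {
  mod$sample(
    data          = data,
    chains        = 1,
    init          = list(checkpoint$init),  # one list of inits per chain
    inv_metric    = checkpoint$inv_metric,
    step_size     = checkpoint$step_size,
    iter_warmup   = 0,
    adapt_engaged = FALSE,
    iter_sampling = iter_sampling
  )
}
```

In a job script, the idea would be to call `saveRDS(make_checkpoint(fit), "chain1.rds")` before the walltime runs out, then `readRDS()` that file in the next job and pass it to `resume_chain()`. The random number stream is not restored, so the resumed chain will not match an uninterrupted run draw for draw, and each job's draws need to be saved and combined separately.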

bbbales2 · September 10, 2020, 6:56pm

If your calculation is killed after warmup, then yeah, ostensibly if you have those things you can rebuild everything.

If the calculation got killed in warmup, then there’s more state you need to recover to restart it. We could probably reverse engineer this around cmdstan, but my instinct would be to put it in cmdstan.

jonah · September 10, 2020, 7:10pm

Yeah you’re right about warmup. I had just been thinking about after warmup. I agree it would be better to put this in CmdStan, although if that seems like it’s not going to be implemented for a long time we could make a tutorial about how to do it manually with CmdStanR.

bbbales2 · September 10, 2020, 7:34pm

Yeah, that would be appropriate. If you want help, hit me up and we can pair program it out. Shouldn’t take too long.

I think if I did it on my own I’d just get bored and never finish it though.

jonah · September 10, 2020, 7:43pm

Haha, I was thinking the same thing. Let’s tackle it together sometime.

Chuan-Peng_Hu · March 21, 2021, 4:19am

I am curious as @KAtkin was because I am working in an environment where the power supply may be shut down and no UPS is available (though I am trying to get one).
Just wondering, has it been implemented in brms, rstan, or cmdstanr?

Thanks

mike-lawrence · March 21, 2021, 7:20am

Yes, see here for cmdstanr code.
