I am running on Ubuntu 20.04 with R version 4.0.2 and rstan version 2.19.3 installed as a Debian package. I have been running rstan very successfully but have noticed that compiled rstan::stan_model objects are about 20MB each when saved as .rds files. Up until now I have not defined ~/.R/Makevars.
Before defining Makevars I was getting a different segfault with one model, which made me want to use Makevars. That was an “invalid permission” segfault.
Would you have an example/fake dataset that I can test the code with? If I can reproduce the segfault then it would be a code issue, but if not then there could be a configuration issue to track down.
Hi @harrelfe. The (un)serialization process is very brittle with Stan programs. Basically, if you are just working interactively, the best way to do it is to call
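(the call itself appears to have been lost from this post; based on the description that follows, it was presumably something along these lines)

```r
# Presumed call (not preserved in the original post): compiling from the
# .stan file with auto_write = TRUE makes rstan cache the compiled model
# as an .rds next to lrmconppo.stan and reuse it until the file changes.
m <- rstan::stan_model("lrmconppo.stan", auto_write = TRUE)
```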
so that it saves a good .rds file in the same directory as lrmconppo.stan and uses that as long as lrmconppo.stan does not change. But putting the compiled code into an R package for others to use is another kettle of fish.
I have had bad luck with auto_write=TRUE when writing long reports, with rstan frequently recompiling code when it hasn’t changed.
Up until now I have not had any problem with restoring rstan models with readRDS(). But the serialization problem you mentioned may explain the problem below.
Here is a self-contained test that works when I use the stan_model object directly but not when I write it and read it back in (segfault, memory not mapped). Earlier today I could not get the second model to fail unless the first model was included in the script before it.
require(rstan)
options(mc.cores = parallel::detectCores())

# Model 1: compile, round-trip through saveRDS/readRDS, then sample
dat1 <- readRDS(url('https://hbiostat.org/attach/dat1.rds'))
s1 <- readLines('https://raw.githubusercontent.com/harrelfe/stan/master/lrmppo.stan')
m1 <- stan_model(model_code=s1)
saveRDS(m1, '/tmp/m1.rds')
m1r <- readRDS('/tmp/m1.rds')
identical(m1, m1r) # FALSE
g <- rstan::sampling(m1r, data=dat1) # works with m1; segfaults with m1r

# Model 2: same round trip with a second model
d <- readRDS(url('https://hbiostat.org/attach/bcppodat.rds'))
s2 <- readLines('https://raw.githubusercontent.com/harrelfe/stan/master/lrmconppo.stan')
m2 <- stan_model(model_code=s2)
saveRDS(m2, '/tmp/m2.rds')
m2r <- readRDS('/tmp/m2.rds')
identical(m2, m2r) # FALSE
f <- rstan::sampling(m2r, data=d) # works with m2; segfaults with m2r
I ran this with the compiler flags in .R/Makevars listed earlier, which produced small .rds files. The program also bombs when I remove those flags and get the large .rds files.
What is a failsafe way to store compiled Stan code in R?
The only failsafe way is to package it. But for use on one computer, auto_write = TRUE works well as long as the version of rstan has not changed since it was serialized. If rstan has changed, then you need to delete the .rds files and call stan_model again with auto_write = TRUE. Also, in RMarkdown files, putting cache = TRUE in the chunks that do MCMC works reliably.
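A minimal sketch of both suggestions (the file name, data object, and chunk header are illustrative):

```r
# In an interactive session, or inside an R Markdown chunk declared as
# ```{r mcmc, cache=TRUE} so that knitr caches the expensive results:
library(rstan)
rstan_options(auto_write = TRUE)     # cache the compiled model as an .rds
m   <- stan_model("lrmconppo.stan")  # recompiles only if the .stan file changed
fit <- sampling(m, data = dat)       # 'dat' is a placeholder data list
```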
I’ve tried both of those methods without success. cache=TRUE creates a 1GB cache file for a long report, and auto_write has not been reliable for me (sensing changes that aren’t there). I see now how rstanarm does this in its stanmodels.R file. I assume that is the “package it” approach. I hope that will work without the other complexities of rstanarm/src. It would be simpler if there were reliable save and load functions that work with arbitrary objects. I wonder how R saves exact images of objects when packages are built.
That is helpful but seems to apply to packages that make rstan mandatory. In the rms package, rstan is optional and I have users run rms::stancompile() one time to compile all the Stan code (in Github) and store each object in an .rds file.
That is true. But users will have to install a C++ toolchain to call rms::stancompile, which many of them won’t be able to do, in order to serialize the .rds objects. And once they do, it will recompile if any of the following triggers occurs:

- non-existence of the .rds file
- the .rds file having a modification time earlier than the installed rstan release
- the .rds file not being valid
- the hash of the .stan code stored in the .rds not matching the current .stan code
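Those triggers could be approximated in plain R roughly like this (a sketch of the logic, not rstan’s actual implementation; `needs_recompile`, `stan_file`, and `rds_file` are made-up names, and a plain string comparison stands in for rstan’s hash check):

```r
needs_recompile <- function(stan_file, rds_file) {
  # 1. non-existence of the .rds file
  if (!file.exists(rds_file)) return(TRUE)
  # 2. .rds modified before the installed rstan was released
  rstan_date <- utils::packageDate("rstan")
  if (!is.na(rstan_date) && as.Date(file.mtime(rds_file)) < rstan_date)
    return(TRUE)
  # 3. the .rds not deserializing to a valid stanmodel
  m <- tryCatch(readRDS(rds_file), error = function(e) NULL)
  if (!inherits(m, "stanmodel")) return(TRUE)
  # 4. stored model code differing from the current .stan code
  current <- paste(readLines(stan_file), collapse = "\n")
  stored  <- paste(as.character(m@model_code), collapse = "\n")
  !identical(stored, current)
}
```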
Thanks for that. I’d like to keep it simpler than that. As a side question: when I create a stanmodels object as in rstanarm’s stanmodels.R code, which uses rstan::stanc, I get an object that is too small to have contained any compiled code, and I get an error when running rstan::sampling on an element of the stored object:
Error in get(paste0("model_", model_cppname)) :
object 'model_lrmconppo' not found
It’s confusing because I can’t find anything in rstanarm code that indicates that the stanmodels object is augmented after it is created.
If everyone were always using the rms package with knitr I would be tempted to have an initial setup chunk that compiles all the Stan code in rms and stores it in a list, and cache only that chunk. But faithful serialization would solve all this.
stanc just generates the C++ code. At some point, stan_model has to be called on that C++ code in order to compile it, which is what rstantools facilitates after having gone through it with rstanarm.
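To illustrate the distinction with a throwaway model:

```r
library(rstan)
code <- "parameters { real y; } model { y ~ normal(0, 1); }"

# stanc only translates Stan to C++: the result is a plain list
# (with the generated C++ source in its cppcode element) and holds
# no compiled code
cpp <- stanc(model_code = code)

# stan_model is the step that actually invokes the C++ toolchain and
# produces a compiled object usable by rstan::sampling
m <- stan_model(model_code = code)
```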
Thanks Ben. If you think of a way to do a “partial rstantools” approach for a package’s optional use of rstan please pass it along. In the meantime I’ll explore these options:
- Recommending the use of knitr with cache=TRUE for the Stan compile chunk, hoping that the cache file is always faithful in its serialization of R objects
- Exploring whether I should put the Bayesian modeling functions in a separate package that requires rstan (and may end up requiring rstanarm too), with pre-compiled Stan code only. For now I hesitate to do that because the Bayesian fitting functions are used with a lot of rms functions, not all of them exported.
For a while, the prophet package would compile its Stan program at installation time (i.e. it wasn’t optional), which at least ensures that the timestamps are right, but a few months ago they wisely decided to go the rstantools route.
There is an example of caching the compiled model in a .Rmd file at