Just a disclaimer that I posted a more or less identical question on Stack Overflow.
I recently ran a Bayesian regression on a large, messy health dataset: a longitudinal mixed-effects model with random slopes and nested intercepts (encounters within clients). As a result I have an R object comfortably larger than anything I have worked with before: 7,722,886,880 bytes (7.7 GB!).
When I tried to save this object as an .RData file, it took a long time and then failed with the error:

```
Error in gzfile(file, "wb") : cannot open the connection
In addition: Warning message:
In gzfile(file, "wb") :
  cannot open compressed file 'fooFolder/fooSubFolder/foo.RData', probable reason 'Operation canceled'
```
Which I assume means 'file's too big, holmes'. Does anyone know a way to save this object? Maybe some `options()` setting that lets me override a timeout? The model and the k-fold validation took a long time to run, and I'd rather not do it all again. I figure some people on this forum may have run into similar problems.
Start with two quick checks. They are a bit pedantic, but they will spare you additional suffering if one of them turns out to be the culprit:
- Make sure you have enough disk space to save a large file. The file size on disk, and thus the free space required, may differ from the byte count `object.size()` reports. It's hard to say by exactly how much, so for a 7.7 GB in-memory object I'd want at least 15 GB of free disk space before trying.
- Make sure you have specified just one object in `save()`. I can't count the number of times I have waited ages for a small object to save, only to realize that I was inadvertently saving every object in my workspace.
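Both checks can be sketched on a toy object; the object and file names below are made up for illustration, not taken from the original model:

```r
x <- rnorm(1e5)                          # toy stand-in for the big fit object
print(object.size(x))                    # in-memory byte count

f <- tempfile(fileext = ".RData")
save(x, file = f)                        # note: save(x, ...), one named object
file.size(f)                             # size on disk can differ from object.size()

# save.image("all.RData")                # pitfall: silently saves the whole workspace
# saveRDS(x, "x.rds")                    # single-object alternative; readRDS() to load
unlink(f)
```

`saveRDS()` is single-object by construction, so it cannot accidentally grab the rest of your workspace, and it lets you pick the variable name at load time.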
Assuming these weren't the cause, there are still a few ideas to try:
- Use a different compression algorithm via the `compress` argument to `save()`. This is more likely to shrink the file on disk than to speed up the save itself. In my experience `compress = "xz"` gives the smallest files, at the cost of slower writes.
- Switch to the `qs` package, which offers a much faster and highly tunable serialization format.
- Step back and consider whether you need to save the whole object. Do you need all of the diagnostics, initial values, etc. or could you make do with just the posterior draws?
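To try the compression idea, you can compare algorithms on a toy object first; `save()` accepts `"gzip"`, `"bzip2"`, and `"xz"`, with xz usually smallest but slowest:

```r
x <- rep(rnorm(1000), 100)               # repetitive data, so compression matters
sizes <- vapply(c("gzip", "bzip2", "xz"), function(alg) {
  f <- tempfile(fileext = ".RData")
  save(x, file = f, compress = alg)      # same object, different algorithm
  sz <- file.size(f)
  unlink(f)
  sz
}, numeric(1))
print(sizes)                             # bytes on disk per algorithm
```

On a 7.7 GB object the differences in both size and write time can be substantial, so it's worth benchmarking on a small slice before committing to one.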
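A minimal `qs` sketch (this assumes the package is installed; `qsave()`/`qread()` are its save/load pair, and the `preset` and `nthreads` arguments tune the speed-size trade-off):

```r
# install.packages("qs")                 # one-time setup
fit <- list(draws = matrix(rnorm(1e4), 1000, 10))  # toy stand-in for the model
f <- tempfile(fileext = ".qs")

qs::qsave(fit, f)                        # typically much faster than save()
fit2 <- qs::qread(f)                     # round-trips the object exactly

qs::qsave(fit, f, preset = "fast", nthreads = 2)   # trade file size for speed
unlink(f)
```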
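The "save less" idea can be sketched like this; the element names (`draws`, `diagnostics`, `inits`) are hypothetical and will differ depending on your modeling package:

```r
fit <- list(                             # toy stand-in for a fitted model object
  draws       = matrix(rnorm(1e4), 1000, 10),  # posterior draws: keep
  diagnostics = rnorm(1e6),                    # bulky extras: consider dropping
  inits       = rnorm(1e6)
)
f_full <- tempfile(fileext = ".rds")
f_lean <- tempfile(fileext = ".rds")
saveRDS(fit, f_full)                     # everything
saveRDS(fit["draws"], f_lean)            # keep only the posterior draws
c(full = file.size(f_full), lean = file.size(f_lean))
unlink(c(f_full, f_lean))
```

If the draws are all you need for downstream inference, the lean file can be orders of magnitude smaller.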
In the R
blrm function (in the rmsb package) I go to a lot of trouble to save lean fit objects, and to reference such saved files when determining whether anything has changed that would require running the sampling again. You might profit from looking at the code on CRAN or
GitHub.com/harrelfe. I also make sure that no environments are carried along, for example when storing functions created during model fitting. Such R functions can carry along multi-gigabyte environments that are not needed, so I store them as character strings in the fit object to avoid that.
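Here is a small demonstration of the environment problem described above; `length(serialize(f, NULL))` shows how many bytes the function would occupy inside a saved object, and the `deparse()` trick at the end is a sketch of the character-string approach with made-up names:

```r
make_fn <- function() {
  big <- rnorm(1e6)                      # ~8 MB, captured in the closure's environment
  function(x) x + 1                      # never touches `big`, but drags it along
}
f <- make_fn()
length(serialize(f, NULL))               # millions of bytes: environment included

environment(f) <- globalenv()            # one fix: detach the heavy environment
length(serialize(f, NULL))               # now only a few hundred bytes

# The character-string approach: store the source, re-create the function on load.
f_src <- deparse(make_fn())              # character vector, no environment baggage
g <- eval(parse(text = f_src))
g(1)                                     # behaves the same as the original
```

Resetting the environment is only safe when the function doesn't actually use anything from it, which is why storing the source text is the more conservative option.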