Best practices for saving multiple large stanfit files


#1

Hello, this seems like it should be simple and is related to this question. I am trying to run a sensitivity analysis for several parameters using rstan on an HPC cluster. I am generating multiple simulated data sets and applying four Stan models to each data set using:

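library(purrr)
library(rstan)

# Fit model1 to every simulated data set; returns a list of stanfit objects
map(seq_along(list.of.simdata), function(x) sampling(model1,
                    data = list.of.simdata[[x]],
                    iter = n_iter, warmup = n_wmp, chains = n_c, seed = 8029,
                    control = list(adapt_delta = 0.99, max_treedepth = 16)))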

The result is a list of stanfit objects (length = length of list.of.simdata). I have been trying to use saveRDS to save each of these lists, but the resulting file is usually more than 6GB and eventually overruns my available disk space. I’ve tried adding the compress="xz" argument to saveRDS (because this seems to achieve the greatest amount of compression).

Unfortunately, that dramatically increases the time it takes to save the file. I eventually need to compare the posterior draws for about 15 parameters to the originally simulated values, so all I really need is the draws along with any of the sampler parameters needed to confirm that I didn’t get any warnings (divergences, BFMI, etc). I’m wondering what the best way is to ensure that I retain the ability to access samples, evaluate sampler performance, and run diagnostics (e.g., traceplots) without exhausting disk space and while retaining the ability to open files on my local machine in a new R session…

Is it better to just extract the draws and sampler info and save them as their own objects? What are the drawbacks to doing that (rather than finding an efficient way to compress the list of stanfit objects)? Any pointers would be most appreciated.


#2

If you’re running out of storage, extract the subset you need into arrays and write those using saveRDS. You can calculate diagnostics before reducing what you keep and save those as separate objects too. As a bonus, you reduce load time.
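Something along these lines might work as a starting point. It is only a rough sketch: slim_fit and the pars argument are names made up for illustration, and which diagnostics are worth keeping depends on what you plan to check later. It uses rstan's extract, get_sampler_params, get_num_divergent, get_num_max_treedepth and get_bfmi, plus the summary method for stanfit objects.

```r
library(rstan)

# Hypothetical helper (not part of rstan): reduce one stanfit object to the
# draws and sampler diagnostics needed later, so the full object can be dropped.
slim_fit <- function(fit, pars) {
  list(
    # Post-warmup draws as an iterations x chains x parameters array
    draws = rstan::extract(fit, pars = pars, permuted = FALSE, inc_warmup = FALSE),
    # One matrix per chain with columns such as divergent__, treedepth__, energy__
    sampler_params = rstan::get_sampler_params(fit, inc_warmup = FALSE),
    # Diagnostics computed while the full stanfit object is still available
    n_divergent = rstan::get_num_divergent(fit),
    n_max_treedepth = rstan::get_num_max_treedepth(fit),
    bfmi = rstan::get_bfmi(fit),
    # Summary table (includes n_eff and Rhat) for the retained parameters only
    summary = summary(fit, pars = pars)$summary
  )
}

# Usage sketch with the names from the question; keep_pars and the file name
# pattern are placeholders. Fitting and saving one data set at a time means the
# full list of stanfit objects never has to be held in memory or written to disk.
# for (i in seq_along(list.of.simdata)) {
#   fit <- sampling(model1, data = list.of.simdata[[i]],
#                   iter = n_iter, warmup = n_wmp, chains = n_c, seed = 8029,
#                   control = list(adapt_delta = 0.99, max_treedepth = 16))
#   saveRDS(slim_fit(fit, keep_pars), sprintf("model1_sim%03d.rds", i))
#   rm(fit)
# }
```

Because the draws are kept with the chain dimension intact (permuted = FALSE), they should still be usable for traceplots in a fresh R session, for example by passing the array to bayesplot::mcmc_trace. The main drawback is that anything not extracted here is gone, so functions that expect a full stanfit object will no longer work on the saved files.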