I’m looking for ideas on how to process large (multi-GB per chain) CmdStan output CSVs.
I’m pretty comfortable with shell scripts, data.table, databases etc., but at the moment I only have 64 GB RAM locally (& a finite lifespan, I presume), so I need to avoid copying objects in memory during processing, which I presume the posterior package hasn’t avoided entirely. Either way, it tends to blow up, e.g. when I start calling summary. I also want to make the most of multiple cores where I can.
My current tentative plan is to try the base CmdStan summary & diagnostic outputs, then parse each chain into a data.table using a sed preprocessing step to remove the unnecessary lines. After that I can do any necessary calculations in parallel in data.table without too much copying, or dump it into a database. I’m also thinking about converting the posterior back to JSON & using Stan for some of the postprocessing.
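For anyone wanting to try the sed step, here is a minimal sketch. It assumes the lines to remove are the `#` comment lines that CmdStan writes before, between and after the draws. The file names and the tiny stand-in CSV are made up for illustration, not taken from a real run.

```shell
#!/bin/sh
# Sketch only: strip CmdStan comment lines so a table reader sees header + draws.
# The directory and file names below are hypothetical.
set -eu

workdir="${TMPDIR:-/tmp}/cmdstan_csv_demo"
mkdir -p "$workdir"

# Tiny stand-in for a real chain file (assumed layout: '#' comments around the table).
cat > "$workdir/chain_1.csv" <<'EOF'
# model = example_model
# method = sample (Default)
lp__,accept_stat__,alpha,beta
# Adaptation terminated
# Step size = 0.5
-7.1,0.98,0.10,1.20
-6.9,0.91,0.30,1.00
# Elapsed Time: 0.01 seconds (Sampling)
EOF

# Delete every line that starts with '#'; the header and the draws remain.
sed '/^#/d' "$workdir/chain_1.csv" > "$workdir/chain_1_clean.csv"

cat "$workdir/chain_1_clean.csv"
```

The same sed command can also be run on the fly from R via the `cmd` argument of `data.table::fread()`, which avoids writing a second copy of each chain to disk.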
I just thought I’d ask for other suggestions before I get too deep in the weeds - what big-posterior methods have you had good (or bad) luck with?
OK, so I’ve been playing with this most of the day. I’ve dropped all generated quantities from the Stan file, which reduced the output file size to ~1 GB each and allows me to run fit$summary() with plenty of memory to spare. The .cores argument to fit$summary() really helps, and I’ll be shifting all postprocessing to use standalone generated quantities (chapter 12, “Standalone Generate Quantities”, in the CmdStan User’s Guide).
That is one way of solving it; the other would be to run summary on parameters/transformed parameters only:
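library(cmdstanr)
# Example model and data shipped with cmdstanr
logistic_model_path <- system.file("logistic.stan", package = "cmdstanr")
logistic_data_path <- system.file("logistic.data.json", package = "cmdstanr")
mod <- cmdstan_model(logistic_model_path)
fit <- mod$sample(data = logistic_data_path)
# Variable names by block: (transformed) parameters vs. generated quantities
param_names <- c(names(mod$variables()$parameters), names(mod$variables()$transformed_parameters))
gq_names <- names(mod$variables()$generated_quantities)
# Summarise only the (transformed) parameters, skipping the generated quantities
fit$summary(variables = param_names)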
Thanks for the suggestion @rok_cesnovar.
I’d been considering which approach to take for a while, & had finally decided to split my generated quantities out to allow more flexible postprocessing without refitting the original model.
In its production form this model is running across ~200 datasets on a cluster - now I’ve got the base model right I don’t want (or can’t really afford) to have to refit them all each time I want to tweak the output!
That makes total sense! I just wanted to post an alternative approach for anyone coming to your thread later.
In my experience, the issue you outlined is common wherever Stan models are used in "production", whether in industry or academia.
I also use standalone generated quantities across the models I run in production, with 100k+ quantities, and definitely encourage everyone to use them, although implementing them with 2 separate models can sometimes be a bit annoying.
Hi, just to pick up on the issue of large CSVs in your original question, I’ve always preferred to wrangle them in C++, that just being what I learnt in my youth. Dump them out and then deal with it, rather than constraining yourself to your RAM or interface software (I’m looking at you RStudio). This ageing repo is not great by anyone’s standards but might give some ideas to someone.
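The same idea of streaming through the file rather than loading it can be sketched without C++. The example below is not taken from the repo mentioned above. It is a minimal awk one-pass illustration that computes per-column means while holding only one running sum per column in memory. The file names and the stand-in CSV are hypothetical, and it assumes the usual layout of `#` comment lines around a header and comma-separated draws.

```shell
#!/bin/sh
# Sketch only: one-pass column means over a CmdStan-style CSV with awk.
# Memory use is one running sum per column, regardless of the number of draws.
set -eu

workdir="${TMPDIR:-/tmp}/cmdstan_csv_demo"
mkdir -p "$workdir"

# Tiny stand-in chain file (hypothetical contents).
cat > "$workdir/chain_2.csv" <<'EOF'
# model = example_model
lp__,alpha,beta
# Adaptation terminated
-7.0,0.10,1.20
-6.0,0.30,1.00
# Elapsed Time: 0.01 seconds (Sampling)
EOF

awk -F, '
  /^#/ || NF == 0 { next }                      # skip comments and blank lines
  !seen_header {                                # first remaining line is the header
    for (i = 1; i <= NF; i++) name[i] = $i
    ncol = NF; seen_header = 1; next
  }
  { n++; for (i = 1; i <= ncol; i++) sum[i] += $i }   # accumulate each draw
  END {
    if (n > 0)
      for (i = 1; i <= ncol; i++) printf "%s,%.4f\n", name[i], sum[i] / n
  }
' "$workdir/chain_2.csv" > "$workdir/chain_2_means.csv"

cat "$workdir/chain_2_means.csv"
```

This only covers means. Quantiles and convergence diagnostics such as R-hat need more than a single running sum per column, so they would still call for dedicated tooling.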