Processing Large Posterior

Stuart_Russell · July 27, 2022, 6:02am

Hi All,

I’m looking for ideas on how to process large (multi-GB per chain) CmdStan output CSVs.

I’m pretty comfortable with shell scripts, data.table, databases etc., but at the moment I only have 64 GB of RAM locally (& a finite lifespan, I presume), so I need to avoid copying objects in memory during processing, which I presume the posterior package hasn’t avoided entirely. Either way, it tends to blow up, e.g. when I start calling summary. I also want to make the most of multiple cores where I can.

My current tentative plan is to try the base CmdStan summary & diagnostic outputs, then parse each chain into a data.table using data.table::fread with a sed preprocessing step to remove the unnecessary lines. After that I can do any necessary calculations in parallel in data.table without too much copying, or dump it into a database. I’m also thinking about converting the posterior back to JSON & using Stan for some of the postprocessing.
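A minimal sketch of the sed-plus-fread step described above, assuming a Unix-like shell with sed on the PATH and the data.table package installed. The file written here is a tiny hypothetical stand-in for one chain's output, not real CmdStan output, and the column names are illustrative only.

```r
library(data.table)

# Hypothetical stand-in for one chain's CSV: comment lines start with '#'
# and can appear before the header, after it, and at the end of the file.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "# model = example_model",
  "lp__,accept_stat__,alpha,beta.1",
  "# Adaptation terminated",
  "-10.1,0.95,0.3,1.2",
  "-9.8,0.91,0.4,1.1",
  "-10.4,0.99,0.2,1.3",
  "# Elapsed Time: 0.1 seconds (Sampling)"
), csv_path)

# sed deletes the comment lines before fread parses the result;
# `select` keeps only the named columns, which limits memory use.
draws <- fread(
  cmd = paste("sed '/^#/d'", shQuote(csv_path)),
  select = c("alpha", "beta.1")
)

# Column-wise posterior means over the selected columns.
post_means <- draws[, lapply(.SD, mean)]
print(post_means)
```

On a real multi-GB file, choosing the columns with `select` up front is probably the main lever, since columns that are never read never occupy RAM.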

I just thought I’d ask for other suggestions before I get too deep in the weeds - what big-posterior methods have you had good (or bad) luck with?

Thanks

Stuart

Stuart_Russell · July 27, 2022, 4:08pm

OK, so I’ve been playing with this most of the day. I’ve dropped all generated quantities from the Stan file, which reduced the output file size to ~1 GB each and lets me run fit$summary() with plenty of memory to spare. The .cores argument to fit$summary() really helps, and I’ll be shifting all postprocessing to standalone generated quantities (see “Standalone Generate Quantities” in the CmdStan User’s Guide).
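For readers who have not used standalone generated quantities, here is a rough sketch with cmdstanr, adapted from the pattern in the cmdstanr documentation. The Bernoulli model and data are illustrative assumptions (not the model discussed in this thread), and the array syntax assumes a reasonably recent CmdStan install.

```r
library(cmdstanr)

# Fit a small model with no generated quantities block, so the output stays small.
mcmc_program <- write_stan_file("
data {
  int<lower=0> N;
  array[N] int<lower=0, upper=1> y;
}
parameters {
  real<lower=0, upper=1> theta;
}
model {
  y ~ bernoulli(theta);
}
")
mod_mcmc <- cmdstan_model(mcmc_program)
data_list <- list(N = 10, y = c(1, 1, 0, 0, 0, 1, 0, 1, 0, 0))
fit_mcmc <- mod_mcmc$sample(data = data_list, seed = 123, refresh = 0)

# Second program: same data and parameters blocks, plus the quantities wanted later.
gq_program <- write_stan_file("
data {
  int<lower=0> N;
  array[N] int<lower=0, upper=1> y;
}
parameters {
  real<lower=0, upper=1> theta;
}
generated quantities {
  array[N] int y_rep = bernoulli_rng(rep_vector(theta, N));
}
")
mod_gq <- cmdstan_model(gq_program)

# Reuse the existing posterior draws; nothing is refitted.
fit_gq <- mod_gq$generate_quantities(fit_mcmc, data = data_list, seed = 123)
print(dim(fit_gq$draws("y_rep")))
```

The generated quantities block can then be edited and rerun against the saved draws as often as needed, which is the appeal when refitting is expensive.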

rok_cesnovar · July 27, 2022, 4:25pm

That is one way of solving it, the other would be to run summary on parameters/transformed parameters only.

Example:

library(cmdstanr)

logistic_model_path <- system.file("logistic.stan", package = "cmdstanr")
logistic_data_path <- system.file("logistic.data.json", package = "cmdstanr")

mod <- cmdstan_model(logistic_model_path)

fit <- mod$sample(data = logistic_data_path)

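# Names of the parameters and transformed parameters declared in the model
param_names <- c(names(mod$variables()$parameters), names(mod$variables()$transformed_parameters))
# Names of the generated quantities (not summarised here)
gq_names <- names(mod$variables()$generated_quantities)

# Summarise only the parameters, skipping the generated quantities
fit$summary(param_names)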
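The same subset-then-summarise idea can be sketched with the posterior package directly, which does not need a CmdStan install. This uses posterior's built-in eight-schools example draws as a stand-in for a real fit; the choice of summaries and of two cores is an assumption for illustration.

```r
library(posterior)

# Built-in example draws (eight schools): variables mu, tau and theta[1..8].
draws <- example_draws("eight_schools")

# Keep only the variables of interest before summarising.
theta_draws <- subset_draws(draws, variable = "theta")

# Request just the summaries needed; .cores > 1 spreads variables across processes.
theta_summary <- summarise_draws(theta_draws, "mean", "sd", "rhat", .cores = 2)
print(theta_summary)
```

Asking for fewer summary functions can also cut the run time noticeably on large posteriors, since the default set includes several quantile and effective-sample-size calculations.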

Stuart_Russell · July 27, 2022, 7:21pm

Thanks for the suggestion @rok_cesnovar.

I’d been considering which approach to take for a while, & had finally decided to split my generated quantities out to allow more flexible postprocessing without refitting the original model.

In its production form this model is running across ~200 datasets on a cluster - now I’ve got the base model right I don’t want (or can’t really afford) to have to refit them all each time I want to tweak the output!

rok_cesnovar · July 27, 2022, 7:27pm

That makes total sense! I just wanted to post an alternative approach for anyone coming to your thread later.
In my experience, the issue you outlined is common wherever Stan models are used in "production", whether in industry or academia.

I use separate generated quantities as well, across models I run in production with 100k+ quantities, and definitely encourage everyone to use them, although maintaining two separate models can sometimes be a bit annoying.

robertgrant · July 30, 2022, 12:18pm

Hi, just to pick up on the issue of large CSVs in your original question, I’ve always preferred to wrangle them in C++, that just being what I learnt in my youth. Dump them out and then deal with it, rather than constraining yourself to your RAM or interface software (I’m looking at you RStudio). This ageing repo is not great by anyone’s standards but might give some ideas to someone.
