Cmdstanr save_object() takes a long time

n_a_gilbert · July 31, 2023, 3:43pm

I recently switched to using cmdstanr and am learning the differences from rstan, etc. I’ve run a multispecies distance sampling model and am encountering trouble with saving the output with the suggested save_object() method. It is taking forever to save - the model ran in ~4 days, and now has spent >2 days trying to save. This is on a supercomputer, and I gave the job 4 cpus each with 10GB. There should be 2k iterations saved in the posterior. It’s a fairly big model (1000 sites x 10 species x 30 years), but it seems absurd that saving should take this long given how quickly the model itself ran. Has anyone encountered similar issues or have suggestions for troubleshooting the saving process?

Below is the code I used to save the object:

fit$save_object(file = paste0("results_", Sys.Date(), ".RDS"))

Thanks!

Bob_Carpenter · August 1, 2023, 9:45pm

I don’t know what the cause is, but you’re not the only person to have reported this. Maybe @Jonah or @mitzimorris knows the root cause—Mitzi and Jonah have worked on CmdStanPy and CmdStanR and know the most about them.

Is it really 2K iterations across all chains? Are you saving warmup? If it really is 2K, then the total number of values stored is 2K iterations * (1000 * 10 * 30) parameters/iteration = 600M double-precision floating point values, or about 5GB, assuming one parameter for each of those things you listed.
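As a quick check of that arithmetic, here is a minimal sketch in base R. It assumes exactly one parameter per site x species x year cell and 8 bytes per double; the real model may store more quantities than that.

```r
# Rough size of the stored draws, assuming one parameter per
# site x species x year cell (an assumption, not the actual model).
n_draws  <- 2000             # post-warmup iterations across all chains
n_params <- 1000 * 10 * 30   # sites x species x years
n_values <- n_draws * n_params
size_gb  <- n_values * 8 / 1e9   # 8 bytes per double-precision value

n_values   # 6e+08
size_gb    # 4.8
```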

For troubleshooting, you can see if CmdStan has successfully written the draws to one or more files (depending on number of chains). If so, you can move the files out of temp directories, kill the process, and try again.
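A minimal sketch of that check, assuming `fit` is a CmdStanR fit object in a session that is still responsive (for example at the end of the fitting script, before calling save_object()). The helper name `rescue_csv_files` and the directory name `stan_csv` are hypothetical; `$output_files()` and `$save_output_files()` are the CmdStanR methods for locating and saving the CSV files.

```r
# Hypothetical helper: confirm the per-chain CSV files exist, then save
# them to a permanent directory outside the temporary directory.
rescue_csv_files <- function(fit, dir = "stan_csv", basename = "results") {
  csv_files <- fit$output_files()          # paths to the CmdStan CSV files
  found <- file.exists(csv_files)
  if (!all(found)) {
    stop("Missing CSV files: ", paste(csv_files[!found], collapse = ", "))
  }
  message("Total CSV size (GB): ", round(sum(file.size(csv_files)) / 1e9, 2))

  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  fit$save_output_files(dir = dir, basename = basename)

  # Return the paths of the CSV files now in the permanent directory
  list.files(dir, pattern = "\\.csv$", full.names = TRUE)
}
```

In a later session, the saved files can be turned back into a fit object with `cmdstanr::as_cmdstan_fit()`, passing the paths returned above.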

jonah · August 2, 2023, 5:19pm

The code for the save_object method is pretty simple:

function(file, ...) {
  self$draws()
  try(self$sampler_diagnostics(), silent = TRUE)
  try(self$init(), silent = TRUE)
  try(self$profiles(), silent = TRUE)
  saveRDS(self, file = file, ...)
  invisible(self)
}

If it’s taking a long time it would either be in the self$draws() line or the saveRDS() line. self$draws() makes sure the posterior draws have all been read in from the CSV files, and this can be pretty slow with lots of parameters/transformed parameters/generated quantities. But if the draws have been read in already (via a previous call to draws() or summary() or any method that requires the draws) then it won’t read them in again, in which case it would only be the saveRDS part that’s time consuming. The ... lets you pass arguments to base::saveRDS, so you could change the type of compression it does, but I’m not sure if that would help or hurt in this case.
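To see which of the two steps dominates, they can be timed separately. The sketch below uses a stand-in matrix rather than a real fit, so the timings are only illustrative; it compares the default (gzip-compressed) saveRDS() with `compress = FALSE`, which is one of the arguments that could be passed through `...`.

```r
# Stand-in for a large set of draws (no real fit object is used here).
set.seed(1)
x <- matrix(rnorm(1e6), ncol = 100)

f_default <- tempfile(fileext = ".RDS")
f_nocomp  <- tempfile(fileext = ".RDS")

# Default saveRDS() applies gzip compression; compress = FALSE skips it.
t_default <- system.time(saveRDS(x, f_default))
t_nocomp  <- system.time(saveRDS(x, f_nocomp, compress = FALSE))

rbind(default = t_default, uncompressed = t_nocomp)[, "elapsed"]
file.size(c(f_default, f_nocomp)) / 1e6   # file sizes in MB

# With a real fit, the analogous calls would be (not run here):
# system.time(fit$draws())
# system.time(fit$save_object(file = "results.RDS", compress = FALSE))
```

Skipping compression trades a larger file for less CPU time, so whether it helps depends on whether disk space or compute is the bottleneck.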

You can also try saving the object any other way you want. As long as you make sure everything you need has already been read into memory from the CSV files (draws, sampler diagnostics, etc.), you can use any means available to R users to save the object instead of using fit$save_object(). There may be faster options that I’m not aware of.

jonah · August 2, 2023, 5:24pm

For example, the qs package (GitHub: traversc/qs, "Quick serialization of R objects").
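A minimal sketch of what that could look like, assuming the qs package is installed. The list below is only a stand-in for draws pulled out of a fit (e.g. `draws <- fit$draws()`); saving the extracted draws rather than the whole fit object is a choice made for this example, not something tested on a CmdStanR fit.

```r
# Sketch: write and read back an in-memory object with qs.
# The list is a stand-in for draws extracted from a fit.
if (requireNamespace("qs", quietly = TRUE)) {
  draws <- list(theta = matrix(rnorm(1e4), ncol = 10))
  f <- tempfile(fileext = ".qs")
  qs::qsave(draws, f)          # serialize to disk
  restored <- qs::qread(f)     # read back into memory
  stopifnot(identical(draws, restored))
}
```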

saudiwin · December 18, 2023, 2:00pm

Following up on this - would there be a way of importing the CSV files into some kind of other storage format, like parquet, for really big models? Or maybe into an SQLite DB? Having to load all the CSV files into memory can be a limitation for really big models.

jonah · December 19, 2023, 10:18pm

We haven’t implemented anything like that, but it’s definitely something we’re interested in. There is a design document that was approved (but not implemented yet) that mentions parquet support for CmdStan:

https://github.com/stan-dev/design-docs/blob/master/designs/0032-stan-output-formats.md

- Feature Name: stan-output-formats
- Start Date: 2022-01-11
- RFC PR:
- Stan Issue:

## Summary
[summary]: #summary

This design addresses the problem of creating a general and extensible framework
for handling the outputs of the core Stan inference algorithms.
It provides an alternative to the use of the non-standard
[Stan CSV file format](https://mc-stan.org/docs/cmdstan-guide/stan_csv.html)
as the single record of one run of an inference algorithm.
Instead, the outputs will consist of multiple files, using
standard human- and machine-readable formats, resulting in
a clean separation of different kinds of information into type-appropriate, commonly used formats
which will make it easier to use and create tools for analysis and visualization.
This framework will also make it easier to add new outputs and diagnostics to the inference algorithms.

## Motivation


saudiwin · January 7, 2024, 1:07am

That is cool. I may need to do something like that myself just to handle some really big models I’m fitting.
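In the meantime, one possible workaround is to read only a subset of variables from the CSV files and write them to parquet yourself. This is a rough sketch, not a built-in CmdStanR feature: the helper name `csv_to_parquet` is hypothetical, and it assumes the cmdstanr, posterior, and arrow packages are installed.

```r
# Hypothetical helper: read selected variables from Stan CSV files and
# write the post-warmup draws to a parquet file.
csv_to_parquet <- function(csv_files, variables, out_file) {
  for (pkg in c("cmdstanr", "posterior", "arrow")) {
    if (!requireNamespace(pkg, quietly = TRUE)) {
      stop("Package '", pkg, "' is required for this sketch.")
    }
  }
  # Reading only the requested variables keeps memory use down
  csv <- cmdstanr::read_cmdstan_csv(csv_files, variables = variables)
  draws_df <- posterior::as_draws_df(csv$post_warmup_draws)
  arrow::write_parquet(as.data.frame(draws_df), out_file)
  invisible(out_file)
}

# Example call (not run; file and variable names are placeholders):
# csv_to_parquet(fit$output_files(), variables = "beta", "beta.parquet")
```

Calling it once per block of variables would keep each read small, at the cost of parsing the CSV files several times.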
