Cmdstanr fails to read its own csv files for large number of parameters

ihrke · May 12, 2021, 8:13pm

I am fitting a hierarchical hidden Markov model with cmdstanr. I am using data from 20 subjects, each with around 1000 trials (distributed across 20 blocks). The forward algorithm is implemented directly, hence I have to store about 20 x 20 x 100 variables for the forward variables (and an equal number for the backward and forward-backward smoothed ones, as I wish to estimate by-trial probabilities).

Fitting the model works just fine, and with a lower number of subjects/blocks/trials there is no issue whatsoever. However, when using the full dataset, cmdstanr cannot read back its own output files. In fact, it gets stuck in some obscure computation when trying to access any of the fitted model variables (such as with fit$draws(), and even when trying to use fit$save_object()). This is the case even if I use optimize() instead of sample(), even though the resulting output file from the fit (attached) has only a single line (but a lot of variables, obviously) and is only about 9 MB in size.

This is the cmdstanr-file from an optimize()-run: https://dropfiles.org/asHh3fmL

Is this a known issue? Can I do anything to circumvent the issue? Currently, I am switching back to rstan.

Operating System: Linux (Debian 6.3.0-18)
CmdStan Version: 2.26.1
Compiler/Toolkit: GCC 6.3.0 20170516

mike-lawrence · May 12, 2021, 8:20pm

Can you check on the raw size of the CSVs?

ihrke · May 12, 2021, 8:25pm

It’s just 9 MB for the optimize-run. Here is the file https://dropfiles.org/asHh3fmL.

jonah · May 12, 2021, 10:21pm

@ihrke I was able to reproduce this using the file you shared (thanks for that). The problem is happening when CmdStanR calls posterior::subset_draws() towards the end of cmdstanr::read_cmdstan_csv.

I made a branch that has a temporary fix for this when reading in the csv after optimization (I think ultimately we need to fix posterior::subset_draws()):

remotes::install_github("stan-dev/cmdstanr@temp-fix-optimize-csv")

This should get it to work with optimization (at least it allows me to use read_cmdstan_csv() successfully with the file you provided). Unfortunately I’m not sure where the problem is happening when you’re using sampling but if you share that csv I can probably track it down.

Edit: @ihrke I updated the branch to also avoid using subset_draws for sampling, so perhaps it will solve that for you too, but I’m not 100% sure.

jsocolar · May 12, 2021, 10:22pm

Yeah, this is a known problem. A workaround for now is to use cmdstanr::read_cmdstan_csv() to read the csv files directly instead of using $draws().

Edit: @jonah notes that I’m wrong about this. Just for posterity, it is also currently the case that read_cmdstan_csv works for sampling fits with large numbers of parameters, but $draws() does not.
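For readers following along, here is a minimal sketch of that workaround. It assumes a fitted object whose CSV files are still on disk; the helper name read_selected_draws is hypothetical and not part of cmdstanr. Passing a small set of variables limits how much is read into memory.

```r
# Hypothetical helper (not part of cmdstanr): read selected variables
# straight from the CSV files behind a fitted object.
read_selected_draws <- function(fit, variables = NULL) {
  # $output_files() returns the paths of the CSV files written by CmdStan
  csv_files <- fit$output_files()
  # read_cmdstan_csv() returns a list; variables = NULL reads everything
  cmdstanr::read_cmdstan_csv(csv_files, variables = variables)
}

# Usage (assumes `fit` came from $sample() and has a parameter "gmu"):
# csv_contents <- read_selected_draws(fit, variables = "gmu")
# csv_contents$post_warmup_draws
```

For a sampling fit the draws should be in the post_warmup_draws element of the returned list; for an optimization fit, look at point_estimates instead.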

jonah · May 12, 2021, 10:23pm

In this case it also seems to happen with read_cmdstan_csv() unfortunately.

jonah · May 12, 2021, 10:34pm

Yeah, thanks for the reminder, I had forgotten about that. @rok_cesnovar We need to get back to that and figure something out.

jonah · May 12, 2021, 10:44pm

Actually I’m now pretty sure that the problem with draws() is also related to this issue with posterior::subset_draws() that I just opened:

ihrke · May 13, 2021, 10:17am

That’s amazing, thank you! I will try it asap!

ihrke · May 13, 2021, 11:49am

I tried using the branch with your fix and reading the csv now seems to work fine. I use the following convenience function to read the whole fit into memory (before storing it as an .RData file) and it is lightning fast (as opposed to taking ages before your fix).

cmdstanr.resolve <- function(fit){
  temp_rds_file <- tempfile(fileext = ".RDS")
  fit$save_object(file = temp_rds_file)
  fit <- readRDS(temp_rds_file)  
  return(fit)
}

However, I cannot use the fit$summary() or fit$draws() function for this object as I used to. The error I get is

> mod_opt_probed.r$draws("gmu")
Error in `[.default`(private$draws_, , variables, drop = FALSE) : 
  subscript out of bounds
> mod_opt_probed.r$summary()
Error: Can't subset columns that don't exist.
x Columns `variable` and `mean` don't exist.
Run `rlang::last_error()` to see where the error occurred.

jonah · May 13, 2021, 3:18pm

Oops I may have broken draws() on that branch when I fixed the CSV reading. But I think this PR that we just merged in the posterior package

github.com/stan-dev/posterior: make subset_draws efficient
stan-dev:master ← Ozan147:subset_draws/#129 (opened 09:15AM - 13 May 21 UTC by Ozan147, +28 -12)

Addresses #129. Makes `check_existing_variables()`, and therefore `subset_draws()`, much more efficient. For example, with default arguments, this suggested version can process 300k variables in a fraction of a second while the current version takes almost 10 minutes. However, both versions still have comparable runtimes when `regex = TRUE`.

```
x <- as_draws_matrix(matrix(rnorm(10 * 300000), nrow = 10, ncol = 300000))
microbenchmark::microbenchmark(
  posterior:::check_existing_variables(variables = colnames(x), x = x),
  posterior::subset_draws(x, variable = colnames(x)),
  times = 10, unit = "s"
)
Unit: seconds
       min        lq      mean    median        uq       max neval
 0.3240343 0.3354325 0.3669232 0.3541255 0.3870651 0.4765244    10
 0.4236527 0.4271074 0.4862439 0.4566513 0.5303358 0.6543742    10
```

will hopefully fix the problem. Can you try reinstalling both posterior and cmdstanr from master?

remotes::install_github("stan-dev/posterior")
remotes::install_github("stan-dev/cmdstanr")

and let me know if that fixes the problem? Sorry for the hassle, but thanks for helping us fix this!

ihrke · May 13, 2021, 5:58pm

Thanks, that seems to have fixed the problem! Thanks so much for your efforts and the incredibly fast fix!

jonah · May 13, 2021, 6:04pm

That’s great, thanks for trying that out. Glad it’s working now!

jonah · May 13, 2021, 9:59pm

@jsocolar I’m hopeful that with posterior::subset_draws() now fixed this will drastically improve the speed of $draws() with many parameters. I haven’t done any rigorous testing yet though.

jsocolar · May 13, 2021, 10:01pm

@jonah, can I test by updating posterior without rebuilding the R6 object, or do I need to rebuild after updating?

jonah · May 13, 2021, 10:28pm

My hunch is that you’ll need to rebuild the R6 object. Unfortunately if we update a method it doesn’t update the methods associated with existing R6 objects.

However, there may be an alternative: do you by any chance still have the CSV files or were those just written to temp files? If you still have the CSV files associated with the old R6 object then you can recreate the R6 object without having to rerun the model using as_cmdstan_fit(paths_to_csv_files). Then the resulting fit object would use the latest draws method.
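A rough sketch of that rebuild step, assuming the CSV files still exist; the helper name rebuild_fit and the directory "stan_output" are hypothetical.

```r
# Hypothetical helper: recreate a fit object from existing CmdStan CSV
# files without rerunning the model.
rebuild_fit <- function(csv_files) {
  # Fail early if the files were temp files that have since been removed
  stopifnot(length(csv_files) > 0, all(file.exists(csv_files)))
  cmdstanr::as_cmdstan_fit(csv_files)
}

# Usage (assumes an existing fit object `old_fit`):
# old_fit$save_output_files(dir = "stan_output")  # move CSVs out of the temp dir
# csv_files <- list.files("stan_output", pattern = "\\.csv$", full.names = TRUE)
# new_fit <- rebuild_fit(csv_files)
```

Calling $save_output_files() while the original session is still alive is what keeps the CSVs around; files left in the temp directory are typically gone once R exits.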

jsocolar · May 13, 2021, 10:29pm

Yeah, that’s what I meant by rebuild :)
I’ll go ahead and give it a crack.

jonah · May 13, 2021, 10:32pm

Cool, thanks for trying. You might also try using format = "draws_list" when running as_cmdstan_fit. According to @rok_cesnovar that’s the most efficient format to use if there are a ton of parameters. (That will just affect how the draws are stored internally. If you then use draws() it will use the regular default of “draws_array” unless you specify a different format.)
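As a sketch of what that might look like (the helper name fit_from_csv_as_list is hypothetical, and "gmu" is just the parameter name used earlier in this thread):

```r
# Hypothetical helper: recreate the fit from CSV files and store the draws
# internally as a draws_list.
fit_from_csv_as_list <- function(csv_files) {
  cmdstanr::as_cmdstan_fit(csv_files, format = "draws_list")
}

# Usage (assumes `csv_files` holds the paths to the CmdStan CSV files):
# new_fit <- fit_from_csv_as_list(csv_files)
# new_fit$draws("gmu")                           # default output format
# new_fit$draws("gmu", format = "draws_matrix")  # request another format
```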

jsocolar · May 14, 2021, 1:53am

My 3yo arrived home so I just got around to this.

$draws() is now blazing fast on a fit where it was previously unusable (250K parameters, now takes about 30 seconds, previously I killed it after 90 minutes).

jonah · May 14, 2021, 3:52am

Awesome, that’s great news! Thanks for testing it out for us.
