Yes, it is a dataset of 50,000 groups which I am working through using hierarchical modelling, but I am really only interested in the population parameters, plus maybe a few individual parameters to show some individual predictions. The issue is that the non-centered parameterisation generates a bunch of individual parameters in the transformed parameters block and in generated quantities.
Yes, I know, but I am using cmdstan because I am working with map_rect.
Thanks for the feedback anyway!
There haven't been any posts here for a long time, but I think the issue still exists. I have been using Stan a lot to analyze RNA-sequencing data, and since I also save the log_lik for model comparison, I have files of up to 18 GB. As suggested further up, I preprocess the files with bash (through system calls from R) and then read in the relevant preprocessed, smaller files with fread via Rscript to avoid the memory issues in RStudio. I was wondering whether there would be any interest in formalizing this a bit more and making it available in the form of an R package for the analysis of large cmdstan files (including bayesplot and loo, for example). I would be happy to participate. I already have the custom routines for my own project, but I think it would be great to generalize them. Since Stan can be parallelized, sampling time has become less of an issue, but large files still cause memory problems in R.
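For concreteness, here is a minimal sketch of that preprocessing pattern, assuming a single-chain CmdStan file called output_1.csv and that only the log_lik.* columns are wanted (both names are placeholders, not from my actual project): awk strips the '#' comment lines CmdStan interleaves, keeps only the matching columns, and fread then reads the much smaller result.

```r
library(data.table)

# awk: skip CmdStan's '#' comment lines, detect the header row, remember which
# columns start with "log_lik.", and print only those columns (header included).
cmd <- "
awk -F, '
  !/^#/ {
    if (!header) {
      header = 1
      for (i = 1; i <= NF; i++) if ($i ~ /^log_lik\\./) keep[i] = 1
    }
    first = 1
    for (i = 1; i <= NF; i++) if (keep[i]) {
      printf \"%s%s\", (first ? \"\" : \",\"), $i
      first = 0
    }
    printf \"\\n\"
  }
' output_1.csv > log_lik_1.csv"

system(cmd)                         # bash/awk does the column extraction
log_lik <- fread("log_lik_1.csv")   # the reduced file fits comfortably in memory
```

The same pattern works for any variable prefix; wrapping it in a function that takes the variable name and file paths is essentially what such a package would formalize.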
That sounds like it would be a great contribution, but before you dive in maybe take a look at what I've done here; I just spent the weekend getting during-sampling CmdStan-CSV-to-NetCDF4 conversion working. I'm still working on the post-sampling access scheme, and I'm also hoping to eventually be able to truncate the CSVs as they're being read, for better storage efficiency. (Note also the important omissions at present.)
Thanks for the prompt answer. That doesn't quite seem to be what I have in mind. I am assuming that storage is not an issue, or at least not a major one. In that case it would be sufficient to process the cmdstan output; no modification to Stan would be required. Maybe the main problem is that Stan writes all output into a single file. Instead of post-processing the cmdstan output, one could maybe add an option to Stan that allows certain variables (like log_lik) to be saved in separate files. Then the files should usually be small enough to be read into R with fread.
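To illustrate the downstream step, here is a hedged sketch of what consuming such a per-variable file could look like, whether it was written by a hypothetical CmdStan option or produced by preprocessing as above; log_lik_1.csv is a placeholder name and the draws are assumed to come from a single chain.

```r
library(data.table)
library(loo)

# One small file per variable: draws in rows, observations in columns.
log_lik <- as.matrix(fread("log_lik_1.csv"))

# Standard loo workflow on the extracted matrix (single chain assumed here).
r_eff <- relative_eff(exp(log_lik), chain_id = rep(1, nrow(log_lik)))
print(loo(log_lik, r_eff = r_eff))
```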
Ah, then yes, that gets into modifying how cmdstan writes out data. There has been substantial discussion on that, and while everyone agrees that the current scheme is far from ideal, there's very little agreement on what to do as an alternative. @mitzimorris mentioned at a recent Stan Gathering that there might be something on the horizon in this domain, though.
I'll note that in the case of limited storage space, my proposal linked above to allow the csv files to be truncated would largely solve that, since the program consuming the csv could even use a user-supplied keep/toss-list if only a subset of quantities needs to be saved.