I have a model that produces CSV files of tens of GB in size. This is fine, except that creating the resulting object in R is slow and sometimes impossible due to the size. However, I don’t need to process all the variables at once, so I realized that I can use something like the following to read in only a subset of the samples:
(Not sure if there is a cleaner way that doesn’t rely on unexported functionality.)
My question is: when I run model$sample(...), is there a way to tell $sample that it should not even try to create and return the results to R (which would cause an error due to lack of memory)?
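Roughly something like this (a sketch; the file names and the variable name `theta` are placeholders for my actual output files and parameters):

```r
library(cmdstanr)

# Paths to the CmdStan output CSVs from a previous run (placeholders)
csv_files <- c("output_1.csv", "output_2.csv", "output_3.csv", "output_4.csv")

# Read only the variables I actually need instead of the whole fit;
# ::: is used because I wasn't sure this is exported in my cmdstanr version
partial <- cmdstanr:::read_cmdstan_csv(csv_files, variables = c("theta"))
theta_draws <- partial$post_warmup_draws
```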
I don’t think this functionality is exposed anywhere by cmdstanr, but I agree that you have a good use case for it, and perhaps it would be a nice feature. If you want to use the package internals to do this, check out run_cmdstan in cmdstanr/R/run.R at master · stan-dev/cmdstanr · GitHub
I don’t think $sample reads in all the samples by default; you have to call $draws for that. Perhaps it is the calculation of diagnostics that runs out of RAM? If so, you can set diagnostics = FALSE if you know what you’re doing.
If you have auxiliary variables for which you don’t need to save draws at all, define them in the model block or in a local block (an extra pair of {}) in transformed parameters.
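For example, something like this (a minimal sketch with a made-up toy model; the point is just that `resid` lives only inside the local block and is never written to the CSVs, while `ss` still is):

```r
library(cmdstanr)

# Sketch of the local-block trick:
# `resid` is declared inside the extra {} in transformed parameters,
# so it can be used there but its draws are never saved to the CSVs;
# `ss` is declared at the top level, so it is saved as usual.
code <- write_stan_file("
data {
  int<lower=1> N;
  vector[N] y;
}
parameters {
  real mu;
  real<lower=0> sigma;
}
transformed parameters {
  real ss;
  {
    vector[N] resid = y - mu;  // local: not saved
    ss = dot_self(resid);      // saved
  }
}
model {
  y ~ normal(mu, sigma);
}
")
mod <- cmdstan_model(code)
```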
Working with such large models with R and cmdstanr can be frustrating. Perhaps you will find my package Stanislaw useful. It extracts subsets of draws directly from CmdStan CSVs and can also calculate posterior summaries much faster than $summary.
This is right: it shouldn’t read in all the draws until you ask it to do something that requires them (e.g. $draws(), $summary(), printing, etc.). For turning off reading in the diagnostics I would use diagnostics = "" or diagnostics = NULL, although FALSE might also work; I haven’t tested it.
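For example (a sketch; `mod` and `data_list` stand in for your actual model and data):

```r
# Skip the automatic post-sampling diagnostic checks
fit <- mod$sample(
  data = data_list,
  diagnostics = NULL  # or diagnostics = ""
)
```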
Thanks, indeed I was mixing this up with my experience with rstan fit objects; the out-of-memory issue actually happens later in the batch jobs when using fit$save_object().
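I could probably avoid that by saving only the draws I actually need instead of the whole fit object, e.g. (a sketch, with `theta` again a placeholder):

```r
# Instead of fit$save_object(file = "fit.rds"), which serializes the full
# fit object (all draws included), save only the subset I need:
sub <- cmdstanr:::read_cmdstan_csv(fit$output_files(), variables = c("theta"))
saveRDS(sub$post_warmup_draws, file = "theta_draws.rds")
```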
I prefer not to define these variables inside a local {} or in the model block, as I also need them later in generated quantities, although I could of course just recompute them there, as it probably doesn’t matter much in terms of overall computing time.
I have this dilemma often. Yes, it does not matter much in terms of time, but such duplicated code invites bugs and can rarely be refactored into a function. I’d love it if the language had a decorator you could apply to a variable to exclude its draws from the CSVs.
Yes, and I’m still interested in that feature (the most recent idea was that you would annotate the variable with @silent). If I remember correctly, the primary concern was that it interacts badly with things like standalone generated quantities: a model that has a silenced variable can’t have its results loaded back in for further processing by the same model. But that also seems like an obvious and “fair” tradeoff.
It would be great to have something like @silent in Stan code; I too find repeating code in multiple places just to avoid saving some auxiliary stuff annoying and “dangerous”.
However, disregarding the issue of extra variables in the output CSVs, I think there’s a simple solution for avoiding reading everything into R: just add a variables argument to as_cmdstan_fit() and pass it through to read_cmdstan_csv(), which already accepts a variables argument?
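Something like this (a sketch of the proposed interface only; as far as I know as_cmdstan_fit() does not have a variables argument yet):

```r
# Proposed, not existing: a variables argument on as_cmdstan_fit()
# that is simply forwarded to read_cmdstan_csv()
fit <- as_cmdstan_fit(csv_files, variables = c("theta"))
```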
I’m going to try this out as soon as I get a chance; busy pre-Christmas though. Do we have alternative output formats to CSV? It hasn’t been a limiting factor for me in the past, but it is such a storage-inefficient format.
Unfortunately we’re still only using CSV. There have been various proposals for other formats (a change that would need to happen in CmdStan itself), but as far as I know we haven’t had a developer take on that project yet.