Cleaning up files generated by cmdstanpy

Hi - what is the suggested workflow for clearing the working directory of files generated by CmdStanModel.sample when using CmdStanPy? Are people storing these files in a temporary directory and then deleting it after sampling? Has an argument to sample been proposed that would clear the output automatically when sampling is complete?


I think the desired workflow depends on how much you want to forget that CmdStan is running in the background: if you want to dig into the CSV files later (e.g. calling stansummary), then it's good to keep them. If you want a fully Python workflow where you keep results in .npy files, then automatic cleanup could be useful.
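For the second case, a minimal sketch of what I mean (assuming `fit` is the CmdStanMCMC object returned by an earlier `sample` call; in recent CmdStanPy the draws are exposed as `fit.draws()`, in older versions as `fit.sample`):

```python
# Sketch of a "fully Python" workflow: keep a NumPy copy of the draws and
# let the CmdStan CSV files be discarded. The file name is illustrative.
import numpy as np

draws = fit.draws()              # draws x chains x columns array (fit.sample in older CmdStanPy)
np.save("fit_draws.npy", draws)  # persist only the in-memory copy
```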

At some point this was done by this interface, so it should not be too difficult to implement.


If no output directory is specified, then the CSV files plus the stdout and stderr files are all written to a tmp dir, which is deleted at the end of the Python session.
The save_csvfiles method will move just the CSV files from whatever directory they're currently in to a new directory.
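For example (a sketch; `model` is a compiled CmdStanModel, and the data file and target directory names are illustrative):

```python
# By default the sampler CSVs land in a per-session tmp dir; save_csvfiles
# moves just those CSVs to a directory that outlives the Python session.
fit = model.sample(data="bernoulli.data.json")  # CSVs written to the session tmp dir
fit.save_csvfiles(dir="saved_runs/run_01")      # relocate only the CSV files
```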

(I wonder what happens if you save them to /dev/null?)

If you run the sample method with output_dir specified, then you will get both the CSV files and the stdout (and maybe even stderr) files in the specified directory. Since it's not a tmp dir, the user can manage it any way they want to - should the interface provide cleanup commands? The hooks are there…
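And if a user managing their own output_dir does want an explicit cleanup step, it only takes a couple of lines (a sketch; paths are illustrative and `model` is a compiled CmdStanModel):

```python
# Write output to a user-managed directory, pull what you need into memory,
# then remove the directory by hand.
import shutil

fit = model.sample(data="bernoulli.data.json", output_dir="cmdstan_output")
draws = fit.draws()                                   # keep the draws in memory
shutil.rmtree("cmdstan_output", ignore_errors=True)   # then delete the directory
```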

@mitzimorris ok, that all makes sense to me. I don't think a cleanup command is necessary for users who specify their own directory. (I was a little worried that the temporary directory might not get cleaned up when working out of Jupyter notebooks, but I can confirm that the directory does get cleaned up once the notebook server is shut down.)

Another related question… It looks like the summary method for CmdStanMCMC relies on the file output generated; if I delete the files, the method errors. Was there a reason for this implementation (aside from mimicking CmdStan behavior)? Would it be better to a) precompute the summary stats and store them as an attribute, or b) have a method available that computes those stats from the CmdStanMCMC.sample attribute?

If we have samples in memory, we could save them to a temporary directory and run the stats on them. Then there is always the ArviZ summary (import arviz as az; az.summary(fit)).
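Spelled out a bit more (a sketch; `fit` is the CmdStanMCMC object, and going through az.from_cmdstanpy just makes the conversion explicit):

```python
# Compute summary statistics from the in-memory draws via ArviZ,
# independent of the CmdStan CSV files on disk.
import arviz as az

idata = az.from_cmdstanpy(posterior=fit)  # convert the fit to InferenceData
print(az.summary(idata))                  # mean, sd, hdi, ess_bulk, ess_tail, r_hat, ...
```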


There are a couple of related issues here:

  • how lightweight do we want the CmdStanX wrappers to be?

  • how to ensure that all CmdStanX wrappers use the same algorithm to compute the summary statistics?

The initial goal for CmdStanPy was modest enough - wrap calls to CmdStan; therefore the CmdStanMCMC stansummary method just wraps a call to CmdStan's stansummary utility. (Note: we should keep the resulting summary CSV file around so that the utility only needs to be run once.) In theory this keeps the Python session's memory footprint low, but of course it's just pushing bumps around under the rug - now the C++ code reads everything into memory. Is it faster, and does it require less memory? I have no idea.
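In current CmdStanPy that wrapper is the summary() method on the fit object (a sketch, assuming the sampler CSV files still exist on disk):

```python
# summary() shells out to CmdStan's stansummary binary and reads the
# resulting CSV back into a pandas DataFrame.
summary_df = fit.summary()     # runs bin/stansummary on the sampler CSVs
print(summary_df.loc["lp__"])  # e.g. inspect the log-posterior row
```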

I very much like @ahartikainen's suggestion that we hand off the draws (fka sample - Feature/263 cmdstanr harmonize by mitzimorris · Pull Request #277 · stan-dev/cmdstanpy · GitHub) to ArviZ, provided that we can standardize the summary statistics at the algorithm level for all CmdStanX interfaces.

Under the hood, CmdStan's stansummary utility uses a stan::mcmc::chains object, which has methods to compute all the summary statistics. It looks like @avehtari updated some of these calculations in 2018, and later @roualdes worked on ESS.

CmdStanR is using a different set of methods to compute the summary statistics.
If the current R implementation is the @avehtari et al. approved set of calculations, then we should have a language-agnostic spec describing these algorithms, plus a set of test cases so that we can cross-check the Python- and R-native computations.

Does this make sense? @jonah and I discussed this recently during our weekly CmdStanX calls - I would love to get everyone's thoughts and help.


cmdstanr is offloading this to the posterior package, which uses the MCMC convergence diagnostics based on

Vehtari, A., Gelman, A., Simpson, D., Carpenter, B., & Bürkner, P. C. (2020). Rank-normalization, folding, and localization: An improved Rhat for assessing convergence of MCMC. Bayesian Analysis.

If I understand correctly, the plan for the Stan R universe is to have this in posterior and reuse it everywhere we can, so that we don’t have to update a bunch of duplicated code in all the packages every time we need to update any of the diagnostics.
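For what it's worth, the same rank-normalized diagnostics are also available on the Python side through ArviZ, so a CmdStanPy workflow can compute them without going through stansummary (a sketch; `fit` is a CmdStanMCMC object):

```python
# Rank-normalized R-hat and ESS from Vehtari et al. (2020), as implemented
# in ArviZ; method="rank" is the default in recent ArviZ versions.
import arviz as az

idata = az.from_cmdstanpy(posterior=fit)
print(az.rhat(idata, method="rank"))  # rank-normalized split R-hat
print(az.ess(idata, method="bulk"))   # bulk ESS
```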


@mitzimorris yeah, that makes sense. Thanks for providing the context. If the design was to be a wrapper, then that's what it should be. After thinking about it more deeply, it's useful that things are not precomputed. Specifically, when running jobs in production (in industry) it's more appealing to strip away any unnecessary computation, and CmdStanPy facilitates that by relying on the user to execute methods like stansummary.