Support for caching/reusing fits in CmdStan / core Stan

Working on the SBC package, I realized that some support for storing completed fits, to avoid recomputing them when the data/model does not change, would be useful. I've already implemented some form of fit caching on multiple occasions for my own scripts, and recently I also improved the way caching/storing fits works in brms. So before implementing a caching mechanism once again in R, I thought that this might be a feature worth having either in the interfaces or even in core Stan itself (to make the implementation in the interfaces easier).

This is something I actually care about a bit, so I'd be willing to put some effort into drafting a design doc and implementing it. But before that, I'd like to get some less formal feedback on the broad outlines of the idea.

I imagine that for CmdStan this could be done quite easily:

  1. A hash of the input data and a hash of the model code would be added to the CSV output header.
  2. A new switch (e.g. cache=yes) would be introduced. If set, the program would check whether the output file exists and, if so, whether the model and data hashes as well as all algorithm parameters (likely except the seed) match what is stored. If everything matches, the program terminates immediately.

I imagine that code to compute the hash of the input data could be introduced into Stan core.
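To make this concrete, here is a rough sketch (in Python, purely illustrative; the `model_hash`/`data_hash` header fields are hypothetical and do not exist in CmdStan today) of the kind of check a cache=yes switch could perform:

```python
# Hypothetical sketch of the proposed cache check; the header keys
# "model_hash" and "data_hash" are assumptions, not existing CmdStan output.
import hashlib
from pathlib import Path

def file_hash(path):
    """SHA-256 of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def cached_fit_is_valid(output_csv, model_file, data_file):
    """Return True if output_csv exists and its (hypothetical) header
    hashes match the current model and data files."""
    out = Path(output_csv)
    if not out.exists():
        return False
    header = {}
    with out.open() as f:
        for line in f:
            if not line.startswith("#"):
                break                      # end of comment header
            if "=" in line:
                key, _, value = line.lstrip("# ").partition("=")
                header[key.strip()] = value.strip()
    return (header.get("model_hash") == file_hash(model_file)
            and header.get("data_hash") == file_hash(data_file))

# Usage: only rerun sampling when the cached fit is stale.
# if not cached_fit_is_valid("output.csv", "model.stan", "data.json"):
#     run_cmdstan(...)
```

A real implementation would also compare the stored algorithm parameters, as described above, but the structure of the check would be the same.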

Tagging @mitzimorris, @rok_cesnovar, @ariddell, @ahartikainen and @jonah as interface devs.


Also tagging @mike-lawrence since I know he's played around with the stantargets package a bit, as well as with checking the generated C++ to recompile only on substantive changes to the code.


I think that this kind of functionality should remain outside of core Stan and core CmdStan. If you want to build a Stan IDE, by all means, do so. CmdStan is very easy to wrap.


httpstan handles caching correctly.

In aria I cache the model exes. I had data caching as well but removed it, as it induced more complexity than I felt was useful given how fast the translation of data to JSON is.
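The general idea of exe caching is roughly the following (a hypothetical Python sketch of the pattern, not aria's actual code; the `.hash` stamp file and the make invocation are assumptions):

```python
# Hypothetical sketch: rebuild the model executable only when the
# source hash changes; otherwise reuse the cached exe.
import hashlib
import subprocess
from pathlib import Path

def build_if_stale(stan_file, exe_file):
    src_hash = hashlib.sha256(Path(stan_file).read_bytes()).hexdigest()
    stamp = Path(exe_file).with_suffix(".hash")   # hash stored beside the exe
    if Path(exe_file).exists() and stamp.exists() and stamp.read_text() == src_hash:
        return exe_file                           # cached exe is still valid
    # e.g. CmdStan's make target, run from the CmdStan directory
    subprocess.run(["make", exe_file], check=True)
    stamp.write_text(src_hash)
    return exe_file
```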


For SBC generally I’ve come to the conclusion that data generation should be deterministic with a saved seed but shouldn’t bother being cached. Putting generate-fit-summarize all in one function is what I recommend.
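Something like this (a rough Python sketch; `simulate_data` and the summary step are placeholders, and the `sample()`/`summary()` calls assume a cmdstanpy-style model object):

```python
# Hypothetical sketch of the "generate-fit-summarize in one function" pattern:
# data generation is deterministic via a saved seed, but nothing is cached.
import numpy as np

def generate_fit_summarize(model, simulate_data, seed):
    """`simulate_data` is a user-supplied simulator returning (data, true_pars);
    `model` is assumed to expose sample(data=..., seed=...), as cmdstanpy's
    CmdStanModel does."""
    rng = np.random.default_rng(seed)            # deterministic data generation
    data, true_pars = simulate_data(rng)
    fit = model.sample(data=data, seed=seed)     # fit with the same saved seed
    return {"seed": seed,
            "true_pars": true_pars,
            "summary": fit.summary()}            # summarize immediately
```

Rerunning with the same seed reproduces the whole generate-fit-summarize step, which removes the need to store intermediate fits at all.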


I agree that avoiding bloat is an important consideration, and it is quite possible that my proposal is not a reasonable tradeoff between added complexity and improved functionality. I'll still try to make a slightly stronger case, but I will definitely let @mitzimorris and others who know the project better have the last word. The bonus of having this closer to the core is that it is easier to implement there, as CmdStan has all the information (e.g. default values of algorithm parameters) that the wrappers don't necessarily have. My experience is that this type of functionality is quite broadly deemed useful. My main motivation is that it would let us easily avoid a bunch of problems that arise with the way caching is currently done in brms (which has a lot of users), but I agree that this is far from the only consideration.

To be a bit more constructive: If I were to implement caching in a wrapper or additional tool, the biggest obstacle would be that I would need a place to store the additional info together with the output .CSV files, making the implementation more complex and fragile. However, minimal support on CmdStan's part for such a tool (and probably for some other future tooling improvements) would be an option to add a custom string (or just a tag/value pair) to the CSV header. This way a wrapper could store in the .CSV file any metadata it needs for improved tooling and avoid the double-storage problem. Would that, in your view, be a sensible extension?

I also presume that you would not think it is sensible to support caching at the level of cmdstanr for similar reasons?


I wish I had a more constructive way of saying no, your proposal is really not reasonable or necessary. Your problem is that you want to store more information about the run. Your solution is to add more to the Stan output CSV header.

The problem is that this is a complete abuse of the CSV format, and the Stan output CSV has already more than abused the CSV format by sticking the adaptation information after the CSV header line. The solution to your problem and a whole bunch of other problems is better I/O, in this case, better O.

I strongly object to putting band-aids' worth of code here and there in the core Stan code base. It's bloat and a burden. Based on what others have said, the information you want isn't that hard to manage yourself.


Makes sense. I agree that it's probably better that I stick to custom code where I need it and wait for the I/O refactor, or perhaps not do it at all, since the small convenience of having something like that in core might not be worth the potential burden.
