Support for caching/reusing fits in CmdStan / core Stan

Working on the SBC package, I realized that some support for storing completed fits, to avoid recomputing them when the data/model does not change, would be useful. I've already implemented some form of fit caching on multiple occasions for my own scripts, and recently I also improved the way caching/storing fits works in brms. So before implementing a caching mechanism once again in R, I thought that this might be a feature worth having either in the interfaces or even in core Stan itself (to make the implementation in the interfaces easier).

This is something I actually care about a bit, so I'd be willing to put some effort into drafting a design doc and implementing it. But before that, I'd like to get some less formal feedback on the broad outlines of the idea.

I imagine that for CmdStan this could be done quite easily:

  1. A hash of the input data and a hash of the model code would be added to the CSV output header.
  2. A new switch (e.g. cache=yes) would be introduced. If set, the program would check whether the output file exists and, if so, whether the model and data hashes as well as all algorithm parameters (likely except the seed) match what is stored. If everything matches, the program terminates immediately.

I imagine that code to compute the hash of the input data could be introduced into Stan core.
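To make this concrete, here is a rough sketch (in Python, purely illustrative; the `model_hash`/`data_hash` header fields are hypothetical and do not exist in CmdStan today) of the kind of check a cache=yes switch could perform:

```python
# Hypothetical sketch of the proposed cache check; the header keys
# "model_hash" and "data_hash" are assumptions, not existing CmdStan output.
import hashlib
from pathlib import Path

def file_hash(path):
    """SHA-256 of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def cached_fit_is_valid(output_csv, model_file, data_file):
    """Return True if output_csv exists and its (hypothetical) header
    hashes match the current model and data files."""
    out = Path(output_csv)
    if not out.exists():
        return False
    header = {}
    with out.open() as f:
        for line in f:
            if not line.startswith("#"):
                break                      # end of comment header
            if "=" in line:
                key, _, value = line.lstrip("# ").partition("=")
                header[key.strip()] = value.strip()
    return (header.get("model_hash") == file_hash(model_file)
            and header.get("data_hash") == file_hash(data_file))

# Usage: only rerun sampling when the cached fit is stale.
# if not cached_fit_is_valid("output.csv", "model.stan", "data.json"):
#     run_cmdstan(...)
```

A real implementation would also compare the stored algorithm parameters, as described above, but the structure of the check would be the same.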

Tagging @mitzimorris, @rok_cesnovar, @ariddell, @ahartikainen and @jonah as interface devs.


Also tagging @mike-lawrence since I know he's played around with the stantargets package a bit, as well as with checking the generated C++ to recompile only on substantive changes to the code.


I think that this kind of functionality should remain outside of core Stan and core CmdStan. If you want to build a Stan IDE, by all means, do so. CmdStan is very easy to wrap.


httpstan handles caching correctly.

In aria I cache the model exes. I had data caching as well but removed it, as it induced more complexity than I felt was useful given how fast the translation of data to JSON is.
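The general idea of exe caching is roughly the following (a hypothetical Python sketch of the pattern, not aria's actual code; the `.hash` stamp file and the make invocation are assumptions):

```python
# Hypothetical sketch: rebuild the model executable only when the
# source hash changes; otherwise reuse the cached exe.
import hashlib
import subprocess
from pathlib import Path

def build_if_stale(stan_file, exe_file):
    src_hash = hashlib.sha256(Path(stan_file).read_bytes()).hexdigest()
    stamp = Path(exe_file).with_suffix(".hash")   # hash stored beside the exe
    if Path(exe_file).exists() and stamp.exists() and stamp.read_text() == src_hash:
        return exe_file                           # cached exe is still valid
    # e.g. CmdStan's make target, run from the CmdStan directory
    subprocess.run(["make", exe_file], check=True)
    stamp.write_text(src_hash)
    return exe_file
```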


For SBC generally I’ve come to the conclusion that data generation should be deterministic with a saved seed but shouldn’t bother being cached. Putting generate-fit-summarize all in one function is what I recommend.
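Something like this (a rough Python sketch; `simulate_data` and the summary step are placeholders, and the `sample()`/`summary()` calls assume a cmdstanpy-style model object):

```python
# Hypothetical sketch of the "generate-fit-summarize in one function" pattern:
# data generation is deterministic via a saved seed, but nothing is cached.
import numpy as np

def generate_fit_summarize(model, simulate_data, seed):
    """`simulate_data` is a user-supplied simulator returning (data, true_pars);
    `model` is assumed to expose sample(data=..., seed=...), as cmdstanpy's
    CmdStanModel does."""
    rng = np.random.default_rng(seed)            # deterministic data generation
    data, true_pars = simulate_data(rng)
    fit = model.sample(data=data, seed=seed)     # fit with the same saved seed
    return {"seed": seed,
            "true_pars": true_pars,
            "summary": fit.summary()}            # summarize immediately
```

Rerunning with the same seed reproduces the whole generate-fit-summarize step, which removes the need to store intermediate fits at all.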


I agree that avoiding bloat is an important consideration, and it is quite possible that my proposal is not a reasonable tradeoff between added complexity and improved functionality. I'll still try to make a slightly stronger case, but I will definitely let @mitzimorris and others who know the project better have the last word. The bonus of having this closer to the core is that it is easier to implement there, as CmdStan has all the information (e.g. default values of algorithm parameters) that the wrappers don't necessarily have. My experience is that this type of functionality is quite broadly deemed useful. My main motivation is that it would let us easily avoid a bunch of problems that arise with the way caching is currently done in brms (which has a lot of users), but I agree that this is far from the only consideration.

To be a bit more constructive: If I were to implement caching in a wrapper or additional tool, the biggest obstacle would be that I would need a place to store the additional info together with the output .CSV files, making the implementation more complex and fragile. However, minimal support on CmdStan's part for such a tool (and probably for some other future tooling improvements) would be an option to add a custom string (or just a tag/value pair) to the CSV header. This way a wrapper could store in the .CSV file any metadata it needs for improved tooling and avoid the double-storage problem. Would that, in your view, be a sensible extension?

I also presume that you would not think it is sensible to support caching at the level of cmdstanr for similar reasons?


I wish I had a more constructive way of saying no, your proposal is really not reasonable or necessary. Your problem is that you want to store more information about the run. Your solution is to add more to the Stan output CSV header.

The problem is that this is a complete abuse of the CSV format, and the Stan output CSV has already more than abused the CSV format by sticking the adaptation information after the CSV header line. The solution to your problem and a whole bunch of other problems is better I/O, in this case, better O.

I strongly object to putting band-aids' worth of code here and there in the core Stan code base. It's bloat and a burden. Based on what others have said, the information you want isn't that hard to manage yourself.


Makes sense. I agree that it's probably better that I stick to custom code where I need it and wait for the I/O refactor, or perhaps not do it at all, since the small convenience of having something like that in core might not be worth the potential burden.
