Proof of concept: Binary output format for cmdstan

I appreciate the optimism, but this discussion has been around longer than you realize. I’m all for getting it to happen, for what it’s worth.

Would a binary format still be standard enough to support partial reads or streaming access?

This is something I’ve been interested in for a while, both for handling very large draw files (even after splitting computation into a fit step and then separately computing generated quantities) and for more gracefully handling constrained variable types. One concrete pain point for me is sum_to_zero_vector: I currently have to switch back and forth between that in the model and a plain vector for the same parameter when calculating generated quantities.

In practice, the draw files I work with have been large enough that even on machines with ~256GB of RAM, summarizing them naively isn’t feasible. I’ve ended up writing C++ code that streams variables from disk and computes means, SDs, etc. on the fly to avoid loading everything into memory. A binary format that supports efficient partial reads or memory mapping could make this much cleaner.


Is this due to precision issues? These would more or less automatically go away with a binary format, but in the meantime you could also consider defining STAN_MATH_CONSTRAINT_TOLERANCE to some larger value at compile time.
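For reference, a compile-time define like this is typically added to CmdStan’s `make/local`; the exact value here is purely illustrative, and you’d want to check the CmdStan guide for your version:

```make
# make/local in the CmdStan directory; 1e-6 is an illustrative value only
CXXFLAGS += -DSTAN_MATH_CONSTRAINT_TOLERANCE=1e-6
```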

I think part of the conflict here is that the samples come out row-wise, so for a column-major format you either have to cache the samples before writing them or do a two-pass save. I’ll spend some time thinking on this. I’m currently preparing the first draft of the proposal and PRs so we have something concrete to discuss. It should be possible to get all that into a binary format (after all, we can do whatever we want in there), but finding an elegant solution will take me a moment.


Doing column-wise row-chunks works well; that’s how Arrow deals with this. It plays nicely with vector instructions, etc.


Yes.

> the mean time you could also consider defining STAN_MATH_CONSTRAINT_TOLERANCE to some larger value at compile time

Thanks, I’ll look into it! It seems that for use with cmdstanr:::generate_quantities, the constraint could just be disabled, since the fit already enforced whatever the constraint was and we’re just passing the draws back in for posterior calculations? If this is too tangential, feel free to move it to another discussion.

I think we should do the simplest thing for now. This can be a place we optimize later. We could even have a little utility that transposes the data if we think that would help users.


@ssp3nc3r which Stan version are you using? A couple of releases ago, the default number of digits saved to CSV was increased to reduce the probability of constraint errors in generated quantities. I’m curious whether you still have these problems with the latest release. Also, instead of changing STAN_MATH_CONSTRAINT_TOLERANCE at compile time, you can increase the number of saved digits in the CSV when calling CmdStan; for example, with CmdStanR use the option sig_figs.

I’m using 2.37 and still have the issue; before the default number of significant digits was increased from 6 (I think), I manually set it to 8, but that didn’t help. These vectors have up to >100K elements, so I assume it becomes really hard to ensure an exact sum when every element loses digits.

@scholz are you planning on turning this into a design document? I’m happy to help (or to write it, if you aren’t able or inclined).


I got kinda swept away by life and forgot; thanks for reminding me. The design proposal is about 80% done. I should be able to push it over the weekend.
