Proof of concept: Binary output format for cmdstan

I appreciate the optimism but this discussion has been around longer than you realize. I’m all for getting it to happen for what it’s worth.

Would a binary format still be standard enough to support partial reads or streaming access?

This is something I’ve been interested in for a while, both for handling very large draw files (even after splitting computation into a fit step and then separately computing generated quantities) and for more gracefully handling constrained variable types. One concrete pain point for me is sum_to_zero_vector: I currently have to switch back and forth between that in the model and a plain vector for the same parameter when calculating generated quantities.

In practice, the draw files I work with have been large enough that even on machines with ~256GB of RAM, summarizing them naively isn’t feasible. I’ve ended up writing C++ code that streams variables from disk and computes means, SDs, etc. on the fly to avoid loading everything into memory. A binary format that supports efficient partial reads or memory mapping could make this much cleaner.

Is this due to precision issue? These would more or less automatically go away with a binary format, but in the mean time you could also consider defining STAN_MATH_CONSTRAINT_TOLERANCE to some larger value at compile time

I think part of the conflict here is that the samples come out row-wise so for a column-major format you either have to cache the samples before writing them or do a 2-pass save. I’ll spend some time thinking on this. Currently preparing the first draft of Proposal and PRs to have something concrete to discuss. It should be possible to get all that into a binary format (after all, we can do whatever we want in there) but finding an elegant solution will take me a moment.

Doing column-wise row-chunks works well and that’s how arrow deals with that. It plays nicely with vector instructions, etc…

Yes.

the mean time you could also consider defining STAN_MATH_CONSTRAINT_TOLERANCE to some larger value at compile time

Thanks, I’ll look into it! It seems that for use with cmdstanr:::generate_quantities, the constraint could just be disabled since the fit had already required whatever the constraint was, and we’re just passing the draws back in for posterior calculations? If this is too tangental, feel free to move to another discussion.

I think we should do the simplest thing for now. This can be a place we optimize later. We could even have a little utility that transposes the data if we think that would help users.

@ssp3nc3r which Stan version you are using? A couple releases ago, the default number of digits saved to CSV was increased to reduce the probability of constraint errors in generated quantities. I’m curious whether you still have these problems with the latest release? Also instead of changing STAN_MATH_CONSTRAINT_TOLERANCE at compile time, you can increase the number of saved digits in csv when calling CmdStan. For example, with CmdStanR use option sig_figs

I’m using 2.37 and have the issue, and before the significant default digits increased from 6 (I think), I manually set them to 8 but it didn’t help. These vectors are up to >100K so I assume it becomes really hard to ensure an exact sum when all those lose digits.

@scholz are you planning on turning this into a design document? I’m happy to help (or write it, if you aren’t able or inclined)

I got kinda swept away by life and forgot, thanks for reminding me. I have the design proposal like 80% done. Should be able to push it over the weekend.

Add binary output format RFC by sims1253 · Pull Request #58 · stan-dev/design-docs · GitHub for those that follow here :)

Strictly speaking that’s not true, I’ve previously shared an implementation of cmdstan here that created a writer using arrow. It works fine without changes beyond cmdstan.

My main point is that the previous discussions are worth reading.

I’m not sure how relevant this is to the cmdstan discussion, but if you are interested in some prior work, nutpie uses zarr to store large traces directly to disk in a binary format, so that arviz can read the results in right away without having to fit everything into memory.

For in-memory representation it uses arrow, which has been a pleasure to work with.

If there was a bit of a standard for different samplers, it would be much easier on downstream libraries like arviz.

I am very familiar with the previous discussions. It would certainly be possible to fit Arrow into the existing callbacks interfaces, but it would inherently be missing some desirable features or face unnecessary implementation difficulty to do so, because the writer interface as it is currently designed destroys useful information.

If you wanted, for example, the fact that all the scalars inside a matrix[3,4] a parameter are related, you would need to be re-creating that on the fly based on the string names out of the columns of the writer. Possible, but annoying, and definitely sub-optimal compared to a writer designed to keep this information around, since the model itself had it in the right shape mere moments before passing it to the writer to begin with.

Even worse, ff you wanted to store an int b as an integer, you’re just out of luck, the current interface completely obscures this information and only provides the value as a double.

I think that it would be surprising to provide Arrow files that don’t contain this kind of organization, whereas a format that is basically just a binary table doesn’t falsely imply the output is better behaved than it is, so these warts in the writer interface are less crucial (to me, anyway)

The R interfaces parse those missing details from the column names and my arrow writer did the same. Arrow lets you construct a scheme once so it’s not even a performance issue (and faster than dumping things to text anyway). Those aren’t unusual challenges for any writer implementation in Stan.

Yes, my comment mentioned this explicitly. I believe your Arrow implementation was before we had tuples in the language, which vastly complicates the kinds of reshaping needed. To my knowledge, our R interfaces still don’t get it right. Even when you do, it would be unnecessarily slow to do it this way.

The fact that a variable is an integer is not recoverable from the name alone

I may have missed discussion of this elsewhere or a bug report. Are we not handling tuples correctly in cmdstanr? If not I would ideally like to fix that before we do the major release that I’ve been working on.

Does cmdstanr provide a extract()-like function, or does it just defer to posterior? posterior has pretty lacking tuple support (and complex number support, for what it’s worth). Ideally you’d be able to ask for a tuple variable and get it in whatever the most natural shape is in R (I’m guessing a list?)

For extract()-like behavior we outsource it to posterior, so I guess this is more of a posterior issue than a cmdstanr issue. And @avehtari recently added a very useful function to posterior that is even more extract()-like than we previously had in posterior. @avehtari have you ever tried it with tuples?