Report: CmdStan Binary Output Format
Fork with code: GitHub - sims1253/cmdstan: CmdStan, the command line interface to Stan
Independent invention: GitHub - WardBrian/cmdstan at basic-binary-output, which is basically the same design as yours, though I merged the metadata and output into one file with a header.
What I did not do was any actual timing to check whether this was worth it, so it's nice that you did!
We’ve previously had design docs on using a preexisting format, but those have usually required a lot more design work across the code base to integrate. Something simple like what we’ve both done is obviously more achievable, but I am a bit nervous about essentially starting to define our own data format with no other readers.
I love this! I was trying to get this going: GitHub - beve-org/beve: High performance, tagged binary data specification. Glad to see work on this progressing.
Interesting format. I can’t find an R reader, and while having typed information would be a plus for a lot of reasons, it would also mean that we would need to do the refactors that have been blocking e.g. apache arrow. Dumping everything as doubles like was done here is already essentially supported out of the box, even if it does have some obvious drawbacks.
Yeah, I went with the presented version because of how easy and light it was on cmdstan. There’s a header-only lib for beve (glaze) that I’ll check out, but it’s still big. Arrow/parquet is going to be a fun challenge as well. Might take me a few days to get through it, but I’ll leave an update here when I get something new.
Both Arrow (used to be called Feather) and Parquet file formats support including meta-data in the same file, and e.g. Python and R packages can read that metadata. These are industry standards and thus well supported. Arrow format is simpler and supposed to be the faster and probably is sufficient for Stan purposes. Both formats support typed data, but even if dumping everything as doubles, it could be possible to include type as column metadata.
I think the problem there is the potentially quite heavy arrow dependency for cmdstan. But I’ll try it out anyway :)
This is an interesting development if you are looking for a lightweight Arrow implementation.
Very cool, thanks for working on this!
How do I make more people do things like this? ;)
Exactly. It’s not at all clear that a generic I/O system like Arrow would be faster, more robust in any situation we care about, or easier to code against than just writing the two or three things we need manually.
Edit: I meant to add that I’m just going with the manual binary encoding for the Walnuts implementation I’m working on. I’m following the lightweight standalone interface approach of Nutpie and tinystan.
And if we (or I suppose “you” would be more accurate) roll our own binary format and document it properly, it’s super easy to write readers for it in any language anyway.
I won’t tell you how to nerdsnipe me…
I did some additional reading and I now understand the dependency complexity. I’m now also in favor of a simple binary file format.
I use glaze for another project and it is great. Sadly it is C++23-only, which would be too high a C++ version requirement for Stan.
I use parquet for Stan samples a lot. A columnar format is very important for huge models, as it lets you quickly compute posterior summaries from only the parameters you need. I am not sure a columnar format supports streaming writes, though, so a post-processing step might be inevitable.
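To illustrate the columnar point with a toy sketch (plain Python, not actual Parquet): with a column-major layout, all draws for one parameter are a single contiguous read, while a row-major layout forces a strided gather.

```python
import struct

# Toy sketch: the same draws stored row-major vs column-major.
draws = [[1.0, 2.0, 3.0],   # draw 0: parameters a, b, c
         [4.0, 5.0, 6.0]]   # draw 1
n_draws, n_params = len(draws), len(draws[0])

row_major = b"".join(struct.pack("<3d", *row) for row in draws)
col_major = b"".join(
    struct.pack(f"<{n_draws}d", *(draws[i][j] for i in range(n_draws)))
    for j in range(n_params)
)

# Reading parameter b (column 1) from the column-major layout is one
# contiguous slice; in the row-major layout its values are 24 bytes apart.
j = 1
start = j * n_draws * 8
col_b = struct.unpack(f"<{n_draws}d", col_major[start:start + n_draws * 8])
# col_b == (2.0, 5.0)
```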
And here is the update: CmdStan Binary Output Formats
I added a second version of the binary format that is now a single file. And also parquet, feather, arrow and beve.
It looks very much like rolling your own binary format wins in terms of being light and fast. The only drawback compared to CSV is that it is no longer human-readable. All of the other options have rather heavy dependencies and don’t really offer benefits from the perspective of just writing and reading the draws to and from disk fast.
This is great! Thank you for putting this together!
Can you put a table with one sentence about what each format/reader are (and a link to a repo, if it exists)?
@WardBrian how do we deal with complex numbers today in the CSV? Is this something we want a binary format to natively support?
Right now it gets output as two columns, something like foo.real,foo.imag
It would be great if an output format supported this (or any form of typed output, for that matter, e.g. integers separated from doubles, or arrays actually having all their data stored together), but that would certainly require extra machinery that the current I/O code lacks.
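For downstream readers, reconstructing complex parameters from the two-column encoding is at least straightforward. A Python sketch with made-up column names and values:

```python
# Sketch: pairing the foo.real / foo.imag CSV columns back into complex
# values downstream. Column names and numbers here are made up.
header = ["lp__", "foo.real", "foo.imag"]
row = [-5.2, 1.5, -0.25]

values = dict(zip(header, row))
complex_params = {
    name[: -len(".real")]: complex(values[name],
                                   values[name[: -len(".real")] + ".imag"])
    for name in header if name.endswith(".real")
}
# complex_params == {"foo": (1.5-0.25j)}
```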
I updated the report with a table. Hope that makes it clear.
I don’t know enough about binary formats to add anything here but I just want to say a big thank you for work on this. Hard drives everywhere are grateful.
And another tiny update: I played around with moving the writing to its own thread and also added compression just for fun.
Model: 10,007 parameters, 4,000 samples (306 MB uncompressed)
| Configuration | Time | File Size | Notes |
|---|---|---|---|
| Sync (baseline) | 20.28s | 306M | Single-threaded write |
| Async, no batching | 19.73s | 306M | Ring buffer, write per sample |
| Async, batch=64 | 20.25s | 306M | Batch 64 rows, then write |
| Async, batch=256 | 20.21s | 306M | Batch 256 rows, then write |
| Async, batch=1024 | 20.45s | 306M | Batch 1024 rows, then write |
| Async, batch=ALL | 20.93s | 306M | Buffer everything, write at end |
| Async + ZSTD, batch=64 | 20.30s | 293M | Compress every 64 rows |
| Async + ZSTD, batch=256 | 20.31s | 293M | Compress every 256 rows |
| Async + ZSTD, batch=ALL | 21.44s | 293M | Compress everything at end |
Looks like for large models, sampling completely dominates the run time, and since the samples are essentially random draws they are hard to compress. The only real wins would come from the non-sample columns, but the row-major format puts those far apart from each other, and buffering the entire result before writing (which would theoretically give the best compression) requires more memory again. For small models it’s not worth it in general. I haven’t looked at read times, but I assume they won’t be significantly different either.
I guess you could argue that compression is almost free in terms of time, so why not add it? But that adds a zstd dependency to cmdstan for a tiny improvement in required storage.
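The hard-to-compress claim is easy to sanity-check. A Python sketch using stdlib zlib in place of the zstd used in the benchmarks: the mantissa bits of random draws are essentially noise, so a byte-level compressor only finds the predictable sign/exponent bits.

```python
import random
import struct
import zlib

# Sanity check with stdlib zlib standing in for zstd: pack "draws" of
# uniform random doubles and see how little they compress.
random.seed(0)
draws = struct.pack("<10000d", *(random.random() for _ in range(10000)))
compressed = zlib.compress(draws, 9)
ratio = len(compressed) / len(draws)
# ratio stays close to 1: only the predictable sign/exponent bits compress
```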