Report: CmdStan Binary Output Format
Fork with code: GitHub - sims1253/cmdstan: CmdStan, the command line interface to Stan
Independent invention: GitHub - WardBrian/cmdstan at basic-binary-output, which is basically the same design as yours, though I merged the metadata and output into one file with a header.
What I did not do was any actual timing to check whether this was worth it, so it's nice that you did!
We’ve previously had design docs on using a preexisting format, but those have usually required a lot more design work across the code base to integrate. Something simple like what we’ve both done is obviously more achievable, but I am a bit nervous about essentially starting to define our own data format with no other readers.
I love this! I was trying to get this going: GitHub - beve-org/beve: High performance, tagged binary data specification. Glad to see work on this progressing.
Interesting format. I can’t find an R reader, and while having typed information would be a plus for a lot of reasons, it would also mean that we would need to do the refactors that have been blocking e.g. apache arrow. Dumping everything as doubles like was done here is already essentially supported out of the box, even if it does have some obvious drawbacks.
Yeah, I went with the presented version because of how easy and light it was on cmdstan. There’s a header-only lib for beve (glaze) that I’ll check out, but it’s still big. Arrow/parquet is going to be a fun challenge as well. Might take me a few days to get through it, but I’ll leave an update here when I get something new.
Both Arrow (used to be called Feather) and Parquet file formats support including meta-data in the same file, and e.g. Python and R packages can read that metadata. These are industry standards and thus well supported. Arrow format is simpler and supposed to be the faster and probably is sufficient for Stan purposes. Both formats support typed data, but even if dumping everything as doubles, it could be possible to include type as column metadata.
I think the problem there is the potentially quite heavy arrow dependency for cmdstan. But I’ll try it out anyway :)
This is an interesting development if you are looking for a lightweight Arrow implementation.
Very cool, thanks for working on this!
How do I make more people do things like this? ;)
Exactly. It’s not at all clear that a generic I/O system like Arrow would be faster, more robust in any situation we care about, or easier to code against than just writing the two or three things we need manually.
Edit: I meant to add that I’m just going with the manual binary encoding for the Walnuts implementation I’m working on. I’m following the lightweight standalone interface approach of Nutpie and tinystan.
And if we (or I suppose “you” would be more accurate) roll our own binary format and document it properly, it’s super easy to write readers for it in any language anyway.
I won’t tell you how to nerdsnipe me…
I did some additional reading and I now understand the dependency complexity. I’m now also in favor of a simple binary file format.
I use glaze for another project and it is great. Sadly it is C++23-only, which would be too high a C++ version requirement for Stan.
I use parquet for Stan samples a lot. A columnar format is very important for huge models, as it lets you quickly compute posterior summaries from only the parameters you need. I am not sure a columnar format supports streaming writes, though, so a post-processing step might be inevitable.
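To illustrate the columnar point with a toy sketch (plain Python, not actual Parquet): with a column-major layout, all draws for one parameter are a single contiguous read, while a row-major layout forces a strided gather.

```python
import struct

# Toy sketch: the same draws stored row-major vs column-major.
draws = [[1.0, 2.0, 3.0],   # draw 0: parameters a, b, c
         [4.0, 5.0, 6.0]]   # draw 1
n_draws, n_params = len(draws), len(draws[0])

row_major = b"".join(struct.pack("<3d", *row) for row in draws)
col_major = b"".join(
    struct.pack(f"<{n_draws}d", *(draws[i][j] for i in range(n_draws)))
    for j in range(n_params)
)

# Reading parameter b (column 1) from the column-major layout is one
# contiguous slice; in the row-major layout its values are 24 bytes apart.
j = 1
start = j * n_draws * 8
col_b = struct.unpack(f"<{n_draws}d", col_major[start:start + n_draws * 8])
# col_b == (2.0, 5.0)
```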
And here is the update: CmdStan Binary Output Formats
I added a second version of the binary format that is now a single file. And also parquet, feather, arrow and beve.
It looks very much like rolling your own binary format wins in terms of being light and fast. The only drawback compared to CSV is that it is no longer human-readable. All of the other options have rather heavy dependencies and don’t really offer benefits from the perspective of just writing and reading the draws to and from disk fast.
This is great! Thank you for putting this together!
Can you put a table with one sentence about what each format/reader are (and a link to a repo, if it exists)?
@WardBrian how do we deal with complex numbers today in the CSV? Is this something we want a binary format to natively support?
Right now it gets output as two columns, something like foo.real,foo.imag
It would be great if an output format supported this (or any form of typed output, for that matter, e.g. integers separated from doubles, or arrays actually having all their data stored together), but that would certainly require extra machinery that the current I/O code lacks.
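For downstream readers, reconstructing complex parameters from the two-column encoding is at least straightforward. A Python sketch with made-up column names and values:

```python
# Sketch: pairing the foo.real / foo.imag CSV columns back into complex
# values downstream. Column names and numbers here are made up.
header = ["lp__", "foo.real", "foo.imag"]
row = [-5.2, 1.5, -0.25]

values = dict(zip(header, row))
complex_params = {
    name[: -len(".real")]: complex(values[name],
                                   values[name[: -len(".real")] + ".imag"])
    for name in header if name.endswith(".real")
}
# complex_params == {"foo": (1.5-0.25j)}
```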
I updated the report with a table. Hope that makes it clear.
I don’t know enough about binary formats to add anything here but I just want to say a big thank you for work on this. Hard drives everywhere are grateful.
And another tiny update: I played around with moving the writing to its own thread and also added compression just for fun.
Model: 10,007 parameters, 4,000 samples (306 MB uncompressed)
| Configuration | Time | File Size | Notes |
|---|---|---|---|
| Sync (baseline) | 20.28s | 306M | Single-threaded write |
| Async, no batching | 19.73s | 306M | Ring buffer, write per sample |
| Async, batch=64 | 20.25s | 306M | Batch 64 rows, then write |
| Async, batch=256 | 20.21s | 306M | Batch 256 rows, then write |
| Async, batch=1024 | 20.45s | 306M | Batch 1024 rows, then write |
| Async, batch=ALL | 20.93s | 306M | Buffer everything, write at end |
| Async + ZSTD, batch=64 | 20.30s | 293M | Compress every 64 rows |
| Async + ZSTD, batch=256 | 20.31s | 293M | Compress every 256 rows |
| Async + ZSTD, batch=ALL | 21.44s | 293M | Compress everything at end |
Looks like for large models, sampling completely dominates the run time, and since the samples are essentially random draws they are hard to compress. The only real wins would come from the non-sample columns, but the row-major format puts those far apart from each other, and buffering the entire result before writing (which would theoretically give the best compression) requires more memory again. For small models it’s not worth it in general. I haven’t looked at read times, but I assume they won’t be significantly different either.
I guess you could argue that compression is almost free in terms of time, so why not add it? But that adds a zstd dependency to cmdstan for a tiny improvement in required storage.
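The hard-to-compress claim is easy to sanity-check. A Python sketch using stdlib zlib in place of the zstd used in the benchmarks: the mantissa bits of random draws are essentially noise, so a byte-level compressor only finds the predictable sign/exponent bits.

```python
import random
import struct
import zlib

# Sanity check with stdlib zlib standing in for zstd: pack "draws" of
# uniform random doubles and see how little they compress.
random.seed(0)
draws = struct.pack("<10000d", *(random.random() for _ in range(10000)))
compressed = zlib.compress(draws, 9)
ratio = len(compressed) / len(draws)
# ratio stays close to 1: only the predictable sign/exponent bits compress
```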