Proof of concept: Binary output format for cmdstan

Compressed files also lose the ability to be easily mmap’d into arrays in the downstream language.
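For example, an uncompressed flat file of doubles maps straight into an array (a minimal sketch, assuming a hypothetical headerless row-major float64 layout; the file name and shape are made up):

import numpy as np

# Hypothetical layout: flat float64 draws, n_draws x n_params,
# row-major, no header. "output.bin" and the shape are made up.
draws = np.memmap("output.bin", dtype=np.float64, mode="r",
                  shape=(4000, 10))
col = draws[:, 0]  # pages are only read from disk when a slice is touched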


Yeah, I don’t think there is any relevant upside to it. I was just curious and wanted to try it :)


Not saying that this would be in “any” way practically useful, but surely it should be possible to get better compression than ~4%? 🤔 Also, not trying to nerdsnipe anyone!

It’s hard because the draws are kinda random, and they are by far most of the data. Funnily enough, CSV is easier to compress.

I would like to add a couple of comments when thinking about binary formats:

  • read everything into memory
  • read only some of the parameters into memory
  • stream into memory
  • chunking + parallel reading

There are caveats to all of these, but I also recommend thinking about how these files will be used. E.g. having the option to read lazily will be beneficial for larger models with large sample counts.
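To illustrate, with a fixed-width uncompressed binary layout most of these access patterns fall out of a memory map almost for free (a sketch under an assumed row-major float64 layout; the file name, shape, and column indices are made up):

import numpy as np

# Assumed layout: row-major float64 draws, shape (n_draws, n_params).
n_draws, n_params = 100_000, 500
draws = np.memmap("draws.bin", dtype=np.float64, mode="r",
                  shape=(n_draws, n_params))

# Read only some of the parameters: just these columns are copied out.
theta = np.asarray(draws[:, [3, 7]])

# Chunking: process 10k draws at a time; chunks could go to a thread pool.
for start in range(0, n_draws, 10_000):
    chunk = np.asarray(draws[start:start + 10_000])
    # ... update running summaries from `chunk` ...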

  • compression vs. no compression

Applying compression is something that will benefit most users in the long run. I would not recommend compressing the whole file as one stream; compression with chunking should work better. Also, please check different compression methods: I have had data where zstd and other common compression methods offered (almost) no compression, but e.g. blosc/blosc2 could compress the data to 0.5-0.6 of its original size.
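A rough sketch of what chunked compression could look like, compared against compressing the whole buffer in one go (zlib stands in for a generic whole-file compressor; the chunk size is arbitrary and the ratios depend entirely on the data):

import zlib
import blosc2
import numpy as np

draws = np.random.randn(100_000, 10)
raw = draws.tobytes()

# Whole-buffer, general-purpose compression for comparison.
whole = zlib.compress(raw, level=9)

# Chunked compression: each chunk is independently decompressible,
# so a reader can pull out only the draws it needs.
chunk_rows = 10_000
chunks = [
    blosc2.compress(draws[i:i + chunk_rows].tobytes(), typesize=8,
                    clevel=9, filter=blosc2.Filter.BITSHUFFLE,
                    codec=blosc2.Codec.ZSTD)
    for i in range(0, draws.shape[0], chunk_rows)
]

print(len(whole) / len(raw), sum(map(len, chunks)) / len(raw))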

  • metadata

I would recommend having an option to include suitable metadata in the same file.

Then there is of course the question: should we keep the data in a table format (the current assumption is wide format), or could it exist as an nD structure (e.g. HDF5/NetCDF/Zarr)? For example, ArviZ keeps the data in an nD format with named dimensions and other information.
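For instance, an nD layout with named dimensions plus metadata in the same file could look like this in HDF5 (a sketch only, not a proposed schema; the dataset and attribute names are made up):

import h5py
import numpy as np

draws = np.random.randn(4, 1000, 10)  # (chain, draw, parameter)

with h5py.File("posterior.h5", "w") as f:
    dset = f.create_dataset("posterior/theta", data=draws)
    # Named dimensions, ArviZ-style, stored as a string attribute here.
    dset.attrs["dims"] = ["chain", "draw", "theta_dim_0"]
    # Metadata lives alongside the draws in the same file.
    f.attrs["stan_version"] = "x.yy.z"  # placeholder value
    f.attrs["seed"] = 12345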

On compression:

  • CSVs can be compressed even if the contents were random, since they use more bits than needed to represent the digits 0-9, the comma, and the decimal point. In addition, we used to save only 6 and now save only 8 significant digits, which makes the content less random (see the sketch after this list).
  • For doubles, most of the bits are random even when we have high autocorrelation or the leading digits of a variable are constant, so it’s unlikely that we can compress much.
  • The reason to use a binary format is not only to improve write speed. An important part is having the same behavior whether generated quantities are run at sampling time or afterwards: at sampling time, generated quantities use the double values in memory, but when run after sampling they use only the precision that was used to write the CSV. In addition, as Stan gains more kinds of constrained variable types, it becomes more likely that, even if most of the bits are random for practical purposes, we need high accuracy to keep good behavior for the constraints.
  • Thus, I would not spend time on compression.
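The sketch promised above: gzip the same random draws as CSV-style text and as raw IEEE doubles (the exact numbers will vary; the point is only the direction of the gap):

import gzip
import numpy as np

x = np.random.randn(100_000)

# CSV-style text at 8 significant digits vs. the raw 8-byte doubles.
as_text = ",".join(f"{v:.8g}" for v in x).encode()
as_binary = x.tobytes()

print(len(gzip.compress(as_text)) / len(as_text))
print(len(gzip.compress(as_binary)) / len(as_binary))
# The text shrinks a lot (each byte carries only ~3.3 bits of digit
# entropy), while the doubles' mantissa bits are close to random and
# barely budge.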

I’m with Aki re: compression. Just moving to binary gives ~30% smaller files than CSV, and most of the content is too random to be compressible. Compression would also add a dependency to both the writer and the reader, which takes away one of the benefits of a custom binary format.

I’d also argue for writing in row-major order, as that’s how the data arrive. If your model is so large that a column-major format would significantly benefit you, there’s always the option to postprocess once before analysis. But I don’t think cmdstan is the right layer to support that.
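Such a one-off postprocessing pass is cheap anyway (a sketch assuming the same hypothetical row-major float64 layout as above; the file names and shape are made up):

import numpy as np

n_draws, n_params = 100_000, 200
draws = np.memmap("draws_rowmajor.bin", dtype=np.float64,
                  mode="r", shape=(n_draws, n_params))

# One transpose pass: afterwards each parameter is contiguous on disk.
np.ascontiguousarray(draws.T).tofile("draws_colmajor.bin")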


As we’re already talking about compression/file size, I’d assume a good general solution would be a mode where CmdStan writes out only the unconstrained/sampler parameters in a binary format. That would give you bitwise reproducibility (disregarding RNG functions) and a smaller file size.

Though of course part of the appeal of CmdStan is that it writes out whatever the user specifies.

(I do not think that gzip-like compression is worth the drawbacks; I was mostly curious about what’s possible!)


I did some testing with double (float64) and int32:

import blosc2
import numpy as np

# 1e6 draws of 10 parameters: 80_000_000 bytes as float64
arr = np.random.randn(1_000_000, 10)
arr_bytes = arr.tobytes()
arr_compressed = blosc2.pack_array(
    arr,
    clevel=9,
    filter=blosc2.Filter.BITSHUFFLE,
    codec=blosc2.Codec.ZSTD,
)
print(len(arr_compressed) / len(arr_bytes))
# 0.8828244875

# 2e6 integers in [-1000, 1000): 8_000_000 bytes as int32
arr_int = np.random.randint(-1000, 1000, size=2_000_000).astype(np.int32)
arr_int_bytes = arr_int.tobytes()
arr_int_compressed = blosc2.pack_array(
    arr_int,
    clevel=9,
    filter=blosc2.Filter.BITSHUFFLE,
    codec=blosc2.Codec.ZSTD,
)
print(len(arr_int_compressed) / len(arr_int_bytes))
# 0.375211

So roughly 12% compression for doubles and 62% for integers.


There are only 2,000 possible distinct integers, so in your 2 million samples there will be a lot of duplicates. I would expect the compression rate to drop significantly when you increase the possible outcome space.
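That’s easy to check by widening the range (a sketch reusing the benchmark above; expect the ratio to creep toward 1.0 as the outcome space grows and more of the bits become effectively random):

import blosc2
import numpy as np

for hi in (1_000, 1_000_000, 1_000_000_000):
    arr = np.random.randint(-hi, hi, size=2_000_000).astype(np.int32)
    packed = blosc2.pack_array(
        arr,
        clevel=9,
        filter=blosc2.Filter.BITSHUFFLE,
        codec=blosc2.Codec.ZSTD,
    )
    print(hi, len(packed) / arr.nbytes)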


Can you send the code for these benchmarks? There is a lot of weird minutiae to concurrent stuff that can hide a lot of the gains you would normally see. Happy to take a look!

I also agree with this. Let’s get an MVP PR up, and then we can talk about extras like compression.


The first set is in the report, and the one I posted here was just a bunch of bash calls timed with `time`, using the respective arguments.