Proof of concept: Binary output format for cmdstan

Compressed files also lose the ability to be easily mmap’d into arrays in the downstream language.
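For example, an uncompressed flat file of doubles maps straight into an array (a minimal sketch, assuming a hypothetical headerless row-major float64 layout; the file name and shape are made up):

import numpy as np

# Hypothetical layout: flat float64 draws, n_draws x n_params,
# row-major, no header. "output.bin" and the shape are made up.
draws = np.memmap("output.bin", dtype=np.float64, mode="r",
                  shape=(4000, 10))
col = draws[:, 0]  # pages are only read from disk when a slice is touched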


Yeah, I don’t think there is any relevant upside to it. I was just curious and wanted to try it :)


Not saying that this would be in “any” way practically useful, but surely it should be possible to get better compression than ~4%? 🤔 Also, not trying to nerdsnipe anyone!

It’s hard because the draws are kinda random, and they are by far most of the data. Funnily enough, CSV is easier to compress.

I would like to add a couple of comments when thinking about binary formats:

  • read everything into memory
  • read only some of the parameters into memory
  • stream into memory
  • chunking + parallel reading

There are caveats to all of these, but I also recommend thinking about how these files will be used. E.g. having the option to read lazily will be beneficial for larger models with large sample counts.
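To illustrate, with a fixed-width uncompressed binary layout most of these access patterns fall out of a memory map almost for free (a sketch under an assumed row-major float64 layout; the file name, shape, and column indices are made up):

import numpy as np

# Assumed layout: row-major float64 draws, shape (n_draws, n_params).
n_draws, n_params = 100_000, 500
draws = np.memmap("draws.bin", dtype=np.float64, mode="r",
                  shape=(n_draws, n_params))

# Read only some of the parameters: just these columns are copied out.
theta = np.asarray(draws[:, [3, 7]])

# Chunking: process 10k draws at a time; chunks could go to a thread pool.
for start in range(0, n_draws, 10_000):
    chunk = np.asarray(draws[start:start + 10_000])
    # ... update running summaries from `chunk` ...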

  • compression vs. no compression

Applying compression is something that will benefit most users in the long run. I would not recommend compressing the whole file as one stream; compression with chunking should work better. Also, please check different compression methods: I have had data where zstd and other common compression methods offered (almost) no compression, but e.g. blosc/blosc2 could compress the data to 0.5-0.6 of its original size.
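A rough sketch of what chunked compression could look like, compared against compressing the whole buffer in one go (zlib stands in for a generic whole-file compressor; the chunk size is arbitrary and the ratios depend entirely on the data):

import zlib
import blosc2
import numpy as np

draws = np.random.randn(100_000, 10)
raw = draws.tobytes()

# Whole-buffer, general-purpose compression for comparison.
whole = zlib.compress(raw, level=9)

# Chunked compression: each chunk is independently decompressible,
# so a reader can pull out only the draws it needs.
chunk_rows = 10_000
chunks = [
    blosc2.compress(draws[i:i + chunk_rows].tobytes(), typesize=8,
                    clevel=9, filter=blosc2.Filter.BITSHUFFLE,
                    codec=blosc2.Codec.ZSTD)
    for i in range(0, draws.shape[0], chunk_rows)
]

print(len(whole) / len(raw), sum(map(len, chunks)) / len(raw))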

  • metadata

I would recommend having an option to include suitable metadata in the same file.

Then there is of course the question: should we keep the data in a table format (the current assumption is wide format), or could it exist as an nD structure (e.g. HDF5/NetCDF/Zarr)? For example, ArviZ keeps the data in an nD format with named dimensions and other information.
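For instance, an nD layout with named dimensions plus metadata in the same file could look like this in HDF5 (a sketch only, not a proposed schema; the dataset and attribute names are made up):

import h5py
import numpy as np

draws = np.random.randn(4, 1000, 10)  # (chain, draw, parameter)

with h5py.File("posterior.h5", "w") as f:
    dset = f.create_dataset("posterior/theta", data=draws)
    # Named dimensions, ArviZ-style, stored as a string attribute here.
    dset.attrs["dims"] = ["chain", "draw", "theta_dim_0"]
    # Metadata lives alongside the draws in the same file.
    f.attrs["stan_version"] = "x.yy.z"  # placeholder value
    f.attrs["seed"] = 12345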

On compression:

  • CSVs can be compressed even if the contents were random, since they use more bits than needed to represent the digits 0-9, the comma, and the decimal point. In addition, we used to save only 6 and now save only 8 significant digits, which makes the content less random (see the sketch after this list).
  • For doubles, most of the bits are random even when we have high autocorrelation or the leading digits of a variable are constant, so it’s unlikely that we can compress much.
  • The reason to use a binary format is not only to improve write speed. An important part is having the same behavior whether generated quantities are run at sampling time or afterwards: at sampling time, generated quantities use the double values in memory, but when run after sampling they use only the precision that was used to write the CSV. In addition, as Stan gains more kinds of constrained variable types, it becomes more likely that, even if most of the bits are random for practical purposes, we need high accuracy to keep good behavior for the constraints.
  • Thus, I would not spend time on compression.
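The sketch promised above: gzip the same random draws as CSV-style text and as raw IEEE doubles (the exact numbers will vary; the point is only the direction of the gap):

import gzip
import numpy as np

x = np.random.randn(100_000)

# CSV-style text at 8 significant digits vs. the raw 8-byte doubles.
as_text = ",".join(f"{v:.8g}" for v in x).encode()
as_binary = x.tobytes()

print(len(gzip.compress(as_text)) / len(as_text))
print(len(gzip.compress(as_binary)) / len(as_binary))
# The text shrinks a lot (each byte carries only ~3.3 bits of digit
# entropy), while the doubles' mantissa bits are close to random and
# barely budge.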

I’m with Aki re: compression. Just moving to binary gives ~30% smaller files than CSV, and most of the content is too random to be compressible. Compression would also add a dependency to both the writer and the reader, which takes away one of the benefits of a custom binary format.

I’d also argue for writing in row-major order, as that’s how the data arrive. If your model is so large that a column-major format would significantly benefit you, there’s always the option to postprocess once before analysis. But I don’t think cmdstan is the right layer to support that.
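Such a one-off postprocessing pass is cheap anyway (a sketch assuming the same hypothetical row-major float64 layout as above; the file names and shape are made up):

import numpy as np

n_draws, n_params = 100_000, 200
draws = np.memmap("draws_rowmajor.bin", dtype=np.float64,
                  mode="r", shape=(n_draws, n_params))

# One transpose pass: afterwards each parameter is contiguous on disk.
np.ascontiguousarray(draws.T).tofile("draws_colmajor.bin")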


As we’re already talking about compression/file size, I’d assume a good general solution would be a mode where CmdStan writes out only the unconstrained/sampler parameters in a binary format. That would give you bitwise reproducibility (disregarding RNG functions) and a smaller file size.

Though of course part of the appeal of CmdStan is that it writes out whatever the user specifies.

(I do not think that gzip-like compression is worth the drawbacks; I was mostly curious about what’s possible!)


I did some testing with double (float64) and int32:

import blosc2
import numpy as np

# 1e6 draws of 10 parameters: 80_000_000 bytes as float64
arr = np.random.randn(1_000_000, 10)
arr_bytes = arr.tobytes()
arr_compressed = blosc2.pack_array(
    arr,
    clevel=9,
    filter=blosc2.Filter.BITSHUFFLE,
    codec=blosc2.Codec.ZSTD,
)
print(len(arr_compressed) / len(arr_bytes))
# 0.8828244875

# 2e6 integers in [-1000, 1000): 8_000_000 bytes as int32
arr_int = np.random.randint(-1000, 1000, size=2_000_000).astype(np.int32)
arr_int_bytes = arr_int.tobytes()
arr_int_compressed = blosc2.pack_array(
    arr_int,
    clevel=9,
    filter=blosc2.Filter.BITSHUFFLE,
    codec=blosc2.Codec.ZSTD,
)
print(len(arr_int_compressed) / len(arr_int_bytes))
# 0.375211

So roughly 12% compression for doubles and 62% for integers.


There are only 2,000 possible distinct integers, so in your 2 million samples there will be a lot of duplicates. I would expect the compression rate to drop significantly when you increase the possible outcome space.
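That’s easy to check by widening the range (a sketch reusing the benchmark above; expect the ratio to creep toward 1.0 as the outcome space grows and more of the bits become effectively random):

import blosc2
import numpy as np

for hi in (1_000, 1_000_000, 1_000_000_000):
    arr = np.random.randint(-hi, hi, size=2_000_000).astype(np.int32)
    packed = blosc2.pack_array(
        arr,
        clevel=9,
        filter=blosc2.Filter.BITSHUFFLE,
        codec=blosc2.Codec.ZSTD,
    )
    print(hi, len(packed) / arr.nbytes)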


Can you send the code for these benchmarks? There is a lot of weird minutiae to concurrent stuff that can hide a lot of the gains you would normally see. Happy to take a look!

I also agree with this. Let’s get an MVP PR up, and then we can talk about extras like compression.


The first set is in the report, and the one I posted here was just a bunch of bash calls timed with `time`, using the respective arguments.