Compressed files also lose the ability to be easily mmap’d into arrays in the downstream language
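To make the mmap point concrete, here is a minimal sketch (file name and shape are made up for illustration): a raw, uncompressed binary file of doubles maps straight into a NumPy array with no parsing and no copy, which a compressed file cannot do because its on-disk bytes no longer match the array layout.

```python
import os
import tempfile

import numpy as np

# Hypothetical layout: 1000 draws x 5 columns of float64, raw row-major bytes.
draws = np.random.randn(1000, 5)
path = os.path.join(tempfile.mkdtemp(), "draws.bin")
with open(path, "wb") as f:
    f.write(draws.tobytes())

# Downstream, the file maps directly into an array: no parse step, no copy,
# and the OS pages data in lazily as it is accessed.
mapped = np.memmap(path, dtype=np.float64, mode="r", shape=(1000, 5))
```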
Yea, I don’t think there is any relevant upside to it. I was just curious and wanted to try it :)
Not saying that this would be in any way practically useful, but surely it should be possible to get better compression than ~4%? 🤔 Also not trying to nerdsnipe anyone!
It’s hard because the draws are fairly random, and they make up by far most of the data. Funnily enough, CSV is easier to compress.
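A quick stdlib-only illustration of why that is: the same random values compress noticeably when stored as decimal text, but barely at all as raw float64 bytes (zlib stands in for gzip here; exact ratios vary run to run).

```python
import random
import struct
import zlib

vals = [random.random() for _ in range(100_000)]

# CSV-style text: every byte is one of only ~12 characters (digits, '.', ','),
# so each 8-bit byte carries far fewer than 8 bits of information.
as_text = ",".join(f"{v:.8f}" for v in vals).encode()

# Raw doubles: the 52 mantissa bits of each value are essentially random,
# so the bytes are close to incompressible.
as_raw = b"".join(struct.pack("<d", v) for v in vals)

text_ratio = len(zlib.compress(as_text, 9)) / len(as_text)
raw_ratio = len(zlib.compress(as_raw, 9)) / len(as_raw)
```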
I would like to add a couple of comments when thinking about binary formats:
- read all in memory
- read only parts of the parameters to memory
- stream to memory
- chunking + parallel reading
There are caveats with all of these, but I also recommend thinking about how these files will actually be used. E.g. having the option to lazily read will be beneficial for larger models with large sample counts.
- compression vs no-compression
Applying compression is something that in the long run will be beneficial for most users. I would not recommend compressing the whole file as one block; compression with chunking should work better. Also, please check different compression methods (I have had data where zstd and other common methods offered almost no compression, but e.g. blosc/blosc2 could compress the data to 0.5–0.6 of its size).
- metadata
I would recommend having an option to include suitable metadata in the same file.
Then there is of course the question: should we keep the data in a table format (the current assumption is wide format), or could it exist as an nD structure (e.g. HDF5/NetCDF/Zarr)? For example, ArviZ keeps the data in nD format with named dimensions and other information.
On compression:
- CSVs can be compressed even if the contents are random, since they use more bits than needed to represent the characters 0–9, the comma, and the dot. In addition, we save only a limited number of digits (it used to be 6, now 8), which makes the content less random than full precision would be.
- For doubles, most of the bits are random even when we have high autocorrelation or the leading digits of the variable are constant. So it’s unlikely that we can compress much.
- The reason to use a binary format is not only to improve write speed. An important part is to have the same behavior whether generated quantities are run at sampling time or afterwards. At sampling time, generated quantities use the double values in memory, but after sampling they use the accuracy that was used to write the CSV. In addition, as Stan gains more types of constrained variables, it becomes more likely that, even if most of the bits are random for practical purposes, we need high accuracy to keep the constraints well behaved.
- Thus, I would not spend time on compression.
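The reproducibility point is easy to demonstrate with the stdlib: a double written with 8 significant digits and read back is almost never the same bit pattern, so post-hoc generated quantities see slightly different inputs than sampling-time ones.

```python
import random
import struct

random.seed(1)
in_memory = [random.gauss(0.0, 1.0) for _ in range(1000)]

# Round-trip through the textual accuracy a CSV writer might use
# (8 significant digits), then parse back to double.
from_csv = [float(f"{v:.8g}") for v in in_memory]

# Compare bit patterns: nearly every value changes.
bits = lambda x: struct.pack("<d", x)
changed = sum(bits(a) != bits(b) for a, b in zip(in_memory, from_csv))
```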
I’m with Aki re: compression. Just moving to binary yields ~30% smaller files than CSV, and most of the content is too random to be compressible. Compression also adds a dependency to both the writer and the reader, which takes away one of the benefits of a custom binary format.
I’d also argue for writing in row-major as that’s how the data arrive. If your model is so large that having a column major format would significantly benefit you, there’s always the option to postprocess once before analysis. But I don’t think cmdstan is the right layer to support that.
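As a sketch of that one-time postprocessing step (file name and shape hypothetical): converting the row-major file to a column-major copy is a few lines, after which each parameter's draws are contiguous in memory.

```python
import os
import tempfile

import numpy as np

# Hypothetical raw row-major file: 10_000 draws x 8 parameters, float64.
path = os.path.join(tempfile.mkdtemp(), "draws.bin")
np.random.randn(10_000, 8).astype(np.float64).tofile(path)

# One-time postprocess before analysis: load and make a column-major copy.
rows = np.fromfile(path, dtype=np.float64).reshape(10_000, 8)
cols = np.asfortranarray(rows)  # same table, column-major layout

# Now a single parameter's draws are one contiguous slice.
one_param = cols[:, 3]
```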
As we’re already talking about compression/file size, I’d assume a good general solution would be to have a mode where CmdStan only writes out the unconstrained/sampler parameters in a binary format. That would give you bitwise reproducibility (disregarding rng functions), and a smaller file size.
Though of course part of the appeal of CmdStan is that it writes out whatever the user specifies.
(I do not think that gzip like compression is worth the drawbacks, I was mostly curious about what’s possible!)
I did some testing with double (float64) and int32:
```python
import blosc2
import numpy as np

arr = np.random.randn(1_000_000, 10)
arr_bytes = arr.tobytes()  # 80000000 bytes
arr_compressed = blosc2.pack_array(
    arr,
    clevel=9,
    filter=blosc2.Filter.BITSHUFFLE,
    codec=blosc2.Codec.ZSTD,
)
print(len(arr_compressed) / len(arr_bytes))
# 0.8828244875

arr_int = np.random.randint(-1000, 1000, size=2_000_000).astype(np.int32)
arr_int_bytes = arr_int.tobytes()  # 8000000 bytes
arr_int_compressed = blosc2.pack_array(
    arr_int,
    clevel=9,
    filter=blosc2.Filter.BITSHUFFLE,
    codec=blosc2.Codec.ZSTD,
)
print(len(arr_int_compressed) / len(arr_int_bytes))
# 0.375211
```
So ~12% compression for doubles and ~62% for integers.
There are only ~2k possible distinct integers there, so across your 2M samples there will be a lot of duplicates. I would expect the compression rate to drop significantly as you widen the possible outcome space.
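This is easy to check with the stdlib: the same number of int32 draws compresses far better from a ~2k-value range than from the full int32 range (zlib stands in for zstd here; exact ratios will differ from the blosc2 numbers above).

```python
import random
import struct
import zlib

random.seed(0)
n = 200_000


def pack_int32(lo, hi):
    """n uniform random int32 draws from [lo, hi], as little-endian bytes."""
    return b"".join(struct.pack("<i", random.randint(lo, hi)) for _ in range(n))


narrow = pack_int32(-1000, 1000)        # ~11 bits of entropy per 32-bit value
wide = pack_int32(-(2**31), 2**31 - 1)  # essentially random bytes

narrow_ratio = len(zlib.compress(narrow, 9)) / len(narrow)
wide_ratio = len(zlib.compress(wide, 9)) / len(wide)
```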
Can you send the code for these benchmarks? There is a lot of weird minutia to concurrent stuff that can hide a lot of gains you would normally see. Happy to take a look!
I also agree with this. Let’s get an MVP PR up, and then we can talk about extras like compression.
The first set is in the report, and the one I posted here was just a bunch of `time`’d bash calls with the respective arguments.
Yes, this is true, but this could also be true if there are many generated quantities etc.
Also, to be explicit: my example was not meant to show how great compression methods are, just to put some numbers into the discussion. Like @avehtari said, most bits of a double are random, which limits the compression potential.
@jonah (asking you because you started this :p but maybe Aki or Bob would be the right ones?)
Would you like me to file a PR with the Single-File stanbin format (documented in the 2nd report) to cmdstan?
I put it at the top of the report but just for clarity, I have basically zero advanced C++ knowledge and the entire implementation was written by Claude. I mainly thought about it on a conceptual level.
I’m happy if someone takes this and runs with it, but I can also just file the PR to have a concrete starting point.
I could also file a corresponding PR to cmdstanr for the reader?
The best way to proceed is probably to have a more formal design discussion of the file format, at least as a cmdstan issue if not as a PR to stan-dev/design-docs.
Thanks for pushing on this!
I think the binary file format is a big change and would benefit from design document PR and discussion there, before making a PR for the code
I think starting with what @WardBrian and @avehtari suggested makes sense for a change this big.
Indeed, thanks @scholz!
Just for awareness, this discussion has happened a handful of times; there should be at least one design doc and a few very long Discourse threads. They’re worth reading…
TL;DR: the hardest part has always been getting the various parts of the Stan project to just agree to go ahead with something.
I would disagree with this characterization. In my opinion, the issue hasn’t been finding consensus, it has been that (thus far) said consensus was around a perhaps-too-ambitious option.
For example, I think there was good consensus that Arrow would be a nice option to provide. The reason it still is not an option is that using Arrow well would require a redesign of the stan::io::writer interface to capture more information than it currently does about shapes and types of variables. Some historical poor factorization of the algorithm code makes touching this interface extremely painful, so Arrow ended up ‘blocked’ by the desire to refactor the Stan library internals, which is a project that is both quite large and (on its own) relatively low priority, since it would mean a lot of work on code that is otherwise slow-moving and working.
All this to say that I think “simpler” formats, like the one discussed here, have a significantly higher chance of actually making it into users’ hands. By giving up on some of the previous desiderata (storing integers as integers, including shape metadata in the file format, etc.) and instead basically doing the same flat table as CSV, just in raw bytes, you reduce the lift from “refactor most of stan-dev/stan” to “add one or two files to stan-dev/cmdstan”.