Notes on Stan Output Serialization Options (YAML, Protobuf, Avro, CBOR)

Context: In early January 2020, @rok_cesnovar proposed a method for serializing Stan output using JSON. This post tries to continue the discussion.

Any of the following serialization formats could improve the serialization of Stan output: YAML, Protobuf, Avro, or CBOR. (JSON is not an option, as it cannot encode NaN and infinity.) The current non-standard CSV format used by stream_writer.hpp is difficult for newcomers to understand and requires writing custom parsing code. (In particular, fully parsing the output of the sample writer requires a state machine.) Replacing the existing format with something better, if such a thing exists, is desirable.
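A quick illustration of the NaN problem, using Python's standard json module (strict mode follows the JSON spec; the default mode emits non-standard tokens):

```python
import json
import math

# Standard JSON has no encoding for NaN or infinity. Python's json
# module only enforces this when asked (allow_nan=False); by default
# it emits the non-standard tokens NaN/Infinity, which other parsers
# may reject.
try:
    json.dumps({"lp__": math.nan}, allow_nan=False)
except ValueError as err:
    print("strict JSON:", err)

print(json.dumps({"lp__": math.nan}))  # non-standard: {"lp__": NaN}
```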

This post summarizes the advantages and disadvantages of these different formats. All of these formats are mature and used by hundreds if not thousands of large organizations and firms. Python and R can read these formats. All of these formats can be predictably translated to and from JSON.

TL;DR: YAML and CBOR are worth looking at.



YAML

YAML is widely used and more expressive than JSON. For example, YAML distinguishes between floating-point numbers and integers (JSON does not), and it can encode floating-point NaN and infinity values.
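A hedged sketch of what a chunk of Stan output could look like in YAML (the field names here are purely illustrative, not a proposal):

```yaml
# Illustrative only: field names are made up for this example.
num_samples: 1000            # integer (YAML keeps the distinction)
stepsize: 0.0632             # float
lp__: [-7.31, .nan, -.inf]   # NaN and infinities have spellings in YAML
```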


Advantages:

  • Human-readable.
  • Used everywhere.


Disadvantages:

  • Slow and inefficient for doubles. Text-based encoding of double values is particularly inefficient: encoding, say, the double value ⅓ uses an 18-byte UTF-8 string.
  • It’s not universally loved. Widely regarded as having too many features.
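The size claim above is easy to check in Python: the shortest decimal string that round-trips ⅓ is 18 bytes, versus 8 bytes for the raw IEEE 754 binary64 value.

```python
import struct

x = 1 / 3
text = repr(x)                 # shortest decimal string that round-trips
binary = struct.pack("<d", x)  # raw IEEE 754 binary64

print(text)          # 0.3333333333333333
print(len(text))     # 18 bytes as UTF-8 text
print(len(binary))   # 8 bytes in binary
assert float(text) == x        # the text form round-trips exactly
```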

Protocol Buffers

Protobuf is a well-known and widely used binary serialization format developed and used by the advertising and consumer-surveillance firm Google. It requires a schema written in the Protocol Buffers language; this schema is used to generate code in whatever language one wants (e.g., Python, R, C++).

If our data output remains simple—mostly draws and a tiny bit of metadata—distributing a schema and doing the code generation adds considerable complexity.
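For concreteness, a hypothetical Protobuf schema for draws might look like the following (proto3 syntax; the message and field names are invented for illustration, not an actual proposal):

```proto
// Hypothetical sketch only: names are invented for illustration.
syntax = "proto3";

message Draw {
  repeated double values = 1;        // one value per parameter, in order
}

message StanOutput {
  repeated string parameter_names = 1;
  repeated Draw draws = 2;
}
```

From a schema like this, protoc generates the Python, R, and C++ reader/writer code, which is exactly the distribution burden described above.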


Advantages:

  • Very fast. Binary encoding of floating-point numbers.
  • Mature tooling.
  • Some Stan devs have experience using it (httpstan uses it).


Disadvantages:

  • Not human-readable.
  • Requires writing and distributing a schema file using the Protobuf language.
  • Requires a code generation tool.
  • Adoption outside of Google is not high (“Bits on the Wire” by Tim Bray).
  • Open-source, but development is controlled entirely by Google (à la Android and Chrome).
  • Using the Protobuf C++ and Python API is not pleasant. This sentiment seems to be widely shared. (It’s also my experience.)


Avro

Apache Avro occupies the same space as Protocol Buffers and may have greater adoption than Protobuf. (It’s typically used as the serialization format for Apache Kafka.) Avro requires a schema, written in JSON. The schema language seems simpler than Protobuf’s, and less code generation is required.
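As an example of the JSON-based schema language, a hypothetical Avro record for a single draw might look like this (the field names are invented for illustration):

```json
{
  "type": "record",
  "name": "Draw",
  "fields": [
    {"name": "lp__", "type": "double"},
    {"name": "theta", "type": {"type": "array", "items": "double"}}
  ]
}
```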

Again, if our data output remains simple—mostly draws and a tiny bit of metadata—distributing a schema and doing the code generation adds complexity.


Advantages:

  • Very fast. Binary encoding of floating-point values.
  • Mature tooling.
  • Supported by the Apache Foundation.


Disadvantages:

  • Not human-readable.
  • Requires writing and distributing a schema file using the JSON-based Avro language.
  • Requires a code generation tool.


CBOR

CBOR is the most popular “binary JSON” format. It’s an IETF standard (RFC 7049). Unlike JSON, CBOR can distinguish between integer and IEEE floating-point types.
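To give a flavor of the format, here is a minimal Python sketch of two of CBOR's encoding rules from RFC 7049: a small unsigned integer fits in a single byte, and a double is the initial byte 0xfb followed by the big-endian IEEE 754 bytes. (A real encoder, of course, handles all the major types.)

```python
import math
import struct

def cbor_small_uint(n: int) -> bytes:
    # Major type 0: unsigned integers 0-23 embed the value in one byte.
    assert 0 <= n <= 23, "larger integers need a longer encoding"
    return bytes([n])

def cbor_double(x: float) -> bytes:
    # Initial byte 0xfb, then 8 bytes of big-endian IEEE 754 binary64.
    return b"\xfb" + struct.pack(">d", x)

print(cbor_small_uint(7).hex())     # 07
print(cbor_double(1.0).hex())       # fb3ff0000000000000
print(cbor_double(math.nan).hex())  # NaN encodes without any hacks
```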


Advantages:

  • Very fast. Binary encoding of floating-point values.
  • IETF standard.
  • Schema-less. Works essentially like JSON.
  • Developer-friendly API. Python, R, and C++ APIs are going to be virtually identical to the text-based JSON API.


Disadvantages:

  • Not human-readable.
  • Newer than other formats. Lower adoption by organizations and firms.


This post tries to summarize the advantages and disadvantages of each format for a very narrow use case: serializing Stan output. My loosely-held belief is that CBOR and YAML would work well for Stan.

Thanks again to @rok_cesnovar for (re)starting this discussion. Thanks also to @krzysztofsakrejda for his contributions to an earlier version of this discussion.


I’m a little unclear on scope:

  • Is this meant to be a serialization standard for a single chain or for multiple chains?
  • Is it meant to represent full input config to allow replication?
  • Is it meant to include all adaptation information to allow restarting?

Are the tools flexible enough in their schemas that we don’t have to push metadata out with every draw? That seems like it’d be a dealbreaker.

From the protobuf doc, they don’t recommend using it for large (> 1MB) data sets, but I’m not sure why.

I’m confused about what’s going on now with JSON and whether we need different JSON formats for CmdStanPy and CmdStanR. @mitzimorris—do you know?

This seems to contradict the previous point. Could you clarify?

Are the draws still in standard CSV format with header and values and all other data encoded in comments?

That’s the minimum you can get away with for a full double-precision ASCII encoding of 1/3. Here’s what the Wikipedia page on floating point says about double precision and ASCII:

If an IEEE 754 double-precision number is converted to a decimal string with at least 17 significant digits, and then converted back to double-precision representation, the final result must match the original number.[1]
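That round-trip guarantee is easy to check empirically in Python; the sketch below formats random doubles with 17 significant digits and parses them back:

```python
import os
import struct

# Check the 17-significant-digit round-trip claim on random bit patterns.
for _ in range(10000):
    (x,) = struct.unpack("<d", os.urandom(8))
    if x != x:                    # skip NaN: NaN != NaN by definition
        continue
    assert float("%.17g" % x) == x

print("17 significant digits round-trip exactly")
```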

Why? These both seem straightforward if the schema is static, we only do the code generation once, and the result is C++.

Do you have experience to report on the other tools’ ease of use? I’ve never even heard of CBOR or Avro, and I only know YAML as the terrible format used to configure bookdown and our web pages.

Edit: I forgot to add that binary formats will take up a lot more space for round numbers than ASCII. For example, 1 and 0 can be represented in two bytes in ASCII (EOS will eat up a byte), but take 8 bytes in double or long int, or 4 bytes as int.
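The size comparison in the edit above, checked with Python's struct module (the text lengths here exclude any delimiter or terminator byte):

```python
import struct

# Round numbers as ASCII text vs. fixed-width binary encodings.
print(len(b"1"), len(b"0"))     # 1 byte each as bare ASCII digits
print(struct.calcsize("<d"))    # 8 bytes as an IEEE 754 double
print(struct.calcsize("<q"))    # 8 bytes as a 64-bit integer
print(struct.calcsize("<i"))    # 4 bytes as a 32-bit integer
```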

I found this discussion useful as it talks about the serialization formats of Avro and protobuf (and Thrift, from [insert adjectives here] Facebook):

But it’s 7 years out of date.

Also, is the main goal here consistency across interfaces, speed, or compactness of output?

Edit: Just found an analysis from Criteo:

For small files, Avro has a high constant overhead. For large files (40 MB of binary output), they crown Thrift the winner: it was around twice as fast at serializing, with Avro and protobuf neck and neck behind it.

Was HDF5 considered? It is very well established in the scientific community and it’s been around forever.


Thanks for posting this! What in particular do you like about CBOR? Is there a mature-ish C++ package that can work with this format (or at least a stable one)? I think we’d also have to check whether it’s available in R and Python (there is a protobuf R package).

I would guess its adoption is even lower than protobuf’s, but Cap’n Proto also seems pretty neat!

Though it carries the same baggage as Protobuf, and for the record I have zero experience with this sort of thing.

Yes, there’s a popular, mature C++ library for JSON which also supports CBOR: (17.2k GitHub stars)


The scope is output: everything emitted to a logger or a writer when one calls a sampling function in stan::services. (In a call they have names logger, init_writer, sample_writer, and diagnostic_writer).

There are hacks for NaN and inf, but they are not part of the JSON standard.

In essence, yes.

Surely having no schema and no code generation step is less complex, right? (YAML, JSON, CSV require no schema.)

I have used Avro (in Python). If neither YAML nor CBOR are suitable, I’d go with it over Protobuf. I’ve used YAML in a lot of settings.

YAML is not great. One does get used to it. (JSON also has its quirks, in addition to those we already mentioned.) YAML would definitely work for Stan output. It would be a clear improvement over the current CSV+metadata mix. CBOR (or Avro) would, likewise, be a big improvement.

The Martin Kleppmann post is excellent, btw. Reading it alongside Tim Bray’s post on serialization gives you a great survey of where we stand.

I haven’t seen HDF5 used for (incremental) serialization like CSV/JSON/YAML/etc, only for storage.

We discussed HDF5 but compared to something like CBOR it cuts you off from being able to write simple custom parsers for output because it’s a very complex format. The C api seems good but the corresponding C++ wrapper has long-standing problems (had last I checked) that indicates a broken-ish maintenance process… so I would stick to CBOR for a binary format.

I’ve written a prototype of CmdStan using purely Cap’n Proto, and it’s a PITA to encode the entire input hierarchy in it (similar to the current pile of pointers), but for output it’s good and really fast. I’ve also done a hand-rolled binary format as a prototype.

The real issue isn’t the underlying tech (we could always switch it out or diversify). For me the next step to making this happen is to make a schema for our data that’s technology-independent. You can always pack bytes into something that has multi-language support (Protobuf/CBOR/capnproto/flatbuffers/etc… would all work), personally I’m leaning towards CBOR because it’s a common standard one step above custom byte-packing and some of the C++ libraries support YAML/JSON as alternative IO.

Possibly I don’t have sufficient expertise in this area to understand what you’re getting at, but I have been using h5py for years and hdf5r (successor to h5) for a decent amount of time too, and both seem to be pretty solid. Or were you referring to something else?

I’m guessing that h5py goes through the C api so avoids the C++ api problems, it wasn’t anything against HDF5, just that we would be leaning on the C++ API a lot and if it doesn’t get updated reliably in response to bugs that’s something to consider. My post was a long time ago so things may be better now!

If your interest is in serializing Stan output, then arviz has your back! The InferenceData object contains everything you need, and it already exists for PyStan.

NetCDF is the main serialization format that xarray (which arviz builds on top of) uses. But xarray is also completely interconvertible with pandas, so it can leverage all the I/O formats that pandas supports.

I’ve been looking at options in this space lately. I do have an affinity for most of the InferenceData (ID) schema, though programmatically keeping to spec isn’t really possible the way the Stan language is structured (for example, there’s no way to distinguish GQs that are part of the prior-predictive group versus the posterior-predictive group beyond establishing a variable-naming convention or adding entirely new program blocks to the Stan language).

ID as a schema can be implemented across a variety of array-oriented storage structures, hence the ArViZ use of xarray, which in turn supports options such as NetCDF4 (a subclass of HDF5), Zarr, etc. I previously advocated that the IO rewrite adopt HDF5, but after working with very large (10k+ parameter) models and encountering the serial-write bottleneck of HDF5, I’ve started to feel the lure of Zarr despite its lack of maturity. The efficient remote access of Zarr is also appealing for projects like posteriordb.

But the existence of good options hasn’t been the problem; it’s a lack of convergence on a consensus and/or availability of expertise and time to actually implement one of them.