Context: In early January 2020, @rok_cesnovar proposed a method for serializing Stan output using JSON. This post tries to continue the discussion.
Any one of the following serialization formats could be used to improve serialization of Stan output: YAML, Protobuf, Avro, CBOR. (JSON is not an option as it cannot encode NaN and infinity.) The current non-standard CSV format used by stream_writer.hpp is difficult for newcomers to understand and requires writing custom parsing code. (In particular, fully parsing the output of the sample writer requires a state machine.) Replacing the existing format with something better, if such a thing exists, is desirable.
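To make the JSON limitation concrete, here is a minimal sketch using only Python's standard `json` module. The dictionary keys are made up for illustration and are not taken from any Stan writer. By default Python emits the tokens `NaN` and `Infinity`, which are not part of the JSON grammar; with `allow_nan=False` the encoder refuses instead.

```python
import json
import math

# Hypothetical draw with non-finite values (names are illustrative only).
draw = {"lp__": -7.25, "theta": math.nan, "upper": math.inf}

# Default behaviour: Python writes NaN/Infinity tokens, which strict JSON
# parsers in other languages may reject.
lenient = json.dumps(draw)

# Strict behaviour: the encoder raises ValueError for non-finite floats.
try:
    json.dumps(draw, allow_nan=False)
    strict_error = None
except ValueError as exc:
    strict_error = str(exc)

print(lenient)
print(strict_error)
```

So a JSON-based format would need either a non-standard extension or an ad hoc convention (for example, encoding non-finite values as strings), and every reader would have to know about it.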
This post summarizes the advantages and disadvantages of these different formats. All of these formats are mature and used by hundreds if not thousands of large organizations and firms. Python and R can read these formats. All of these formats can be predictably translated to and from JSON.
TL;DR: YAML and CBOR are worth looking at.
Formats
YAML
YAML is widely used. It is more expressive than JSON. For example, YAML distinguishes between floating-point numbers and integers (JSON does not). Floating-point NaN and infinity values can be encoded.
Advantages:
- Human-readable
- Used everywhere.
Disadvantages:
- Slow and inefficient for doubles. Text-based encoding of double values is particularly inefficient. Encoding, say, the double value ⅓ uses an 18-byte UTF-8 string.
- It’s not universally loved. Widely regarded as having too many features.
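The size claim for text-encoded doubles can be checked with the Python standard library alone. This is a rough sketch, not a benchmark: it compares the shortest decimal string that round-trips the double ⅓ with the raw 8-byte IEEE 754 representation that a binary format would store.

```python
import struct

x = 1 / 3

# Shortest decimal text that parses back to exactly the same double.
text = repr(x)
text_bytes = text.encode("utf-8")

# Raw IEEE 754 binary64 value (little-endian chosen arbitrarily here).
binary = struct.pack("<d", x)

print(text, len(text_bytes), len(binary))
```

Any text format (YAML, JSON, or the current CSV) pays roughly this overhead per value, plus the cost of converting between decimal and binary on every write and read.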
Protocol Buffers
Well-known and widely used binary serialization format developed and used by the advertising and consumer-surveillance firm Google. It uses and requires a schema written in the Protocol Buffers language. This schema is used to generate code in whatever language one wants (e.g., Python, R, C++).
If our data output remains simple—mostly draws and a tiny bit of metadata—distributing a schema and doing the code generation adds considerable complexity.
Advantages:
- Very fast. Binary encoding of floating-point numbers.
- Mature tooling.
- Some Stan devs have experience using it (httpstan uses it).
Disadvantages:
- Not human-readable.
- Requires writing and distributing a schema file using the Protobuf language.
- Requires a code generation tool.
- Adoption outside of Google is not high (“Bits on the Wire” by Tim Bray).
- Open-source but development controlled entirely by Google (à la Android and Chrome).
- Using the Protobuf C++ and Python APIs is not pleasant. This sentiment seems to be widely shared. (It’s also my experience.)
Avro
Apache Avro occupies the same space as Protocol Buffers. Avro may have greater adoption than Protobuf. (It’s typically used as the serialization format for Apache Kafka.) Requires a schema. Avro schemas are written in JSON. The schema language seems simpler than the one used by protobuf. There’s also less code-generation required.
Again, if our data output remains simple—mostly draws and a tiny bit of metadata—distributing a schema and doing the code generation adds complexity.
Advantages:
- Very fast. Binary encoding of floating-point values.
- Mature tooling.
- Supported by the Apache Foundation.
Disadvantages:
- Not human-readable.
- Requires writing and distributing a schema file using the JSON-based Avro language.
- Requires a code generation tool.
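To give a feel for what a JSON-based Avro schema looks like, here is a hypothetical sketch built as a Python dictionary and serialized with the standard `json` module. The record name, namespace, and fields are my own invention for illustration; nothing like this exists in Stan today, and no Avro library is used.

```python
import json

# Hypothetical Avro schema: one record per draw. All names are assumptions.
draw_schema = {
    "type": "record",
    "name": "Draw",
    "namespace": "org.example.stan",
    "fields": [
        {"name": "chain", "type": "int"},
        {"name": "iteration", "type": "int"},
        # All parameter values for this draw, stored as binary doubles.
        {"name": "values", "type": {"type": "array", "items": "double"}},
    ],
}

# The schema file that would have to be distributed alongside the data.
schema_json = json.dumps(draw_schema, indent=2)
print(schema_json)
```

Even for output this simple, writers and readers both need access to a file like this, which is the extra complexity mentioned above.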
CBOR
CBOR is the most popular “binary JSON” format. It’s an IETF standard (RFC 7049). Unlike JSON, CBOR can distinguish between integer and IEEE float types.
Advantages:
- Very fast. Binary encoding of floating-point values.
- IETF standard.
- Schema-less. Works essentially like JSON.
- Developer-friendly API. Python, R, and C++ APIs are going to be virtually identical to the text-based JSON API.
Disadvantages:
- Not human-readable.
- Newer than other formats. Lower adoption by organizations and firms.
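To illustrate how CBOR keeps integers and floats apart, here is a deliberately tiny hand-rolled sketch using only the standard library. It covers just two cases from the specification: unsigned integers 0 to 23 (stored in a single initial byte) and double-precision floats (initial byte 0xfb followed by 8 big-endian bytes). The function names are mine, and this is not a substitute for a real CBOR library.

```python
import math
import struct

def encode_cbor_scalar(value):
    """Encode an int in 0..23 or a float as one CBOR data item (sketch only)."""
    if isinstance(value, bool):
        raise TypeError("booleans are out of scope for this sketch")
    if isinstance(value, int):
        if not 0 <= value <= 23:
            raise ValueError("this sketch only handles integers 0..23")
        return bytes([value])  # major type 0, value held in the initial byte
    if isinstance(value, float):
        # Major type 7, additional information 27: IEEE 754 binary64.
        return b"\xfb" + struct.pack(">d", value)
    raise TypeError("unsupported type for this sketch")

def decode_cbor_scalar(data):
    """Decode the two item kinds produced by encode_cbor_scalar."""
    first = data[0]
    if first <= 23:
        return first
    if first == 0xFB:
        return struct.unpack(">d", data[1:9])[0]
    raise ValueError("item kind not handled by this sketch")

int_item = encode_cbor_scalar(1)      # one byte
float_item = encode_cbor_scalar(1.0)  # nine bytes
nan_item = encode_cbor_scalar(math.nan)
inf_item = encode_cbor_scalar(math.inf)

print(int_item.hex(), float_item.hex(), nan_item.hex(), inf_item.hex())
```

The integer 1 and the double 1.0 get different encodings and decode back to different types, and NaN and infinity need no special treatment because the double is stored bit for bit.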
Discussion
This post tries to summarize the advantages and disadvantages of each format for a very narrow use case: serializing Stan output. My loosely-held belief is that CBOR and YAML would work well for Stan.
Thanks again to @rok_cesnovar for (re)starting this discussion. Thanks also to @krzysztofsakrejda for his contributions to an earlier version of this discussion.