There’s intermittently been talk of switching away from csv to something else. I suggested hdf5 (due to my familiarity and the single-writer-multiple-readers feature), but lots of others have been discussed and last I recall I think protobuf was the top contender for a new format. Is there anywhere central where I can follow discussions/progress on this topic?
I believe this was the last thread on that topic: Notes on Stan Output Serialization Options (YAML, Protobuf, Avro, CBOR)
I wrote a design doc on JSON output, but we closed it because of issues we discovered with how the most popular R and Python JSON libraries handle NaN/Inf. We would essentially have had to maintain different dialects, and the maintainer of the R jsonlite package declined the feature request to unify its behavior with Python’s libraries.
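To make the NaN/Inf problem concrete, here is a minimal sketch of the Python side only, using the standard-library `json` module. The draw names are made up for illustration, and the R `jsonlite` behavior is not shown.

```python
import json
import math

# Hypothetical draw containing non-finite values.
draws = {"lp__": float("nan"), "theta": float("inf")}

# By default Python emits the bare tokens NaN and Infinity,
# which are not part of the JSON grammar (RFC 8259).
lenient = json.dumps(draws)

# With allow_nan=False the encoder refuses instead of emitting them.
try:
    json.dumps(draws, allow_nan=False)
    strict_error = None
except ValueError as err:
    strict_error = err

# Python's own decoder accepts the non-standard tokens back,
# but a strict parser in another language may reject them.
round_trip = json.loads(lenient)
```

So a writer either emits something that is not strictly JSON or needs a separate convention for non-finite values, and every reader has to agree on that convention.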
Other than that AFAIK no work is going on here, at least no one has mentioned it.
The “what should that something else be” question is a bit of a bike shed issue.
The more interesting question is what information, at what level or granularity, we want from the algorithms, and how to design this so that the services layer wraps them in a way that makes it easy to stream. There’s this design doc from 2018: https://github.com/stan-dev/design-docs/blob/master/designs/0001-logger-io.md
Excuse my tangent if this isn’t pertinent (I don’t feel sufficiently expert in this area to really speak with confidence), but I wonder if these kinds of open questions suggest that a so-called self-describing format like hdf5 might be best, such that we can use the header field to denote version numbers and link version numbers to specific choices for the things you’re talking about? Subsequent interfaces would then simply check the version and assume that version’s specific features when decoding.
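The version-check idea could look roughly like the sketch below. To stay dependency-free it uses a line-delimited JSON header rather than HDF5 attributes, and the `format_version` field, the two decoders, and what differs between versions are all hypothetical.

```python
import io
import json

# Hypothetical decoders keyed by format version; the field names are
# illustrative and not part of any existing Stan output format.
def decode_v1(record):
    # Assumed v1 convention: every value was written as a double.
    return {k: float(v) for k, v in record.items()}

def decode_v2(record):
    # Assumed v2 convention: values keep the type they were written with.
    return dict(record)

DECODERS = {1: decode_v1, 2: decode_v2}

def read_draws(stream):
    # The first line is the header; it tells the reader how to decode the rest.
    header = json.loads(stream.readline())
    version = header["format_version"]
    if version not in DECODERS:
        raise ValueError(f"unsupported format_version: {version}")
    decode = DECODERS[version]
    return [decode(json.loads(line)) for line in stream if line.strip()]

v1_file = io.StringIO('{"format_version": 1}\n{"n": 3, "theta": 0.5}\n')
v2_file = io.StringIO('{"format_version": 2}\n{"n": 3, "theta": 0.5}\n')
v1_draws = read_draws(v1_file)
v2_draws = read_draws(v2_file)
```

The reader never has to guess: an unknown version fails loudly, and a known one selects the matching decoding rules.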
I don’t know enough about this either - that said, self-describing formats sound like a good thing.
I think we need a design document for this if the goal is to gather any
sort of rough consensus.
I also think there are two different (potential) IO re-factors. The
first concerns what kind of format should be used by the callback
writers in Stan. Currently these write CSV lines (with some exceptions).
They could write something else (e.g., JSON, CBOR, Protobuf). Note that
serialization formats which cannot stream data are not useful in this
case. The second re-factor concerns the serialization of the draws and
metadata after sampling has finished. CSV is used here but Arrow or HDF
or Parquet would likely be better.
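
As a rough illustration of the streaming requirement, here is a sketch of a callback-style writer that emits one complete JSON document per line. It is not Stan's actual C++ writer interface; the class name and column names are assumptions made for the example.

```python
import io
import json

class JsonLinesWriter:
    """Hypothetical callback writer: one self-contained JSON record per line."""

    def __init__(self, stream, names):
        self.stream = stream
        self.names = names

    def __call__(self, values):
        # Each call writes a complete record, so a reader can consume
        # the output while sampling is still running.
        self.stream.write(json.dumps(dict(zip(self.names, values))) + "\n")
        self.stream.flush()

out = io.StringIO()
writer = JsonLinesWriter(out, ["lp__", "theta"])
writer([-1.5, 0.25])

# Everything written so far is already parseable ...
partial = [json.loads(line) for line in out.getvalue().splitlines()]

# ... and later draws simply append more lines.
writer([-1.25, 0.5])
complete = [json.loads(line) for line in out.getvalue().splitlines()]
```

A format that needs a closing delimiter or a trailing index before it can be parsed would not allow the intermediate read, which is why it fits the second re-factor better than the first.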
I was slightly involved in the discussions that led to the design document @mitzimorris shared. IMHO there are at least two tangled issues:
1. An abstraction over the output in code (i.e. internal interface) - currently what is passed around is vectors of doubles or vectors of strings. It would be beneficial if this could be extended so that (an incomplete list of stuff I recall, not an official agreement of the team):
   - Switching some output streams on/off (diagnostics, unconstrained params, …) is easy, while avoiding a performance hit for evaluating whether an output is on/off.
   - The outputs of all methods (sampling, optimizing, ADVI, potentially something in the future) are unified.
   - Additional streams of diagnostics that have a different format / frequency than “a set of values per iteration” can be created (the original discussion was motivated by a desire to stream more details about divergent trajectories out of the sampler).
   - Type information is maintained - most notably, in the current implementation, int values from gen. quants are converted to double and some info (I think adaptation etc.) is streamed as string.
2. Choosing a target format for the serialization (i.e. the outside interface) - which seems to be the present concern.
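
A very rough sketch of what the internal abstraction might look like, with named streams that carry a typed schema and can be switched off. All class, stream, and column names here are hypothetical; this is not a proposal anyone on the team agreed on.

```python
from dataclasses import dataclass, field

@dataclass
class Stream:
    """Hypothetical named output stream with a fixed, typed schema."""
    name: str
    schema: dict            # column name -> Python type
    enabled: bool = True
    rows: list = field(default_factory=list)

    def write(self, **values):
        # Reject values whose type does not match the declared schema,
        # so ints stay ints instead of silently becoming doubles.
        for column, expected in self.schema.items():
            if not isinstance(values[column], expected):
                raise TypeError(f"{self.name}.{column} expects {expected.__name__}")
        self.rows.append(values)

class Output:
    def __init__(self, streams):
        self._streams = {s.name: s for s in streams}

    def stream(self, name):
        # Returns None for a disabled stream; callers look it up once
        # before the loop, so the per-iteration cost is a None check.
        s = self._streams[name]
        return s if s.enabled else None

output = Output([
    Stream("draws", {"theta": float, "n_events": int}),
    Stream("divergences", {"iteration": int, "energy": float}, enabled=False),
])

draws = output.stream("draws")
divergences = output.stream("divergences")
for i in range(3):
    if draws is not None:
        draws.write(theta=0.1 * i, n_events=i)
    if divergences is not None:
        divergences.write(iteration=i, energy=0.0)
```

Nothing here says how rows are serialized; a CSV, JSON, or binary backend could be attached to each stream without the algorithm code changing.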
Now 1) has also been a bit of a bike shed and the discussion didn’t really move forward (I admit I was sometimes a less than ideal part of those conversations). But I think solving 2) would be easier and a bit less contentious if 1) was implemented well, as supporting additional formats/switching to those formats would be easier.
Hope you will be able to move this forward now :-)
These are separate issues in part, because a solution to (1) could support multiple solutions to (2). But we can’t just jump to (2), because (1) determines what gets serialized in (2).
I wouldn’t have characterized the prior discussions as bikeshedding so much as everyone having different opinions on the specifics of (2). That’s true even for our current set of outputs and even more of a problem for new proposals like adding trajectory information optionally. Issues arose such as whether we need a binary or human-readable serialization format, whether we needed metadata on every row of output, whether NaN and Infinity and denormalized numbers were in scope, whether various kinds of output got serialized to the same stream (file) or separate ones, and similar issues. Bikeshedding would’ve been debating the file names or the form of text output for time stamps that were never going to be parsed programmatically.
We also had differing opinions on (1), specifically on whether we wanted to go with something like @seantalts’s proposal of converting all the writers to a static, logger-like pattern, and if not, how many new writers we needed to support new output and how they’d be organized so they could be used by CmdStan, RStan, and PyStan. The issue there is that these all have different requirements, and different things are hard or easy depending on the choices that are made. We couldn’t even settle on Eigen vs. std::vector data structures on the inside because of issues about whether callbacks would be easy to write in Python or R.