Status of the IO re-factor?

There’s intermittently been talk of switching away from CSV to something else. I suggested HDF5 (due to my familiarity with it and its single-writer/multiple-readers feature), but lots of other options have been discussed, and last I recall protobuf was the top contender for a new format. Is there anywhere central where I can follow discussions/progress on this topic?
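For concreteness, here’s roughly what the single-writer/multiple-readers (SWMR) mode looks like in HDF5’s C API (usable from C++, HDF5 >= 1.10); the file and dataset names here are just made up, this is only a sketch of the feature I mean:

```cpp
#include <hdf5.h>

int main() {
  // SWMR requires the file to be created with "latest" library version bounds.
  hid_t fapl = H5Pcreate(H5P_FILE_ACCESS);
  H5Pset_libver_bounds(fapl, H5F_LIBVER_LATEST, H5F_LIBVER_LATEST);

  // Writer side: create the file and all datasets first, then switch into
  // SWMR mode. After this point other processes can open the file read-only
  // and poll for new draws while sampling is still running.
  hid_t file = H5Fcreate("draws.h5", H5F_ACC_TRUNC, H5P_DEFAULT, fapl);
  // ... create extendible datasets here ...
  H5Fstart_swmr_write(file);
  // ... append draws, calling H5Dflush/H5Fflush so readers see them ...

  // Reader side (in another process) would open with:
  //   H5Fopen("draws.h5", H5F_ACC_RDONLY | H5F_ACC_SWMR_READ, fapl);

  H5Fclose(file);
  H5Pclose(fapl);
  return 0;
}
```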

3 Likes

I believe this was the last thread on that topic: Notes on Stan Output Serialization Options (YAML, Protobuf, Avro, CBOR)

There was my design doc on JSON output, which we closed because of the issues discovered with handling NaN/Inf in the most popular R and Python JSON libs. We would essentially have had to maintain different dialects, and the maintainer of the R jsonlite package declined the feature request to unify its behavior with Python’s libs.
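To illustrate the dialect problem (a rough sketch of my own, not code from the design doc; all names made up): standard JSON has no token for NaN/Inf, so any writer has to pick one of several mutually incompatible conventions, and the popular readers don’t agree on which one they accept.

```cpp
#include <cmath>
#include <string>

// Three incompatible ways a writer could emit non-finite doubles.
enum class nan_dialect { bare_token, quoted_string, null_value };

std::string emit_double(double x, nan_dialect d) {
  if (std::isnan(x) || std::isinf(x)) {
    switch (d) {
      case nan_dialect::bare_token:     // non-standard JSON extension
        return std::isnan(x) ? "NaN" : (x > 0 ? "Infinity" : "-Infinity");
      case nan_dialect::quoted_string:  // standard JSON, but now it's a string
        return std::isnan(x) ? "\"NaN\"" : (x > 0 ? "\"Inf\"" : "\"-Inf\"");
      case nan_dialect::null_value:     // standard JSON, but loses NaN vs Inf
        return "null";
    }
  }
  return std::to_string(x);
}
```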

Other than that AFAIK no work is going on here, at least no one has mentioned it.

2 Likes

the “what should that something else be” question is a bit of a bike shed issue.

the more interesting question is what information, at what level of granularity, we want from the algorithms, and how to design this so that the services layer wraps them in a way that makes it easy to stream. there’s this design doc from 2018 - https://github.com/stan-dev/design-docs/blob/master/designs/0001-logger-io.md
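something like this minimal sketch is what i mean (purely illustrative, not the actual services API or the design doc’s interface): each callback receives one row at a time, so any backend can stream it as it arrives.

```cpp
#include <string>
#include <vector>

// One possible shape for a streaming output callback (names are hypothetical).
struct stream_writer {
  virtual void names(const std::vector<std::string>& column_names) = 0;
  virtual void row(const std::vector<double>& values) = 0;   // one draw / iteration
  virtual void message(const std::string& text) = 0;         // logger-style info
  virtual ~stream_writer() = default;
};

// e.g. a CSV backend appends a line per row(); a netCDF backend extends an
// unlimited "draw" dimension; neither needs the whole run in memory.
```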

2 Likes

Excuse my tangent if this isn’t pertinent (I don’t feel sufficiently expert in this area to speak with confidence), but I wonder if these kinds of open questions suggest a so-called self-describing format like HDF5 might be best, such that we can use a header field to denote a version number and link version numbers to specific choices for the things you’re talking about. Subsequent interfaces would then simply check the version and assume that version’s specific features when decoding.
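As a purely illustrative sketch of what I mean (hypothetical attribute name, using the netcdf-cxx4 API as an example of a self-describing container):

```cpp
#include <netcdf>
#include <string>

// Writer side: stamp the file with a format version.
void write_version(netCDF::NcFile& file) {
  file.putAtt("stan_output_format_version", std::string("1"));  // hypothetical name
}

// Reader side: check the version before deciding how to decode the rest.
std::string read_version(netCDF::NcFile& file) {
  std::string v;
  file.getAtt("stan_output_format_version").getValues(v);
  return v;
}
```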

I don’t know enough about this either - that said, self-describing formats sound like a good thing.

1 Like

I think we need a design document for this if the goal is to gather any sort of rough consensus.

I also think there are two different (potential) IO re-factors. The first concerns what kind of format should be used by the callback writers in Stan. Currently these write CSV lines (with some exceptions). They could write something else (e.g., JSON, CBOR, Protobuf). Note that serialization formats which cannot stream data are not useful in this case. The second re-factor concerns the serialization of the draws and metadata after sampling has finished. CSV is used here but Arrow or HDF or Parquet would likely be better.

2 Likes

I was slightly involved in the discussions that led to the design document @mitzimorris shared. There are IMHO at least two tangled issues:

  1. An abstraction over the output in code (i.e. the internal interface) - currently what is passed around is vectors of doubles or vectors of strings. It would be beneficial if this could be extended so that (an incomplete list of stuff I recall, not an official agreement of the team; see the sketch after this list):
    • Switching some output streams on/off (diagnostics, unconstrained params, …) is easy, without a performance hit for evaluating whether an output is on/off.
    • The outputs of all methods (sampling, optimizing, ADVI, potentially something in the future) are unified.
    • Additional streams of diagnostics that have a different format/frequency than “a set of values per iteration” can be created (the original discussion was motivated by a desire to stream more details about divergent trajectories out of the sampler).
    • Type information is maintained - most notably, in the current implementation, int values from generated quantities are converted to double and some info (I think adaptation etc.) is streamed as strings.
  2. Choosing a target format for the serialization (i.e. the outside interface) - which seems to be the present concern.
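To make 1) concrete, here is a minimal sketch of the kind of abstraction I have in mind (my own illustration, not an agreed design; all names are hypothetical):

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <variant>
#include <vector>

// A single typed cell: keeps int vs double vs string instead of flattening.
using value = std::variant<std::int64_t, double, std::string>;

class output_relay {
 public:
  // A disabled stream costs one lookup per call and nothing more.
  void enable(const std::string& stream) { enabled_.push_back(stream); }

  bool active(const std::string& stream) const {
    for (const auto& s : enabled_)
      if (s == stream) return true;
    return false;
  }

  // One row of a regular per-iteration stream ("draws", "diagnostics", ...).
  void row(const std::string& stream, const std::vector<value>& values) {
    if (!active(stream)) return;
    // dispatch to whatever backend (CSV, netCDF, ...) is registered for this stream
  }

  // Irregular streams, e.g. one record per divergent trajectory.
  void record(const std::string& stream,
              const std::vector<std::pair<std::string, value>>& fields) {
    if (!active(stream)) return;
    // backends decide how a keyed record is serialized
  }

 private:
  std::vector<std::string> enabled_;
};
```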

Now 1) has also been a bit of a bike shed and the discussion didn’t really move forward (I admit I was sometimes a less-than-ideal part of those conversations). But I think solving 2) would be easier and a bit less contentious if 1) were implemented well, as supporting additional formats, or switching to them, would then be easier.

Hope you will be able to move this forward now :-)

3 Likes

These are separate issues in part, because a solution to (1) could support multiple solutions to (2). But we can’t just jump to (2), because (1) determines what gets serialized in (2).

I wouldn’t have characterized the prior discussions as bikeshedding so much as everyone having different opinions on the specifics of (2). That’s true even for our current set of outputs and even more of a problem for new proposals like optionally adding trajectory information. Issues arose such as whether we need a binary or human-readable serialization format, whether we needed metadata on every row of output, whether NaN and Infinity and denormalized numbers were in scope, whether various kinds of output got serialized to the same stream (file) or separate ones, and similar questions. Bikeshedding would’ve been debating the file names or the form of text output for timestamps that were never going to be parsed programmatically.

We also had differing opinions on (1), specifically on whether we wanted to go with something like @seantalts’s proposal of converting all the writers to static like a logger pattern and, if not, how many new writers we needed to support new output and how they’d be organized so they could be used by CmdStan, RStan, and PyStan. The issue there is that these all have different requirements, and different things are hard or easy for each depending on the choices made. We couldn’t even settle on Eigen vs. std::vector data structures on the inside because of questions about whether callbacks would be easy to write in Python or R.
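To show what that last point looks like in practice, here’s a minimal sketch of the two callback signatures we went back and forth on (illustrative only, not the actual writer classes):

```cpp
#include <vector>
#include <Eigen/Dense>

struct vector_writer {
  // Easy to wrap from R or Python (a plain list/array copy suffices),
  // but forces a copy when the sampler already holds an Eigen vector.
  virtual void operator()(const std::vector<double>& row) = 0;
  virtual ~vector_writer() = default;
};

struct eigen_writer {
  // Can be zero-copy from the sampler's internal state, but the interfaces
  // then need Eigen bindings just to receive a row of draws.
  virtual void operator()(const Eigen::Ref<const Eigen::VectorXd>& row) = 0;
  virtual ~eigen_writer() = default;
};
```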

1 Like

Sorry to stir what I know is a bit of a bees’ nest by bumping this topic, especially given my limited expertise relative to everyone else who has chimed in, but a recent complaint regarding slow cmdstanr::read_stan_csv performance reminded me of it.

During StanCon I learned of @ahartikainen and colleagues’ work on ArviZ, which uses NetCDF as its core file format, which in turn is really just HDF5 under the hood (as far as I understand).

Seeing as there is NetCDF expertise among Stan and Stan-adjacent devs, possible motivation for said devs to help Stan itself move to NetCDF to make their job over in ArviZ easier, and no clear winner otherwise in the format debate, should we pull the trigger and ask the NetCDF crew to start working on a CmdStan branch replacing CSV with NetCDF?

Obviously they have the freedom to create said branch on their own regardless of an invite, but I think it would be more likely to actually happen if they were assured that their work would be a welcome solution.

Thoughts? Any major downsides to NetCDF specifically that make this an absolute non-starter?

1 Like

Oh, and what I’m suggesting is what @ariddell and @martinmodrak refer to as the second refactor target, where we just want to be writing to file more efficiently. (Unless I’m missing something and netcdf is somehow also useful for the internal refactor target too.)

Given that NetCDF/HDF5 is fairly flexible, I don’t see a big reason to wait until the internal refactor stuff is solidified before the external refactor begins.

2 Likes

I like the idea of HDF5. I used the format extensively while working in scientific computing. It’s reasonably fast and friendly to multiprocessing (MPI). On the other hand, when working on large Stan projects I frequently find myself running grep/awk on the output file before the run finishes, and with CSV that’s straightforward. It’s also easy to inspect crashed/terminated results. Doing this with HDF5 is possible but involves much more work.

2 Likes

IIRC there are problems with HDF5. I remember data corruption being among them.

Here’s a blog post that seems relevant:

I suspect that the Apache Arrow/Feather folks also describe the issues with HDF5 in great detail somewhere.

1 Like

Hi, I can drop some of my ideas here too. Here are some comments I made in Slack.

There now exists a fully rewritten C++ API: http://unidata.github.io/netcdf-cxx4/index.html

Possibility for unlimited dimensions: https://www.unidata.ucar.edu/software/netcdf/docs/unlimited_dims.html

And the possibility to access/write data in parallel:
https://www.unidata.ucar.edu/software/netcdf/docs/parallel_io.html

And this could help create different groups for different parts of the inference (for Stan I think: warmup, diagnostics, posterior, sampler settings, etc.), with the possibility for multiple chains (maybe with unlimited dimensions).

Also, this could enable us to write all the (leapfrog?) steps that are calculated but not saved currently (e.g., for divergent transitions).
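A rough sketch of what that layout could look like with the netcdf-cxx4 API (illustrative only, not a design proposal; group and variable names are made up, and groups need the netCDF-4 file format):

```cpp
#include <netcdf>
#include <string>
#include <vector>

int main() {
  using namespace netCDF;

  // Groups require the netCDF-4 (HDF5-based) file format.
  NcFile file("output.nc", NcFile::replace, NcFile::nc4);

  // One group per kind of output.
  NcGroup posterior = file.addGroup("posterior");
  NcGroup warmup = file.addGroup("warmup");
  NcGroup sample_stats = file.addGroup("sample_stats");

  // Unlimited "draw" dimension, so rows can be appended while sampling runs.
  NcDim draw = posterior.addDim("draw");        // no size => unlimited
  NcDim chain = posterior.addDim("chain", 4);
  NcVar lp = posterior.addVar("lp__", ncDouble, {draw, chain});

  // Append one value for draw 0, chain 0 (start/count hyperslab write).
  double value = -7.3;
  std::vector<size_t> start{0, 0};
  std::vector<size_t> count{1, 1};
  lp.putVar(start, count, &value);

  // Global metadata lives alongside the draws in the same file.
  file.putAtt("stan_version", std::string("2.x"));
  return 0;
}
```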

....
Oh, now I get it: you need to install the netcdf4 lib, in the same way TBB is needed (a runtime dependency?).

So there are some benefits of using netCDF. Possibility to keep metadata + sampling data in different locations but still in one file. Possibility to add new/different sampling data if needed. Parallel write. Possibility to write and read.

For Stan, I don’t think adopting the InferenceData structure is needed. It should stay focused on the job: saving samples and other needed information.

The downside is that the C++ interface needs a runtime installation to work.

2 Likes

Apache Arrow is a serious project which might be a good idea to check out.

I’ve been using it at a pretty high volume for the last several years (on Mac, Linux, and Windows) and have never experienced corruption problems. But I will do some googling to see what I can find.

But do we have anyone with experience with it and the motivation to add it to Stan?

Or am I possibly misrepresenting the likelihood that you and your ArviZ crew will have these (expertise, motivation) with respect to NetCDF?

It depends on which interface we are now talking about.

Currently we support Python + Julia + CmdStan (CSV) interfaces, and all of them take data from the native datatype (e.g. a PyStan fit object) and convert it to the netCDF format.

If we want our samplers to write straight to netCDF, we need to use the C++ interface. That would need someone with C++ experience. What ArviZ would need to do after this is some editing/handling of groups etc. (netCDF → netCDF).

If we go with a netCDF/HDF5 solution, or with any solution, I don’t think making the needed writers would be the bottleneck for adoption. Figuring out a good way to split the different parts of the data (e.g. metadata, sampling data, optional extra data such as leapfrog steps) and handling them in a meaningful and “simple” way is the hardest part.

I also mentioned in the Slack conversation: could we just save everything into separate files (text/binary) and wrap all the files into a .zip container? By wrap, I mean saving straight to a zip (or any other container). In the end, we are saving tabular data + metadata. Is there any data that could be multidimensional?

1 Like