Dataset storage and access in R and Python (Wes McKinney's 2017 Outlook)

Likely of interest to everyone, I think: 2017 Outlook: pandas, Arrow, Feather, Parquet, Spark, Ibis

For us the relevant part is an update on Feather. Feather is a project that Wes and Hadley have been working on that provides a way to interact with Apache Arrow formatted files from Python and R. My sense is that Apache Arrow (“standard for high-performance in-memory columnar data structures and IO”) is a push to get something better than HDF5, but I could be wrong about that.

Needless to say, this is something that might be great for Stan. If people do have a gigabyte of draws, they probably do want to store it in something like this.

Last year Wes gave a talk at the NYC R Conference. He specifically said it wasn’t production ready, citing a few key points, including the stability of the API. There were a lot of good reasons to use it, so when it’s mature, I think it would be a great option to consider.

Of course, if anyone wants to come up with a design that will work across all our platforms (OS x computing framework x compilers), coordinate the work, and actually implement something, then I wouldn’t mind shifting to an early technology that has a lot of potential. I’m not volunteering to take that on because I think I know enough of the risks involved, but I’m sure someone who knows how to handle IO, changing APIs, and user installation issues better than I do can figure this out.

What isn’t ready? These messages don’t thread for me,
so I lost the context.

  • Bob

The feather file format isn’t ready,

Yes, that’s right. As of last year, the feather file format wasn’t ready. Given the way Wes wrote the 2017 outlook, it looks like the code bases are going to merge, and it’s not just putting two projects together but actually consolidating two APIs. If we’re trying to use it for a prototype, I think it’s fine, but to support it for all Stan users, I think it’s too early.

For better or worse, CSV is still the industry standard for moving data as text. Recently, the data.table folks released 1.9.10, which includes a very fast CSV writer (fwrite) by Matt Dowle that competes with, if not exceeds, the speed of Feather (both uncompressed). See http://blog.h2o.ai/2016/04/fast-csv-writing-for-r/.

Is there any way for Stan to leverage that or similar code?

Neither of them allows stream compression yet, although I think both eventually plan to. One benefit of CSV, I think, is that because it isn’t binary, standard compression algorithms can work on it more efficiently, but I may be completely wrong about that.
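As a point of reference, streaming gzip compression over a CSV of draws is already straightforward with only the Python standard library. A minimal sketch, with invented column names and draw values:

```python
import csv
import gzip
import os
import tempfile

# Hypothetical draws: two iterations of two parameters, plus lp__.
header = ["lp__", "mu", "sigma"]
draws = [[-7.1, 0.12, 1.05], [-6.9, 0.08, 0.98]]

path = os.path.join(tempfile.mkdtemp(), "draws.csv.gz")

# Write rows through a streaming gzip compressor, one at a time,
# so the full uncompressed CSV never has to exist at once.
with gzip.open(path, "wt", newline="") as f:
    w = csv.writer(f)
    w.writerow(header)
    w.writerows(draws)

# Read it back, decompressing on the fly.
with gzip.open(path, "rt", newline="") as f:
    rows = list(csv.reader(f))

print(rows[0])    # ['lp__', 'mu', 'sigma']
print(len(rows))  # 3
```

This is stream compression bolted on from outside the writer, of course, not the writer compressing natively; the point is only that text output composes cleanly with standard compressors.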

Stan cannot encode chain metadata (non-tabular) and draws (tabular) in a
single CSV file. With feather or HDF5 or JSON this is possible.
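As a toy illustration of that point, a single JSON document can carry both the non-tabular chain metadata and the tabular draws. The field names here are invented for the example, not a proposed Stan schema:

```python
import json

# Hypothetical chain output: metadata is a nested mapping,
# draws are a plain table (list of rows) with named columns.
output = {
    "metadata": {
        "model": "eight_schools",
        "seed": 1234,
        "adaptation": {"stepsize": 0.12, "metric": [1.0, 0.8]},
    },
    "columns": ["lp__", "mu", "tau"],
    "draws": [
        [-7.1, 4.2, 1.3],
        [-6.9, 3.8, 1.1],
    ],
}

text = json.dumps(output)   # one self-describing file
parsed = json.loads(text)   # round-trips losslessly

print(parsed["metadata"]["seed"])  # 1234
print(parsed["columns"])           # ['lp__', 'mu', 'tau']
```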

While it’s fun to rehash this stuff, it also exists in previous discussions and the wiki. There were a lot of proposals floating around, and we had converged on CSV files for samples plus additional files for metadata. I think the format for the metadata was not settled. Binary output is a second, more complex issue that Feather might solve, but I think it’s going to be a major undertaking to make sure it plumbs through all the interfaces correctly.

K

The way I look at it, we settled on callback functions.
Then people can write whatever data transport layer they
want over that!

The piece I really want is a var_context implemented with
JSON and another with Protocol Buffers. Then we could offer
that as an option for data ingestion. For output, I think
CSV is fine for now. It’d be nice to make that binary in
the future.
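To make the ingestion idea concrete, here is a toy sketch of reading a Stan-style data file from JSON with Python’s standard library. The variable names and the name-to-values accessor are illustrative only, not the actual var_context interface:

```python
import json

# A Stan data block like `int N; vector[N] y;` maps naturally
# onto a flat JSON object of named values.
data_file = '{"N": 3, "y": [2.1, -0.4, 1.7]}'

data = json.loads(data_file)

# A var_context-like accessor: look up a variable by name and
# return its values as a flat list (scalars become length-1).
def vals_r(name):
    v = data[name]
    return v if isinstance(v, list) else [v]

print(vals_r("N"))  # [3]
print(vals_r("y"))  # [2.1, -0.4, 1.7]
```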

  • Bob

(I agree with Bob’s note about the callback functions. Thanks Daniel!)

A task for someone interested in this is to come up with a design
document and/or refactor the existing wiki pages related to this. The
wiki pages include at least:

One thing we all agree on, I think, is that metadata and draws should
not be in the same CSV file.

p.s. why are github wikis so hard to search? Mediawiki wikis (like
Wikipedia) are so much better in this respect.

edit: added back a bit that the Discourse e-mail had stripped

All right, I’ll join in on the re-hashing:

Bob_Carpenter (http://discourse.mc-stan.org/users/bob_carpenter), Developer, January 4:

The way I look at it, we settled on callback functions.
Then people can write whatever data transport layer they
want over that!

Yesbut (<- one word), right now we’ve standardized on CSV files with mangling, and we could legitimately have a better default.

The piece I really want is a var_context implemented with
JSON and another with Protocol Buffers.

There have been a few iterations of Protocol-Buffers-like projects out there, and I’m no longer sure it’s the best one to lean on. Some of the others don’t do as much encoding, and some look like they might work with Eigen::Map without copies (!). Specifically, I’ve played around with Cap’n Proto a little, and there’s also FlatBuffers.

Then we could offer
that as an option for data ingestion. For output, I think
CSV is fine for now. It’d be nice to make that binary in
the future.

Sure. I basically did a binary format (on a branch somewhere…) via Protocol Buffers, but we never seriously talked about merging it. What stopped me from pushing it is that, with all the discussion about refactoring message logging, anything as rigid as a Protocol Buffers schema would have to be completely rewritten once logging is fixed up a little more.
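For context, a Protocol Buffers schema for chain output might look roughly like the following. This is a hypothetical sketch, not the schema from the branch mentioned above; it shows the rigidity at issue, since reshaping any message (say, once logging is refactored) means changing the schema and regenerating code everywhere:

```protobuf
// Hypothetical sketch of a schema for Stan output -- not the
// actual branch's schema.
syntax = "proto3";

package stan.sketch;

message ChainMetadata {
  string model_name = 1;
  uint32 seed = 2;
  double stepsize = 3;
  repeated double metric = 4;
}

message Draw {
  repeated double values = 1;  // one row, ordered as in `columns`
}

message ChainOutput {
  ChainMetadata metadata = 1;
  repeated string columns = 2;
  repeated Draw draws = 3;
}
```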

Krzysztof

Your post did not actually include any re-hashing. It only quoted Bob. Please re-re-hash.

Is a Keccak-512 hash of re-re-hash good enough?

dac22be56b74d77e29df9b8160ef1e95ef8da97d9585a5dc03ab169aa87b85f2c665a426bae10d43feebc47acaa75920452a14e099d9eb020ea56ede403c6e7c

:)