Status of new writer?

mike-lawrence · February 21, 2018, 5:28pm

For showing diagnostics during sampling for bigger models, the current sample file output is a pain to deal with. I understand that Google’s protocol buffers are being explored as a new approach. What’s the status of this? I think I see that it’s being used in httpstan; will each interface be implementing its own writer separately?

Krzysztof_Sakrejda · February 21, 2018, 5:55pm

Here are links to my initial work on getting this into Stan core (it’s part of the GSoC project for 2018 if we manage to recruit someone who we think will be effective):

Originally @ariddell was involved but we wanted to emphasize different aspects of protobuf and he decided to do httpstan on his own. I hadn’t heard that protobuf was being used there so I’m as curious about plans for it as you are.

ariddell · February 23, 2018, 1:33pm

httpstan munges the writer output into protobuf in ~~C++ (generated by Cython)~~ Python. A dedicated protobuf writer in Stan C++ would definitely be preferable.

The httpstan protobuf schema could serve as a starting point if anyone were interested in this. Here it is: https://github.com/stan-dev/httpstan/blob/master/protos/callbacks_writer.proto Here’s where the conversion happens: https://github.com/stan-dev/httpstan/blob/master/httpstan/callbacks_writer_parser.py

Bob_Carpenter · February 23, 2018, 7:05pm

It’d be nice to have that. It’d require settling on a format with which to write. I looked at the doc in callbacks_writer.proto and was wondering what actually gets sent. Specifically, is it going to need to ship the metadata with each draw?

Presumably there’s already a standard implementation of protocol buffers that’s not copylefted.

Then we’d have to make sure it wasn’t too hard for you to integrate into Python—I don’t know what that will require.

ariddell · February 23, 2018, 7:12pm

Yes, it does ship the metadata with each draw. Having each message be self-describing seemed like an advantage.

protobuf is BSD licensed, so the code generated is BSD. httpstan is ISC licensed.

edit: add protobuf license details

Bob_Carpenter · February 23, 2018, 7:28pm

Why? It just seems overly heavy to me compared to shipping the header once then shipping the rows. I know we’ve had this discussion before, so sorry for being repetitive.

httpstan licensing wouldn’t matter as you’re asking for something in Stan and that’s a reverse dependency.
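To make the size trade-off concrete, here is a rough sketch comparing the two schemes. It is only an illustration: JSON stands in for the protobuf encoding, and the parameter names and values are made up rather than taken from httpstan or Stan, so the absolute byte counts would differ with a real protobuf schema.

```python
import json

# Hypothetical parameter names and draws, for illustration only.
names = ["lp__", "mu", "tau", "theta.1", "theta.2"]
draws = [
    [-4.2, 0.1, 1.3, 0.5, -0.7],
    [-3.9, 0.2, 1.1, 0.4, -0.6],
    [-4.0, 0.0, 1.2, 0.6, -0.8],
]


def self_describing(names, draws):
    """One message per draw, each carrying its own parameter names."""
    return [json.dumps(dict(zip(names, row))) for row in draws]


def header_then_rows(names, draws):
    """One header message, then one bare row of values per draw."""
    return [json.dumps(names)] + [json.dumps(row) for row in draws]


def total_bytes(messages):
    """Total encoded size of a list of messages."""
    return sum(len(m.encode("utf-8")) for m in messages)


per_draw = total_bytes(self_describing(names, draws))
once = total_bytes(header_then_rows(names, draws))
print(per_draw, once)
```

With the header sent once, the name overhead is paid a single time instead of once per draw, so the gap grows with the number of draws and the number of parameters.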

mitzimorris · February 23, 2018, 7:41pm

I agree with Bob - shipping metadata is extremely expensive, and the slowdown and overhead far outweigh any advantages.

ariddell · February 23, 2018, 7:42pm

My thinking: (1) no need for optimization, we have all the time in the
world between (slow) draws so no need to worry about how fast things get
transferred between Stan C++ and the interfaces and (2) explicit is
better than implicit.

Of course when it comes to storage, we should definitely optimize
things and not repeat the metadata. I’m just thinking here about the
narrow task of sending a message from Stan C++ to the interfaces via
protobuf.
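A minimal sketch of that split, assuming the messages arrive as self-describing records and are compacted only when stored. Plain dicts stand in for decoded protobuf messages, and the function name `compact` is hypothetical, not part of httpstan or Stan.

```python
def compact(messages):
    """Collapse self-describing draws into (header, rows) for storage.

    `messages` is an iterable of dicts mapping parameter name -> value,
    a hypothetical stand-in for decoded protobuf messages.
    """
    header = None
    rows = []
    for msg in messages:
        if header is None:
            # Parameter names, in the order of the first message.
            header = list(msg)
        elif list(msg) != header:
            # Names or their order differ from the first message.
            raise ValueError("message does not match the first header")
        rows.append([msg[name] for name in header])
    return header, rows


stream = [{"mu": 0.1, "tau": 1.3}, {"mu": 0.2, "tau": 1.1}]
header, rows = compact(stream)
```

Because every message carries its names, the storage step can also check each draw against the header before dropping the repeated metadata.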

mitzimorris · February 23, 2018, 7:47pm

Saying something once is explicit. Saying something every time is a waste of processing cycles.

Bob_Carpenter · February 23, 2018, 7:48pm

Unless the packing into protobufs is done in parallel with spare cores and there is unused transmission bandwidth, there will be additional latency. If it’s over a network, the cost is higher. Shipping data isn’t free—it clogs the data highways. When we start adding multi-core parallelism (what I should be doing now), the memory demands will go way up. So while I can believe you couldn’t measure a difference on a dedicated server, I think things would be different if you’re running multiple jobs and trying to broadcast over a network.

I can’t imagine that any of the database interfaces ship the metadata with each row of data, for example. You may argue that they have much less latency, but it’s the same thing: they have these huge database computations and usually the data’s not a big issue until it is.

sakrejda · February 23, 2018, 8:24pm

I agree that we can be explicit once and lose nothing while gaining some simplicity. I probably want more shipped than Bob does (I’d like to ship out a copy of the input config) but not repeat info. I don’t recall which way I wrote my writer but agree that getting agreement on .proto files is most of the work.

ariddell · February 23, 2018, 9:18pm

I’ll defer to the .proto written by whoever ends up writing the PR which
adds the proto writer and associated tests to Stan.

Bob_Carpenter · February 23, 2018, 9:38pm

I think that’s a good idea. It’s part of the output in CmdStan.

I’m just reluctant to congest our main data pipeline with redundant schema specifications. It might be different if something was tapping into that pipe asynchronously and couldn’t query the header. Right now, I don’t see any way a partial result (only some of the rows without the header) can be reconstructed usefully given the way we analyze posterior output. Am I missing some kind of use case where it’s more robust to have all this on each line of output?

sakrejda · February 23, 2018, 9:56pm

Bob_Carpenter wrote:

“It might be different if something was tapping into that pipe asynchronously and couldn’t query the header.”

Even then you can requery the header rather than forcing it to be sent every time.
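Roughly what I mean by requerying, as a toy sketch. The `DrawChannel` class and its methods are hypothetical and only illustrate the idea; nothing like this exists in Stan or httpstan as far as this thread establishes.

```python
class DrawChannel:
    """Hypothetical channel: header stored once, rows streamed without names."""

    def __init__(self, names):
        self._names = list(names)
        self._rows = []

    def header(self):
        # A consumer may ask for the header at any time.
        return list(self._names)

    def publish(self, row):
        if len(row) != len(self._names):
            raise ValueError("row length does not match header")
        self._rows.append(list(row))

    def rows_from(self, start):
        # Rows published at index `start` or later.
        return [list(r) for r in self._rows[start:]]


channel = DrawChannel(["mu", "tau"])
channel.publish([0.1, 1.3])
channel.publish([0.2, 1.1])
channel.publish([0.3, 0.9])

# A consumer that attaches after the first draw queries the header itself
# instead of relying on it being repeated in every row.
names = channel.header()
late = [dict(zip(names, row)) for row in channel.rows_from(1)]
```

The late consumer still ends up with fully labelled draws, and the rows on the main pipe stay free of repeated names.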

sakrejda · February 23, 2018, 11:57pm

While I understand the motivation, that’s just not good for the project.

sakrejda · February 24, 2018, 12:01am

I mean, I have a writer with thorough test coverage that could go into a PR right now, but we’re only part way to having a format. Part of the reason it never made it into a PR was feedback about design and instead of shoving a PR through I ended up supporting the work @syclik did on services and loggers. It sucks that this is so slow but long-term I think it’s worth the effort (and the associated design docs look a lot better than they did at the time).

ariddell · February 24, 2018, 12:05am

I’m really quite happy to defer to your design.

sakrejda · March 8, 2018, 1:13pm

I went ahead and implemented something hopefully simple enough. I’ll share in the next few weeks once it’s a functional example.
