Status of new writer?

For showing diagnostics during sampling for bigger models, the current sample file output is a pain to deal with. I understand that Google’s protocol buffers are being explored as a new approach. What’s the status of this? I think I see that it’s being used in httpstan; will each interface be implementing its own writer separately?

Here are links to my initial work on getting this into Stan core (it’s part of the GSoC project for 2018 if we manage to recruit someone who we think will be effective):

Originally @ariddell was involved, but we wanted to emphasize different aspects of protobuf and he decided to do httpstan on his own. I hadn’t heard that protobuf was being used there so I’m as curious about plans for it as you are.


httpstan munges the writer output into protobuf in C++ (generated by Cython) Python. A dedicated protobuf writer in Stan C++ would definitely be preferable.

The httpstan protobuf schema could serve as a starting point if anyone were interested in this. Here it is: Here’s where the conversion happens:

It’d be nice to have that. It’d require settling on a format with which to write. I looked at the doc in callbacks_writer.proto and was wondering what actually gets sent. Specifically, is it going to need to ship the metadata with each draw?

Presumably there’s already a standard implementation of protocol buffers that’s not copylefted.

Then we’d have to make sure it wasn’t too hard for you to integrate into Python—I don’t know what that will require.

Yes, it does ship the metadata with each draw. Having each message be self-describing seemed like an advantage.

protobuf is BSD licensed, so the code generated is BSD. httpstan is ISC licensed.

edit: add protobuf license details

Why? It just seems overly heavy to me compared to shipping the header once then shipping the rows. I know we’ve had this discussion before, so sorry for being repetitive.

httpstan licensing wouldn’t matter as you’re asking for something in Stan and that’s a reverse dependency.

I agree with Bob - shipping metadata is extremely expensive, and the slowdown and overhead far outweigh any advantages.
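
To put a rough number on the overhead being debated, here is a minimal Python sketch that compares the two framings by payload size. It does not use protobuf or any Stan or httpstan code; the parameter names, the JSON header, and the packed-double row layout are all assumptions made for illustration, so the ratio it prints says nothing about what a real protobuf encoding would cost.

```python
import json
import struct

# Hypothetical parameter names; a real header would come from the Stan model.
names = ["lp__", "accept_stat__"] + [f"theta.{i}" for i in range(1, 9)]
# Made-up draws: 1000 rows, one value per parameter.
draws = [[0.1 * i + j for j in range(len(names))] for i in range(1000)]


def pack_row(values):
    """Pack one draw as little-endian doubles (8 bytes per value)."""
    return struct.pack(f"<{len(values)}d", *values)


def header_bytes(names):
    """Encode the column names as UTF-8 JSON (an assumed header format)."""
    return json.dumps(names).encode("utf-8")


def size_header_once(names, draws):
    """Total bytes when the header is sent once, followed by bare rows."""
    return len(header_bytes(names)) + sum(len(pack_row(d)) for d in draws)


def size_header_every_draw(names, draws):
    """Total bytes when every message carries its own copy of the header."""
    h = len(header_bytes(names))
    return sum(h + len(pack_row(d)) for d in draws)


once = size_header_once(names, draws)
every = size_header_every_draw(names, draws)
print(f"header once:       {once} bytes")
print(f"header every draw: {every} bytes")
print(f"ratio:             {every / once:.2f}x")
```

The extra cost is exactly one header per additional draw, so it grows with the length of the parameter names rather than with the size of the draw itself.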

My thinking: (1) no need for optimization, we have all the time in the
world between (slow) draws so no need to worry about how fast things get
transferred between Stan C++ and the interfaces and (2) explicit is
better than implicit.

Of course when it comes to storage, we should definitely optimize
things and not repeat the metadata. I’m just thinking here about the
narrow task of sending a message from Stan C++ to the interfaces via

Saying something once is explicit. Saying something every time is a waste of processing cycles.

Unless the packing into protobufs is done in parallel with spare cores and there is unused transmission bandwidth, there will be additional latency. If it’s over a network, the cost is higher. Shipping data isn’t free—it clogs the data highways. When we start adding multi-core parallelism (what I should be doing now), the memory demands will go way up. So while I can believe you couldn’t measure a difference on a dedicated server, I think things would be different if you’re running multiple jobs and trying to broadcast over a network.

I can’t imagine that any of the database interfaces ship the metadata with each row of data, for example. You may argue that they have much less latency, but it’s the same thing: they have these huge database computations and usually the data’s not a big issue until it is.

I agree that we can be explicit once and lose nothing while gaining some simplicity. I probably want more shipped than Bob does (I’d like to ship out a copy of the input config) but not repeat info. I don’t recall which way I wrote my writer but agree that getting agreement on .proto files is most of the work.
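
One way to picture "explicit once" is a stream whose first frame carries the input config and column names, followed by bare draws. The sketch below is only an illustration in plain Python, not the writer mentioned above and not a protobuf schema; the frame tags `H` and `D`, the length-prefixed framing, and the config keys are all hypothetical.

```python
import io
import json
import struct


def write_frame(stream, kind, payload):
    """Write one frame: 1-byte kind tag, 4-byte length, then the payload."""
    stream.write(struct.pack("<cI", kind, len(payload)))
    stream.write(payload)


def read_frames(stream):
    """Yield (kind, payload) pairs until the stream is exhausted."""
    while True:
        prefix = stream.read(5)
        if len(prefix) < 5:
            return
        kind, length = struct.unpack("<cI", prefix)
        yield kind, stream.read(length)


def write_run(stream, config, names, draws):
    """Send config and names once in a header frame, then one frame per draw."""
    header = {"config": config, "names": names}
    write_frame(stream, b"H", json.dumps(header).encode("utf-8"))
    for d in draws:
        write_frame(stream, b"D", struct.pack(f"<{len(d)}d", *d))


def read_run(stream):
    """Rebuild named draws; assumes the header frame arrives before any draw."""
    header, rows = None, []
    for kind, payload in read_frames(stream):
        if kind == b"H":
            header = json.loads(payload.decode("utf-8"))
        elif kind == b"D":
            n = len(header["names"])
            values = struct.unpack(f"<{n}d", payload)
            rows.append(dict(zip(header["names"], values)))
    return header, rows


buf = io.BytesIO()
config = {"method": "sample", "num_samples": 2, "seed": 1234}  # hypothetical keys
write_run(buf, config, ["lp__", "mu"], [[-3.5, 0.25], [-3.0, 0.5]])
buf.seek(0)
header, rows = read_run(buf)
print(header["config"])
print(rows)
```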

I’ll defer to the .proto written by whoever ends up writing the PR which
adds the proto writer and associated tests to Stan.

I think that’s a good idea. It’s part of the output in CmdStan.

I’m just reluctant to congest our main data pipeline with redundant schema specifications. It might be different if something was tapping into that pipe asynchronously and couldn’t query the header. Right now, I don’t see any way a partial result (only some of the rows without the header) can be reconstructed usefully given the way we analyze posterior output. Am I missing some kind of use case where it’s more robust to have all this on each line of output?




Even then you can requery the header rather than forcing it to be sent every time.
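
A toy illustration of the "requery the header" idea, in plain Python: the channel keeps the header around, rows go out with no metadata, and a reader that joins mid-run asks for the header the first time it needs it. The class and method names here are hypothetical and do not correspond to anything in Stan or httpstan.

```python
import json
import struct


class DrawChannel:
    """Toy in-process channel that retains the header for late readers."""

    def __init__(self, names):
        self._header = json.dumps(names).encode("utf-8")
        self._subscribers = []

    def header(self):
        """Return the header bytes on request, at any point in the run."""
        return self._header

    def subscribe(self, callback):
        """Register a callable that receives each subsequent row."""
        self._subscribers.append(callback)

    def publish(self, values):
        """Send one draw as packed doubles, with no metadata attached."""
        row = struct.pack(f"<{len(values)}d", *values)
        for callback in self._subscribers:
            callback(row)


class LateReader:
    """Reader that fetches the header lazily, on the first row it sees."""

    def __init__(self, channel):
        self._channel = channel
        self._names = None
        self.rows = []

    def on_row(self, row):
        if self._names is None:
            # Requery the header instead of expecting it in every message.
            self._names = json.loads(self._channel.header().decode("utf-8"))
        values = struct.unpack(f"<{len(self._names)}d", row)
        self.rows.append(dict(zip(self._names, values)))


channel = DrawChannel(["lp__", "mu"])
channel.publish([-3.5, 0.25])       # nobody is listening yet
reader = LateReader(channel)
channel.subscribe(reader.on_row)    # joins partway through the run
channel.publish([-3.0, 0.5])
print(reader.rows)
```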

While I understand the motivation, that’s just not good for the project.

I mean, I have a writer with thorough test coverage that could go into a PR right now, but we’re only partway to having a format. Part of the reason it never made it into a PR was feedback about design, and instead of shoving a PR through I ended up supporting the work @syclik did on services and loggers. It sucks that this is so slow, but long-term I think it’s worth the effort (and the associated design docs look a lot better than they did at the time).

I’m really quite happy to defer to your design.

I went ahead and implemented something hopefully simple enough. I’ll share in the next few weeks once it’s a functional example.
