Schema for callback writers: one step towards retiring CSV

syclik · July 10, 2017, 9:54pm

@ariddell, I’m thinking of just instantiating the writer with the names. Then every call to the writer after that will be values. No passing of names, no ability to reset the names.

This is sort of a trick question. I really don’t think we need to pass around the vector of strings, especially down at where the inference algorithm uses them. But the IO will only suffer when it’s actually hitting the disk. I can imagine an implementation of the writer deciding to do what is natural for you and write out parameter name and parameter value while another one only writing out the values.

Bob_Carpenter · July 10, 2017, 10:09pm

I didn’t mean to imply there were two methods for getting rows. The fetchone() method seems to return the next row’s data without the column identifiers. Those can presumably be fetched separately. The DB spec may be unusual for Python, but what we’re doing is essentially pushing out database rows, so it seems like the right example to follow unless it got replaced with some new how-to-write-database-interfaces spec in Python.

bgoodri · July 10, 2017, 10:46pm

Oh, it is still there.

ariddell · July 11, 2017, 12:21am

This sounds like a good way to do it.

syclik · July 11, 2017, 12:50am

Mind opening an issue and tagging me on it? If you can point me to it, I’ll try to tackle it when I have a few free hours.

Bob_Carpenter · July 11, 2017, 3:29am

I dug a little deeper and saw this as the first item in their FAQ.

Note that the reason for not extending the DB API specification to also support dictionary return values for the .fetch*() methods is that this approach has several drawbacks:

Some databases don’t support case-sensitive column names or auto-convert them to all lowercase or all uppercase characters.

Columns in the result set which are generated by the query (e.g. using SQL functions) don’t map to table column names and databases usually generate names for these columns in a very database specific way.

Apparently they would’ve at least provided a way to return dictionaries for rows if it were portable.

syclik · August 10, 2017, 4:32pm

I just pushed an update to the wiki with a more detail. It’s not fully spec’ed out to the point where every detail is worked out, but if it looks reasonable, I can try to put together a prototype.

syclik · August 10, 2017, 4:33pm

Here’s the link to the new part:

ariddell · August 10, 2017, 6:54pm

I like this proposal but I wonder what Bob or someone else who may have encountered a similar scenario before thinks.

Are there some alternatives here to consider?

One alternative: could we have some time to evaluate how needed the change is right now, that is, with the recent change of having separate error and info writers? Or, alternatively, might the changes be introduced incrementally?

Is there still no interest in my proposal for a single writer “system bus” with different message types? Knowing very little about software engineering, I do like the spirit of that proposal. It seems familiar from the design of current RPC protocols such as HTTP/2: you call a function and get back one or more streams of messages. Additional context is contained in the schema of the message itself – that is, the data in the stream is modestly self-describing.

sakrejda · August 10, 2017, 9:09pm

Can you add something about the goal here? What the improvements are relative to the current implementation. Overall @ariddell’s proposal is at a higher level so I _think_they’re independent but it might help to address that too.

I think this actually brings us closer to making typed output possible so we can avoid the firehose of text we currently have.

sakrejda · August 10, 2017, 9:31pm

Currently some of what Daniel is suggesting is possible but it involves duplicating a lot of logic in the writer. For example, if you want to write out the iteration number you have to assume that the method that gets called with a std::vector is getting called once per iteration… which isn’t true with thinning, or you have to pass in the thin number, or something similar. Other things are even more difficult. So that leads you to duplicate logic in the call and in the writer, and we make no promises about how the writer is called. The spec so far is making thinning/etc… the writer’s job, which at least centralizes the logic.

This spec actually makes implementing the “system bus” approach easier since the parts of the API that are outputting different message types are separate methods in the writer. On a second read I’m pretty sure it’s just a different concern—the API is not specifying anything about how the output is written out, just which calls will get made to trigger writing and what information is available. For that reason I really encourage you to not view it as competing alternatives.

Bob_Carpenter · August 10, 2017, 10:46pm

I still haven’t digested Daniel’s proposal, but I may be the one who put him up to it!

I like the idea of a streaming representation that can be used in the spirit of HTTP. It could even be built on HTTP—I’m not much of a network protocol designer.

I do think that a streaming representation on the network should be built on top of a callback style in-memory interface. Then the communication and message protocols sit on top of the efficient low-level implementation (which isn’t adding any overhead versus just messaging directly).

I suppose someone could decide the intermediate callback layer of reprsentations wasn’t worth maintaining. I’d be convinced if we had a networking layer and all the interfaces used it.

Bob_Carpenter · August 10, 2017, 10:50pm

Thinning provides all sorts of interface issues. I don’t even know if we report the thinned or unthinned iteration number in the interfaces. Nor is it clear to me which would be better.

I could see thinning as some kind of filter on a writer (lower case, Sean). But it seems like thinning is exactly the kind of thing we wanted pulled back into the C++ side so all the interfaces didn’t have to implement it.

And yes, totally agree about there being two different levels to this.

System Bus Stan is parallel to CmdStan in my mind, though I think Allen’s thinking that PyStan would be built on top of System Bus Stan, which is also totally OK.

ariddell · August 11, 2017, 12:29pm

The case for “System Bus Stan” without the proliferation of different writers is that, when there’s some new kind of data that Stan is producing, you don’t need to update the function signatures and implement all the callback classes. You just learn how to parse the new kind of message coming down the pipe.

Maybe I can also make a comment here that’s a general response to Daniel’s spec. I think we might err on the side of letting interfaces shoot themselves in the foot if they want to. If they don’t know how to deal with something that Stan is producing, that’s their fault (assuming it’s well documented in Stan). My sense is that the callback classes approach (vs. system bus approach) might be, in part, an attempt to make sure if the interfaces screw something up, it’s a compile time error. Here I’d just note that we could consider giving interfaces more responsibility for figuring things out – with the gain being some generality/simplicity in the C++ API. After all, the interfaces are figuring out how to parse some moderately heterogeneous things right now.

syclik · August 11, 2017, 3:00pm

Sorry I haven’t been too clear with the goals. I’ll try to outline some of the problems that I see and that this proposal addresses.

Problem: it’s currently very hard to separate the sampler parameters, sample parameters, and the constrained parameters
- The use case is in showing sampler diagnostics and showing estimates of parameters (like in ShinyStan)
- This is currently being done by parsing the names of the output, searching for trailing double underscore, and asserting that those are not parameters
- Note: the goal is not to intentionally separate them out to ignore the sampler parameters or the sample parameters from the draws. It’s so we can manipulate them easily; the sampler and sample parameters, for a given Stan version, will be the same across Stan programs whereas the number of unconstrained parameters are not.
Problem: to output unconstrained parameters, it also outputs the sampler parameters and sample parameters
- Since everything is output at once, to get the “diagnostic” output, we duplicate a lot of output
Problem: there is no mechanism to output the constrained parameters separate from transformed parameters or generated quantities or unconstrained parameters
Problem: it’s hard, in implementation, to filter parameters
- RStan and PyStan both have the ability to filter parameters. This would make the logic a lot easier.
- Allen: the system bus makes this very inefficient.
Problem (design): the algorithm code currently knows about how to output constrained parameters by taking the unconstrained parameters and pushing it through the transforms
- The sampler should just operate on the unconstrained parameters and just use a templated model template that exposes a log_prob() function. Instead, the sampler itself calls the transform code taking unconstrained parameters to constrained parameters making it a little more difficult using the algorithm code with another C++ model template not generated by Stan.
- All that logic would just be moved from the sampler code to the services layer, which conceptually is the right place for these things to be happening.
Problem (design): changing sampler and sampler parameters change all of the output and there’s no decoupling between that and the unconstrained, constrained, transformed parameters and the generated quantities
- The current output is just one std::vector<double> per iteration. Adding a sampler parameter actually affects everything that uses the output. By splitting the output into separate chunks, we can independently update parts. Ideally, the parts that handle the unconstrained parameters, constrained parameters, transformed parameters, and generated quantities don’t have to change often.
Problem: we promote ints to doubles for the sampler parameters (and also generated quantities)
- We can fix this for the sampler parameters. Can’t do it easily for generated quantities.

Regarding thinning: I’m trying to think of @bgoodri’s use case with the accumulators. I think the cleanest thing is to always pass the sampler parameters, sample parameters, unconstrained parameters, constrained parameters at each iteration and allow the writer handle thinning however it wants. For the use case where it goes into memory and is thinned, the writer just needs to keep the thinned samples and do nothing with the rest. Now, if there’s an accumulator for means / variances / whatever, I don’t know if that should work on every iteration or just the thinned iterations. If that were to move back into the service method, which is reasonable, do we only pass back the values every time it’s supposed to be saved? If so, there’s really no way to have an accumulator for every iteration.

Regarding a system bus: I’m not opposed to that as the underlying implementation, but it doesn’t help solve a lot of these problems. There’s just no efficient way to selectively save parameter. In order to do that, you need context to figure out what the message means, which means inspecting the message. Then doing something to break apart the message, then filtering, then reconstructing it. If you want that to be the implementation for httpStan, I’m happy to help.

Did that help at all? If there’s a better way to communicate this, please let me know what would help (maybe an example) and I’ll write it in that form.

Bob_Carpenter · August 11, 2017, 11:12pm

Thanks. This rundown is super helpful.

P.S. Clearly Discourse doesn’t understand natural language. Here’s how it nagged me about this reply:

“This topic is clearly important to you – you’ve posted more than 20% of the replies here. Are you sure you’re providing adequate time for other people to share their points of view, too?”

Bob_Carpenter · August 11, 2017, 11:14pm

Any idea why is this being done? I’d have thought the algorithms would work purely with the unconstrained parameters and the constraints would be left to the model class’s log_prob function.

Bob_Carpenter · August 11, 2017, 11:19pm

These seem like different faces of the same problem. Should all the output columns of the sampler come with some kind of properties (integer vs. real and parameter vs. generated quantitity vs. sampler diagnostic vs. …

sakrejda · August 12, 2017, 1:10pm

The approach of writing them out using different writer methods is preferable to me because post-write filtering is more complex than post-write combining. I guess with tags filtering would be easier but why mash them together by default in the first place? I could see doing both but having to split out unconstrained parameters every time I need to plot sampler behavior seems unnecessary.

betanalpha · August 12, 2017, 8:47pm

@syclik’s claim is misleading. The samplers work completely on the unconstrained space and then pass their output to the mcmc_writer class which send the unconstrained info to the diagnostic writer if it exists and uses the model instance’s write_array method to map the unconstrained values to constrained values before sending to the output writer. Moreover, the mcmc_writer doesn’t even live in the algorithm path; it lives in the services path.

Topic		Replies	Views
Status of new writer? Developers	17	1405	March 8, 2018
Status of the IO re-factor? Developers	16	1163	September 18, 2020
Protobuf style guides Developers	3	2408	May 22, 2018
Output specification, yet again Developers	2	635	January 12, 2017
Universal, static, logger-style output Developers pystan , rstan	57	2972	August 14, 2018

Schema for callback writers: one step towards retiring CSV

Related Topics