Universal, static, logger-style output

pystan
rstan

#21

String vs. protobuf is a false trade-off. Some things about C++ are hard, but allowing plug-ins (of the kind we currently have with writers) to handle the details of the output format is easy in C++, so why force the decision?

Size on disk is not the barrier to using ASCII; the time it takes to load it is.

For large enough output and for deeply nested arrays, rstan must use the .csv output, both because of memory limits and because of issues caused in R/Rcpp by deeply nested std::vector<std::vector<...>>. Reading .csv (or any other text-based format; .csv is fairly amenable to optimization) is a bottleneck, as you can see in the discussion I linked.

My point was that whatever way you do it, it’s not like there’s massive complexity in our output types that interfaces have to handle. It’s the routing that’s harder to deal with.

This hasn’t been articulated outside of conversations that happened off-discourse. Your proposal sounds like you want to stringify deep in stan-dev/stan, in the same fashion that our mass matrices are currently stringified. That seems like a big cost since it’s one of the reasons there’s no clean code for extracting the mass matrix in rstan.

This discussion is going to be a lot more productive if you can be specific about what “complete decoupling” is. In your proposal I see three features: 1) text tags; 2) stringify everything early (and therefore disallow the plug-in approach to output type handling); and 3) use a static object to handle all output. So, what do you actually want these design choices to accomplish?


#22

Good point - let’s jump up to the goal level here. We have at least two goals (please jump in with more):

  1. Make it easier to add (or ideally, even change) a new type of data output
  2. Make it easier to add a new algorithm, service, model method, etc
  3. [edit - added by @sakrejda] Make it easy for the interfaces to handle the output.

Generally we want to avoid thinking a lot about logging whenever we’re doing some of these unrelated tasks, and we also want to avoid thinking about upstream (interface-level) consequences as often as we do. Not thinking about logging means not having to write all methods with 6 writers to satisfy an interface when creating a new algorithm. When avoiding upstream concerns, it’s highly prized to be able to add a new output (e.g. trajectories) without breaking existing interface code. You might also want to be able to make slight tweaks to the formatting or exact data sent for particular types of data without needing to change code in all of the interfaces and other repos. Protobuf as a protocol could help with that: change the specification once and have that percolate everywhere, as long as the interfaces aren’t using the specific fields you’re messing with. This can also work with cautiously designed ASCII output.

I don’t think I understand what “the plug-in approach” refers to… Can you tell me more? If you’re talking about your post above proposing using templates to figure out how to serialize different data types, I think that’s an orthogonal concern, potentially at a slightly deeper level, that doesn’t affect the 1) tags vs. multiple streams, 2) ASCII vs. binary, or 3) static vs. local decisions. I think that because 1) output to a file or files is handled much later; 2) you can still use that coding technique to serialize a std::vector<double> to ASCII, but we also need to write out what that data represents; and 3) we could equally well pass such a function along or refer to it globally.


#23

Something like callbacks where the decision about exactly how to handle output is punted to some other code. I think this affects #2 (ascii vs. binary) because once you go text you can’t sensibly go back. It doesn’t affect #1 (tags vs. multiple streams) because both capture routing information ok.

#3 (static vs. local object) seems like a separate concern. Having gone back over the discussion I’d like to know what you think about having a more standard division: pass local objects to handle routing of output and allow logging to be a global static object. For logging a static object makes sense because it might be called from basically anywhere. For output you should only have to pass the local object to a service method so a static object doesn’t have much benefit.
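
To make the proposed division concrete, here is a minimal sketch, not actual Stan code. All names (`logger`, `draw_relay`, `run_chain`) are hypothetical and made up for illustration. The logger is a global static reachable from anywhere, while the object that routes draws is local and gets passed to the service method. The logger here is not synchronized, so a real one would need locking or similar.

```cpp
#include <string>
#include <vector>

// Hypothetical global logger: reachable from anywhere without being passed.
// Not synchronized; this sketch assumes a single thread.
class logger {
 public:
  static logger& instance() {
    static logger the_logger;  // constructed on first use
    return the_logger;
  }
  void info(const std::string& msg) { messages_.push_back(msg); }
  const std::vector<std::string>& messages() const { return messages_; }

 private:
  logger() = default;
  std::vector<std::string> messages_;
};

// Hypothetical local relay: the interface constructs one per chain and
// decides what happens to the draws (here they are simply stored).
class draw_relay {
 public:
  void operator()(const std::vector<double>& draw) { draws_.push_back(draw); }
  const std::vector<std::vector<double>>& draws() const { return draws_; }

 private:
  std::vector<std::vector<double>> draws_;
};

// Stand-in for a service method: output goes through the relay it was
// handed, log messages go through the global logger.
void run_chain(int num_draws, draw_relay& relay) {
  for (int i = 0; i < num_draws; ++i) {
    logger::instance().info("iteration " + std::to_string(i + 1));
    relay({static_cast<double>(i), 0.5 * i});
  }
}
```

With this shape, two chains get two relays and never see each other's output, while both share the one logger.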


#24

I think a good third goal is to make it easier for interfaces to handle the output. CmdStan, rstan, and pystan can all handle getting a *double + size (as can any language with a C API) and it would be trivial to write default handlers that stringify everything. If we stringify early we’re forcing rstan and pystan to transform back to binary floating point representation.
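
As a rough sketch of what I mean (hypothetical names throughout, none of this is an existing Stan API): the algorithm side hands over a pointer plus a size, and the interface picks the handler. One handler stringifies by default, the other keeps the doubles as doubles.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical C-API-shaped callback: a tag naming the output, a pointer to
// the values, the number of values, and an opaque pointer owned by the caller.
using double_handler = void (*)(const char* tag, const double* values,
                                std::size_t size, void* user_data);

// Default handler: stringify into one comma-separated line.
// user_data must point at a std::string.
void stringify_handler(const char* tag, const double* values,
                       std::size_t size, void* user_data) {
  assert(user_data != nullptr);
  auto* out = static_cast<std::string*>(user_data);
  std::ostringstream line;
  line << tag;
  for (std::size_t i = 0; i < size; ++i) line << ',' << values[i];
  line << '\n';
  *out += line.str();
}

// Binary handler: copy the doubles as-is, with no text round trip.
// user_data must point at a std::vector<double>.
void copy_handler(const char* /*tag*/, const double* values,
                  std::size_t size, void* user_data) {
  assert(user_data != nullptr);
  auto* out = static_cast<std::vector<double>*>(user_data);
  out->insert(out->end(), values, values + size);
}

// Stand-in for algorithm code: it knows nothing about the output format.
void emit_draw(const std::vector<double>& draw, double_handler handler,
               void* user_data) {
  handler("draw", draw.data(), draw.size(), user_data);
}
```

The point of the design is that `emit_draw` is identical in both cases; only the interface that chose `copy_handler` avoids parsing text back into floating point.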


#25

Gotcha. It’s not impossible to come back from text, but having more structure is way better, agreed. Seems like maybe this issue could be separated a little further - even if you want to eventually pass ASCII to RStan et al, you could delay the stringification considerably (enough to allow for plugins to modify the data qua data on the way out).

Honestly I am not familiar enough with the code to comment here - I could see it going either way depending on how often we’re writing data (vs logging). Can I tag in @Bob_Carpenter?

Yeah, that’s a reasonable goal. I’d also put it at the end of that list, I think, if only because in my experience writing the serialization layer is usually a fairly trivial part of an application (even if one creates one’s own ASCII format, not that we couldn’t use something existing). That said, no reason to make extra work for people if there’s a library that exactly meets our needs for serialization and has stubs for our languages.


#26

Most of our text output comes from either:

  1. the messages in exceptions from the math library that get caught by the algorithms,

  2. errors in the algorithms like log densities evaluating to zero, and

  3. direct algorithm output, like iteration number.

All of the structured output for draws comes from the algorithms, but that’s largely driven by calling the model class’s method write_array(), which converts unconstrained parameters to constrained parameters, transformed parameters, and generates the generated quantities.

We haven’t coded things like trajectories yet, but those will presumably look like our sampling output.


#27

That has also been my experience. Even when I had to use elaborate Java serialization hacks for forward compatibility [that post is wrongly attributed to Breck, like most of the old posts on that site, which is one of the reasons I really dislike WordPress].


#28

This is pretty easy to establish by a find/grep call, I’ll see if I can produce it.


#29

I should have been more specific, but what I really meant was some sense of how deeply nested the data output calls are, some sense of how unwieldy it would be to thread just the data writers through broadly - I think this is sort of intuitive and maybe comes mostly from experience (though perhaps there are metrics that could approximate that). Happy to defer to you two here.


#30

That’s what I understood too. I did this by find/grep and checking every file when I did the summary of our current usage of writers so this can be answered fairly quickly by checking those results again.


#31

I’m not really sure but it sounds like most of the pain is around the addition of new data writers rather than the addition of new print statements, so it probably still makes sense to have a static data writer as well. @Bob_Carpenter does that sound right? If so, then we can design the data-writing API together first, then figure out the API or way in which the interfaces connect to the output (one vs many streams, callbacks vs. files/sockets vs. ?, …). All while keeping in mind our three goals. Sound good?


#32

Not sure what you mean by data writer.


#33

For MCMC, the calls are in a service method (‘generate_transitions’), so not deep. For optimization and ADVI they’re directly in the service method as well.


#34

I mean mechanisms by which we output additional data. We currently output things like parameter values each draw, but also want to add data writers for divergent trajectories for example. I was trying to use the same distinction you brought up between logging and writing data.


#35

We mis-communicated. I think a static object is a good match for anything that needs to be sent from deep within the algorithm code, including divergent trajectories, iff we allow the logger to maintain some type information (I’d hate to see us sending trajectories as text). What I don’t want the logger to handle is the normal algorithm output (posterior samples, unconstrained posterior samples, gradients, momenta, optimizer estimates at each step) because those are available at a shallow depth (usually in service methods).
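
Here is one way a static logger could keep type information, as a sketch under my own assumptions (the names `sink`, `typed_logger`, `memory_sink`, and `leapfrog_step_diverged` are hypothetical, not anything in stan-dev/stan). The sink has one overload per payload type, so a trajectory arrives as a vector of doubles rather than as text.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sink interface: one overload per payload type, so numeric
// payloads never have to pass through a string.
class sink {
 public:
  virtual ~sink() = default;
  virtual void write(const std::string& tag, const std::string& text) = 0;
  virtual void write(const std::string& tag,
                     const std::vector<double>& values) = 0;
};

// Hypothetical static logger that forwards typed payloads to whatever sink
// the interface configured. Messages are dropped if no sink is set.
class typed_logger {
 public:
  static typed_logger& instance() {
    static typed_logger the_logger;
    return the_logger;
  }
  void configure(std::shared_ptr<sink> s) { sink_ = std::move(s); }
  void emit(const std::string& tag, const std::string& text) {
    if (sink_) sink_->write(tag, text);
  }
  void emit(const std::string& tag, const std::vector<double>& values) {
    if (sink_) sink_->write(tag, values);
  }

 private:
  typed_logger() = default;
  std::shared_ptr<sink> sink_;
};

// Example sink an interface might supply: keeps text and numbers separate.
class memory_sink : public sink {
 public:
  void write(const std::string& tag, const std::string& text) override {
    text_lines.push_back(tag + ": " + text);
  }
  void write(const std::string& tag,
             const std::vector<double>& values) override {
    numeric[tag].push_back(values);
  }
  std::vector<std::string> text_lines;
  std::map<std::string, std::vector<std::vector<double>>> numeric;
};

// Stand-in for code deep inside an algorithm, where passing a writer down
// through every call would be awkward.
void leapfrog_step_diverged(const std::vector<double>& position) {
  typed_logger::instance().emit("message", std::string("divergent transition"));
  typed_logger::instance().emit("divergent_trajectory", position);
}
```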

So if you look at the issues I have up on stan-dev/stan (they’re the two most recent I filed, sorry I can’t easily pop a link in here) one thing that’s missing on my end is how the relay object would be instantiated. On your end one thing that’s missing is how we want the interfaces to configure this static logger. Those two are basically the same issue (I want the relay configured on construction). I guess that means I agree (?) that we should figure out the interface end next.


#36

If you already have a data logger available, why not use it for everything? What do you gain by having an additional set of writers passed in for things that you suspect won’t need to be written too deeply?


#37

Why do you say suspect? I literally just read the code with my eyeballs so I could make that statement. I don’t want to use it for everything (yet) because I have yet to see this design that can handle dozens of threads/chains dumping large vectors of posterior samples into it. I look forward to checking it out!


#38

Sorry, I definitely think you know the code better than me and well enough to figure out how deep things are now - I mostly meant that things can change and it’s hard for anyone to predict the future needs. [edit] So I’d tend towards simple and unified until we have a need.

[also edit] What can we do to investigate the risk here? Should we build a prototype and write a benchmark style test for it to show throughput maybe?
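
Something like the following could be a starting point for such a benchmark. It is only a sketch with hypothetical names (`counting_sink`, `benchmark_result`, `benchmark_sink`); a real test would swap in the prototype logger for the trivial sink and run it from several threads.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical trivial sink: touches every value so the work is not
// optimized away. Replace with the prototype under test.
struct counting_sink {
  std::size_t values_seen = 0;
  double checksum = 0.0;
  void operator()(const std::vector<double>& draw) {
    values_seen += draw.size();
    for (double v : draw) checksum += v;
  }
};

struct benchmark_result {
  std::size_t values_written;
  double seconds;
  double values_per_second() const {
    return seconds > 0 ? values_written / seconds : 0.0;
  }
};

// Pushes num_draws draws of num_params values each through the sink and
// reports wall-clock time measured with a monotonic clock.
template <typename Sink>
benchmark_result benchmark_sink(Sink& sink, std::size_t num_draws,
                                std::size_t num_params) {
  const std::vector<double> draw(num_params, 1.0);
  const auto start = std::chrono::steady_clock::now();
  for (std::size_t i = 0; i < num_draws; ++i) sink(draw);
  const std::chrono::duration<double> elapsed =
      std::chrono::steady_clock::now() - start;
  return {num_draws * num_params, elapsed.count()};
}
```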


#39

It’s not that I’m personally offended, I just want these conversations to be based on references to actual code rather than our feelings about what we remember (that’s why I keep checking the actual code). I don’t think predicting future needs is necessary, but a good design is.

From my point of view I think we can punt on this question and talk about how to configure routing first. Either way an object gets constructed to handle routing.

I do think at some point we will have to decide about the trade-off. You want to make the logger static and global and you’re willing to pay the complexity cost of writing a high-performance multi-producer single-consumer queue. I want to make it a local object and pay the cost of passing a different one to every chain via the service layers. At that point if you can prototype something that will process 200k parameters from each of a dozen chains I guess it’s a fine design. I guess it could be thread_local and then you would be back to single-producer.
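
For reference, this is roughly the simplest multi-producer single-consumer shape I can think of: a mutex-guarded queue, which is deliberately not the high-performance version. The names (`tagged_draw`, `draw_queue`, `run_chains`) are hypothetical. Every chain contends on one lock here, which is exactly the cost in question; a thread_local buffer per chain would avoid the lock but needs its own way of getting drained.

```cpp
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

struct tagged_draw {
  int chain_id;
  std::vector<double> values;
};

// Simple locked queue: many producers push, one consumer pops.
class draw_queue {
 public:
  void push(tagged_draw d) {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      queue_.push_back(std::move(d));
    }
    ready_.notify_one();
  }
  // Signals that no more draws will be pushed.
  void close() {
    {
      std::lock_guard<std::mutex> lock(mutex_);
      closed_ = true;
    }
    ready_.notify_all();
  }
  // Blocks until a draw is available; returns false once closed and drained.
  bool pop(tagged_draw& out) {
    std::unique_lock<std::mutex> lock(mutex_);
    ready_.wait(lock, [this] { return closed_ || !queue_.empty(); });
    if (queue_.empty()) return false;
    out = std::move(queue_.front());
    queue_.pop_front();
    return true;
  }

 private:
  std::mutex mutex_;
  std::condition_variable ready_;
  std::deque<tagged_draw> queue_;
  bool closed_ = false;
};

// One producer thread per chain plus one consumer thread.
// Returns the number of values the consumer received from each chain.
std::vector<std::size_t> run_chains(int num_chains, int num_draws,
                                    std::size_t num_params) {
  draw_queue queue;
  std::vector<std::size_t> received(static_cast<std::size_t>(num_chains), 0);
  std::thread consumer([&] {
    tagged_draw d;
    while (queue.pop(d)) received[d.chain_id] += d.values.size();
  });
  std::vector<std::thread> producers;
  for (int c = 0; c < num_chains; ++c) {
    producers.emplace_back([&queue, c, num_draws, num_params] {
      for (int i = 0; i < num_draws; ++i)
        queue.push({c, std::vector<double>(num_params, 1.0)});
    });
  }
  for (auto& p : producers) p.join();
  queue.close();  // all producers are done, let the consumer drain and exit
  consumer.join();
  return received;
}
```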


#40

I have low confidence, but I don’t think we have any current multiprocessing stuff (or immediate plans) that requires anything complicated here: chains are separate processes executing the same program but sharing nothing; map_rect threaded AD is all within a likelihood; MPI also has separate processes that don’t share memory; GPU is a single process… Am I forgetting anything? It’s Friday afternoon so please excuse (but point out) any omissions!