Universal, static, logger-style output

pystan
rstan

#11

But that’s a logger, I don’t see non-log output pushed through a static object in most projects. I think the reason I don’t see it is that 1) it forces global state; 2) output tends to be big; and 3) output is produced in limited places in the code so passing in an object to route output is not a burden. I could see using that pattern for routing our logging output specifically, since we don’t want to pass the output routing object into every corner of the algorithms and analysis code.


#12

Thanks for bringing up the ‘mcmc_writer’, I think the ‘relay’ concept we’ve been talking about is basically a generalization of this division to multiple algorithms. It’s not clear to me yet if we want to adjust the divisions but we can cross that bridge when we get there. One problem is that I can’t find detailed doc on what the mcmc_writer is supposed to be responsible for, so if it’s clear to you it would help to have that written down.


#13

Right. Logging output is often handled this way.

General output is often handled this way by web applications. For example, a RESTful server behaves exactly this way in terms of dumping everything through a single HTTP connection (or multiple connections for a multi-threaded app with multiple clients). Most software-as-a-service is thus set up this way.

Some of that’s done with inefficient, human-readable, schema-free standards like JSON. Some of it’s done with more efficient, binary output standards like protobuf. Others use custom app-to-app socket protocols. I don’t think it’s a big deal for the scale we’re working at. It’s not like there’s a JSON standard or a protobuf standard for matrices. We’d still have to make up our standard within their standard (like column-major vs. row-major output).

That’s the feature of the design that will enable new output to be added. The alternative is to design a new writer, then add it to the service methods and plumb it down to where it’s needed. The result’s the same—we accrue writers, but the process is way more painful.

How so? It’s never been a bottleneck in CmdStan.

I’d say it does need it. But the answer to why do it is simplicity and extensibility of server-side design and client-side usage.

You could, and that would be the type-safe, textbook-recommended way to do things. But it’s a huge pain to extend and maintain on the server side, and every extension breaks the client.

I not only read it, I thought hard about it. And the code that Martin sent and the discussions on Discourse. I really appreciate the survey of the current output, which I think was critical in moving forward.

It was the Discourse discussion, the hangout and the issue spec that led me to realize this whole approach is going to be difficult to manage. If we start backing off to generic writers, I don’t see why we don’t just keep going and make it really simple.

That’s a useful perspective. The one violation I’m aware of is the log density, which is reported on the unconstrained scale.

I don’t see why we couldn’t keep this perspective on constrained vs. unconstrained output. It’s just a matter of where things get routed.

What I meant is in terms of controlling I/O.

This all got kicked off when Martin wanted to add trajectory information to output (on the unconstrained scale, I think, but that’s a minor matter). This required changing a bunch of function signatures, plumbing arguments through a gazillion functions, and would’ve broken backward compatibility for the interfaces.

Please don’t attribute this just to me. I keep getting in trouble with people for over- or under-attributing things, but Sean is definitely behind a lot of this.

How heavy would it be?

It’s a cheap kind of bespoke, as it has a very limited set of data types. But we could go with something standard or even binary like protobuf.


#14

To address Michael’s concern - we could still choose to have monostream output piped through a series of functions that are allowed to modify the output, potentially taking unconstrained parameters and replacing them with the constrained ones, etc., if that was something we wanted to make sure we could do. This is often called “middleware” I think.

I’m not incredibly worried about the line-level data formats, but maybe I’m just not aware of the full breadth of what might pass through? I’m expecting some parsing to occur anyway, since a line can only encode a single row of (probably anything, but especially) e.g. the parameter draws. Are there very complicated objects to pass through the tube?

If there’s a static logger object that, one way or another, has access to all of the different tags and types of output you might want to put out, it could make sense to me to have an official enum or set of methods on the logger that hardcode which message type you’re using - maybe this helps address one of @sakrejda’s concerns and lets us avoid manually writing code like getLogger()->info("TRAJECTORY {}, {}, {}", ...); and instead could give us something like getLogger()->info->trajectory(data); (or, I’d guess, getLogger()->trajectory(logger.INFO, data); is actually more like what we’d use here, to avoid reimplementing trajectory or info for all of the different combinations of elements from the data-type and log-level categories, respectively). I think that could be nice because it still avoids the plumbing issues we feel quite painfully with every attempt at passing along new types of data, and it has, from the C++ side of things, the type safety that people like, which can help find typos. It doesn’t actually matter from the coder’s perspective here if that becomes an ASCII string or a protobuf or whatever.


#15

I’m on a bad connection until Wednesday so just a quick note: I think this proposal is in some sense very close to what I had in mind, although it approaches the problem from a different angle. The difference may just boil down to whether you prefer the algs to call

“sendXXX(data1, …, dataN)”

“send(XXX, data1, …, dataN)”

or

“send(data1, …, dataN)”

In all cases I think you should have default conversion to string, but this should IMHO happen in a separate step of the process so that interface code may opt to not decode string messages and use the binary data directly.

For example, reconstructing the current CmdStan output from message strings as proposed would IMHO be more work than writing handlers that work with the raw data for each message type separately. OTOH, this message format could be all that is needed for the proposed httpstan interface.

Also I am glad that someone keeps kicking the ball so that there is some impetus to move forward.

On Sat, Jul 14, 2018, 4:16 AM, Bob Carpenter mc_stan@discoursemail.com wrote:


#16

Yes it has:

  1. Can’t do exact re-starts with text output without making output files huge (that’s why we haven’t done it).
  2. The entire .csv-reading discussion, most recent part here: “Alternative .csv reader”. This really limits the utility of generated quantities.

On the library side we have a lot of table-writing calls defined by the model file, which makes them shared across all algorithms! That code is written once and mostly shared because it’s all tables of numbers. Then we have a few calls per algorithm that write algorithm-specific parameters with more complex types. That’s not a “huge pain” to extend.

On the interface side it’s not completely clear yet because I need to write what sort of constructor arguments would be used to parameterize the relay. I can see that this is not a straightforward problem.

You never articulated how it’s going to be difficult to manage, that’s the issue we need to address. So I’ll do some of that:

One issue that has come up repeatedly throughout the discussion about routing the messages is that algorithm-specific output is more heterogeneous than the model-file-defined output. Algorithm-specific output also sometimes needs to be generated much deeper in the algorithm, making it less convenient to pass down writers. A few examples from HMC: mass matrix, momenta, and Martin’s example of trajectories.

For that output the benefits of a static logger do become apparent because we would no longer have to pass the output relay object to some arbitrary depth in the stan-dev/stan code. I don’t think it’s a necessary pattern but I can see how it reduces the maintenance burden.

I still don’t see the point of stringifying everything so I suggest a signature like:

logger.debug(...);
logger.info(...);
logger.warn(...);
logger.error(...);
logger.fatal(...);

Where the ... are handled as a parameter pack (it’s a standard, simple use of templates, so hopefully not a burden). Then the default implementation (barring some more specific match that lets the interface handle the data in a binary format) can be:

template <typename T>
void logger::warn(const T& x) {
  ostream_ << "WARN: " << stan::stringify(x);
}

with stan-dev/stan defining how various standard types turn into strings:

std::string stan::stringify(const std::vector<double>& x) {
  std::stringstream ss;
  if (x.empty()) return ss.str();
  ss << x[0];
  for (size_t i = 1; i < x.size(); ++i) {
    ss << ", " << x[i];
  }
  return ss.str();
}

You get your default text stream and you don’t sabotage other methods of returning output. I didn’t get the text tags in there, but you could assume that the logger always gets a first argument that’s a std::string that can be pulled out as a tag.

For the model-file-defined output (see the beginning of “signatures a relay needs to handle”), we 1) generate it at a limited set of points in the services code; 2) sometimes process it using the model (e.g., parameter transformations); and 3) have different preferences from the different interfaces for how to output it, all of which are compatible with some use of std::vector<double> or std::vector<int>.

For this model-defined output a static logger doesn’t address the main issues (the coupling with the model, the volume of output that makes stringifying everything awkward).


#17

This is cool, though perhaps more suited to a general-purpose serialization library. Here we don’t actually want to log all data of a specific type the same way - in fact we will mostly want to be telling Stan how to log vectors in different ways depending on whether they are parameter draws, rows of an initialization matrix, initial values, trajectories, etc. So I think the call needs to somehow indicate which data it is that we are logging, perhaps as well as the level (though perhaps we could also just say that any trajectory is of level INFO or something).

So maybe it makes sense to have some logger methods defined on the model class that uses the static logger under the hood. That way the model can decide how to process data before it’s passed on to the static logger (to untransform parameters or whatever).

I also just want to take two seconds to decouple some of the ideas here: having a single static logger does not imply we use ASCII strings as our serialization format, or even necessarily imply we only use one output stream of any kind. Just want to make sure we’re allowing ourselves to explore the full combinatoric possibilities; you could have a single static logger that outputs a different CSV file for every call to logger().trajectory(logger::INFO, v1); or whatever. Not saying that’s a good combination but that these are orthogonal ideas.


#18

How about keeping the output interface simple and putting the burden on the algorithms to create appropriate output? That would involve writing functions to do transforms we need more than once, but critically wouldn’t embed these in the output API. Putting these transforms directly into the I/O seems like it would introduce more coupling than we want.

Right. But I think it would be unusual from a design perspective to do anything other than route output by tag. Obviously by the time we get to an interface, we want the draws to go in one place, the diagnostics in another, and the online output to go to a console or just be recorded for later inspection. But do we need to build all this logic into the output structure?

I had thought the key to the modularity here was to keep the output writers simple and let the interfaces do the dispatch based on tags.


#19

You’re 100% right that that is the ultimate in modularity and decoupling (as far as I can tell). To reiterate, we’re not just looking to simplify the code here, we want to simplify the organizational concerns - we don’t want to have to do every C++ change in lock-step with every interface we want to support. I really love the way we can easily add new data streams to the output and still be backwards compatible with this monostream tag-based plan.

I guess I was looking for potential compromise ideas - we have pain points now at one extreme, but I’m not totally sure it necessitates moving to the complete other extreme. That’s a real “not sure,” not some kind of British understatement. Complete decoupling is certainly very attractive from the C++ side’s maintenance perspective, and probably my preferred approach at the current point in our discussion. But it’s definitely possible @sakrejda riffs off of these ideas to come up with something even better :)


#20

I think that’s largely right, and I’ve been convinced that send(XXX, data1, ...) is the most straightforward and easiest-to-code approach for the interfaces and for our algorithms.

Part of the proposal was to make everything human readable. As @seantalts pointed out, this isn’t critical to having dispatch by string tag and a static logger-type output device.

If we went with binary, we’d want to use something like a protobuf wrapper so we don’t need to worry about platform issues for binaries.

We can round trip with about 20 ASCII digits if we need scientific notation, so that’s a factor of 2.5 larger than double-precision binary.

We decided ages ago not to output 16 significant decimal digits because the latter half are basically random bits given the accumulated errors in our sequences of calculations. The only motivation I see are exact restarts, but then I don’t see the motivation for doing that beyond 6 decimal places of accuracy or so.

If it were free, then sure—I’m not suggesting we try to round if we use binary output!

Smallish integers are smaller in ASCII than in binary.

Side note: for exact restarts that will wind up matching results that weren’t restarted, we need to figure out how to serialize the PRNGs and reinitialize them.

There is time overhead. If everything’s binary, then it’s just network/file connections, but if it’s ASCII, then there’s a conversion that takes dozens of operations. It’s this that hasn’t ever been a bottleneck in CmdStan, other than in reading huge data files that were fit with a very simple linear regression in optimization. In that case, CmdStan’s I/O takes up well over 90% of the total compute time.

This is just CmdStan, right? The other interfaces control their own output as of version 2.17. We could change that. We could write in binary, or we could write to some format other than .csv.

I believe Allen and Ari’s current plan for PyStan is to use JSON. Is that the kind of thing you’re thinking or are you thinking something binary?

Meta-commentary

I’m going to try to stay away from subjective terms like “huge” from now on, as they’re just getting in the way of the discussion:

My use was subjective and @sakrejda’s was a factor of two. I just don’t want to devolve into semantics here.

Can’t we just write functions for shared operations?

In modern C++, it’s much more common to write standalone functions rather than adding methods to a class. That’s partly because it’s more flexible for compilation and for typing. It’s not set in stone, so if it makes more sense as methods on a class, we can do that.

Sure. The current design (as of 2.17) requires changing all the signatures and breaks backward compatibility with every new output decision. I now wonder if we all even agree that the current situation is problematic.

The more recent designs from @martinmodrak and @sakrejda, in my understanding, were attempts to solve the problem of having to plumb in new writers and then plumb them down to where they’re used every time we needed new output. Like our original design, they were focused on type-safe, algorithm-aware, structured output with C++ handlers on the interface side.

@seantalts has been gradually convincing me that the focus on type safety is largely misguided in these contexts, as the interfaces want more flexibility in what they do with output without having to write new C++ handlers.

All else being equal, human readable output is preferable because then someone at a later date can just grab it and process it however they want.

It also makes it easy for the interfaces to process without getting third-party libraries like protobuf involved.

That’s the kind of thing I was imagining, with something like methods on the C++ side for all the different types of output. With a static logger controlled by tags and default conversions, this could be just

LOG_("WARN", x);

but as @betanalpha pointed out, sometimes that output needs to get converted back to constrained format, as in:

LOG_("DRAW", model_.constrain_(theta));

This is the same issue that @betanalpha brought up. The way we handle this doesn’t need to change. Assuming we plumb an mcmc_writer down to the algorithms, then we can just use a new form of mcmc_writer as in:

LOG_("DRAW", mcmc_writer.foo(theta));

You can construct a std::vector out of any iterator over values, so that’s no problem. There’s no more direct way to serialize to file that doesn’t involve writing a custom allocator for std::vector.


#21

String vs. protobuf is a false trade-off. Some things about C++ are hard, but allowing plug-ins (of the kind we currently have with writers) for handling details of output format is easy in C++, so why force the decision?

The size-on-disk is not the barrier to using ASCII, it’s the time it takes to load it.

rstan, for large enough output and for deeply nested arrays, must use the .csv output due to memory limits and due to issues caused in R/Rcpp by deeply nested std::vector<std::vector<...>>. Reading .csv (or any other text-based format; .csv is fairly amenable to optimization) is a bottleneck, as you can see in the discussion I linked.

My point was that whatever way you do it, it’s not like there’s massive complexity in our output types that interfaces have to handle. It’s the routing that’s harder to deal with.

This hasn’t been articulated outside of conversations that happened off-Discourse. Your proposal sounds like you want to stringify deep in stan-dev/stan, in the same fashion that our mass matrices are currently stringified. That seems like a big cost, since it’s one of the reasons there’s no clean code for extracting the mass matrix in rstan.

This discussion is going to be a lot more productive if you can be specific about what “complete decoupling” is. In your proposal I see three features: 1) text tags; 2) stringify everything early (and therefore disallow the plug-in approach to output type handling); and 3) use a static object to handle all output. So, what do you actually want these design choices to accomplish?


#22

Good point - let’s jump up to the goal level here. We have at least two goals (please jump in with more):

  1. Make it easier to add (or ideally, even change) a new type of data output
  2. Make it easier to add a new algorithm, service, model method, etc
  3. [edit - added by @sakrejda] Make it easy for the interfaces to handle the output.

Generally we want to avoid thinking a lot about logging whenever we’re doing some of these unrelated tasks, and we also want to avoid thinking about upstream (interface-level) consequences as often as we do. Not thinking about logging means not having to write all methods with 6 writers to satisfy an interface when creating a new algorithm. When avoiding upstream concerns, it’s highly prized to be able to add a new output (e.g. trajectories) without breaking existing interface code. You might also want to be able to make slight tweaks to the formatting or exact data sent for particular types of data without needing to change code in all of the interfaces and other repos (something protobuf as a protocol could help with - change the specification once, have that percolate everywhere as long as the interfaces aren’t using the specific fields you’re messing with. This can also work with cautiously designed ASCII output).

I don’t think I understand what “the plug-in approach” refers to… Can you tell me more? If you’re talking about your post above proposing using templates to figure out how to serialize different data types, I think that’s an orthogonal concern at a potentially slightly deeper level, not affecting the 1) tags vs. multiple streams, 2) ASCII vs. binary, or 3) static vs. local decisions. I think that because 1) output to a file or files is handled much later, 2) you can still use that coding technique to serialize a std::vector<double> to ASCII, but we also need to write out what that data represents, and 3) we could equally well pass such a function along or refer to it globally.


#23

Something like callbacks, where the decision about exactly how to handle output is punted to some other code. I think this affects #2 (ASCII vs. binary) because once you go text you can’t sensibly go back. It doesn’t affect #1 (tags vs. multiple streams) because both capture routing information OK.

#3 (static vs. local object) seems like a separate concern. Having gone back over the discussion I’d like to know what you think about having a more standard division: pass local objects to handle routing of output and allow logging to be a global static object. For logging a static object makes sense because it might be called from basically anywhere. For output you should only have to pass the local object to a service method so a static object doesn’t have much benefit.


#24

I think a good third goal is to make it easier for interfaces to handle the output. CmdStan, rstan, and pystan can all handle getting a double* plus a size (as can any language with a C API), and it would be trivial to write default handlers that stringify everything. If we stringify early, we’re forcing rstan and pystan to transform back to binary floating-point representation.


#25

Gotcha. It’s not impossible to come back from text, but having more structure is way better, agreed. Seems like maybe this issue could be separated a little further - even if you want to eventually pass ASCII to RStan et al, you could delay the stringification considerably (enough to allow for plugins to modify the data qua data on the way out).

Honestly I am not familiar enough with the code to comment here - I could see it going either way depending on how often we’re writing data (vs logging). Can I tag in @Bob_Carpenter?

Yeah, that’s a reasonable goal. I’d also put it at the end of that list, I think, if only because in my experience writing the serialization layer is usually a fairly trivial part of an application (even if one creates one’s own ASCII format, not that we couldn’t use something existing). That said, no reason to make extra work for people if there’s a library that exactly meets our needs for serialization and has stubs for our languages.


#26

Most of our text output comes from either:

  1. the messages in exceptions from the math library that get caught by the algorithms,

  2. errors in the algorithms like log densities evaluating to zero, and

  3. direct algorithm output, like iteration number.

All of the structured output for draws comes from the algorithms, but that’s largely driven by calling the model class’s method write_array(), which converts unconstrained parameters to constrained parameters, transformed parameters, and generates the generated quantities.

We haven’t coded things like trajectories yet, but those will presumably look like our sampling output.


#27

That has also been my experience, even when I had to use elaborate Java serialization hacks for forward compatibility. [That post is wrongly attributed to Breck, like most of the old posts on that site—one of the reasons I really dislike WordPress.]


#28

This is pretty easy to establish by a find/grep call, I’ll see if I can produce it.


#29

I should have been more specific, but what I really meant was some sense of how deeply nested the data output calls are, some sense of how unwieldy it would be to thread just the data writers through broadly - I think this is sort of intuitive and maybe comes mostly from experience (though perhaps there are metrics that could approximate that). Happy to defer to you two here.


#30

That’s what I understood too. I did this by find/grep and checking every file when I did the summary of our current usage of writers so this can be answered fairly quickly by checking those results again.