Schema for callback writers: one step towards retiring CSV

@ariddell, I’ll go through and update the wiki page in the next couple weeks.

I don’t think you’ve captured enough output with

  • vector<string>
  • vector<double>
  • string

The output is a little scattered right now. This is working off my memory. For something like the default NUTS algorithm, you’ve got (see the sketch after this list):

  • One time output:
    • From CmdStan: configuration. Maybe Stan doesn’t need to worry about this and that’s fine.
    • header. This is currently conditional on whether diagnostics is turned on or off, but here’s where the trickiness starts:
      • every constrained parameter gets listed (string)
      • there are sampler parameters like divergent__ and treedepth__
      • lp__. This isn’t sampler related. This is different.
      • then there’s some diagnostic information like the unconstrained parameters
    • adaptation info: metric, stepsize, etc. This doesn’t fit nicely into a vector<string> or vector<double>
    • elapsed time
  • per iteration output
    • everything that’s listed in the header. I really think it’s a few different things outputting per iteration:
      1. log probability
      2. constrained parameters
      3. sampler information
      4. unconstrained parameters
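
For concreteness, here’s a rough sketch of those pieces as plain structs. All names here are hypothetical, just to show how the one-time and per-iteration output split apart:

```cpp
#include <string>
#include <vector>

// One-time output (hypothetical names): the header and adaptation info.
struct run_header {
  std::vector<std::string> constrained_param_names;    // every constrained parameter
  std::vector<std::string> sampler_param_names;        // divergent__, treedepth__, ...
  std::vector<std::string> unconstrained_param_names;  // diagnostic info
};

struct adaptation_info {
  double stepsize;
  std::vector<double> metric;  // diagonal or dense, flattened here for brevity
};

// Per-iteration output: the four kinds of values listed above.
struct draw {
  double lp;                                 // 1. log probability
  std::vector<double> constrained_values;    // 2. constrained parameters
  std::vector<double> sampler_values;        // 3. sampler information
  std::vector<double> unconstrained_values;  // 4. unconstrained parameters
};
```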

Now that the logging is split out, we really don’t need to pump out strings and should disable it.

Our current implementation has things output in a consistent order, which is nice. It simplifies logic, but at some point we might imagine it doesn’t (e.g., if we go multithreaded).

Something I always wanted to do, but that was always too cumbersome, was adding the ability to output the trajectory for a single iteration. That was earlier on, when I wanted to debug something like divergences. It’d still be cool to do.

I’m not a big fan of relying on maps. I’d rather define a schema instead. (The implementation could be done through a map.)

There’s also optimization and ADVI and diagnostics that output different things that need to be accommodated.

And, per log density evaluation, in all of these interfaces we have print statements from within Stan and also error messages coming from trapped exceptions.

The adaptation info is metric and stepsize, but might also include size of last batch or something if you really want to be able to restart adaptation—even so, it wouldn’t be the same as just having run it because of the rounding for the final epoch.

I like the idea of saving the config with output. This is a generally messy key-value data structure where the values can be all sorts of things. For instance, they might include mass matrices and names of algorithms (e.g., static Euclidean HMC).

I think it would make a lot of sense to separate the diagnostics, the log density, and the actual parameter values (and perhaps even split those out from derived quantities). The reason I say that is that they’re currently being used for different purposes within the R interfaces and they must be being coded by name matching somehow.

Are you imagining print statements in a Stan program going to the logger? Should we then extend Stan to have a logging level inside of it for its own print statements?

I was thinking that the informative keys could be discarded if they were not wanted; CmdStan need not output them to disk in ASCII format.
The goals are (1) to move away from CSV as the (implicit) schema and (2) to provide additional context to a message which is sent via a callback writer to the interface.

OK, so I’ll concede that the schema has a problem here. What about dumping the key-value structure into a JSON string and sending that? Everyone can read JSON.
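
Roughly something like this, as a minimal sketch (a flat string-to-string config, escaping of special characters omitted; nested values would be handled recursively):

```cpp
#include <map>
#include <sstream>
#include <string>

// Hypothetical helper: dump a flat key-value config to a JSON string that can
// be sent through any callback that accepts a string.
std::string config_to_json(const std::map<std::string, std::string>& config) {
  std::ostringstream out;
  out << "{";
  bool first = true;
  for (const auto& kv : config) {
    if (!first)
      out << ",";
    first = false;
    out << "\"" << kv.first << "\":\"" << kv.second << "\"";
  }
  out << "}";
  return out.str();
}
```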

I don’t see an alternative really – you can’t have a std::map<string, std::map…> of arbitrary depth in C++, right?

Yeah, you can: the value ends up being either a scalar type, a vector, or another map, and you use one of the variant types for it. Not sure if there’s a Boost implementation or something like that we could just borrow.

Interesting. Maybe that could work then.

There are many ways to code arbitrary depth objects in C or C++. One example in Stan is the AST for the language parser.

The way that would look most like Python is to use pointers, store type info in the object itself in a readable form, then cast pointers at runtime.

The more idiomatic way to do it in C++ (aka C++thonic) would be to do it with variant types as we do in the AST. That way, everything’s still statically type checked. It’s much more effort to write through callbacks, but you get static type checking with no possibility of run-time type casting errors. We’re using the Boost variant lib for everything.
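
For reference, here’s a minimal sketch of an arbitrary-depth key-value structure along those lines using Boost’s recursive_wrapper (names are hypothetical, not anything in the code base):

```cpp
#include <boost/variant.hpp>
#include <map>
#include <string>
#include <vector>

// Hypothetical recursive value type: a value is a double, a string, a vector
// of doubles, or another map of values. recursive_wrapper breaks the cycle.
struct config_value;
using config_map = std::map<std::string, config_value>;

struct config_value {
  boost::variant<double, std::string, std::vector<double>,
                 boost::recursive_wrapper<config_map>>
      value;
};

// Usage: build a nested entry such as {"adaptation": {"stepsize": 0.05}}.
inline config_map example_config() {
  config_map adaptation;
  adaptation["stepsize"] = config_value{0.05};
  config_map config;
  config["adaptation"] = config_value{adaptation};
  return config;
}
```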

If the primary goal is to allow a function to emit a record (for
posterity) of its config via the callback writer, I think the
serialize-to-json and emit a string is a good approach. Recall that
every interface has a JSON reader while not every interface can easily
wrap complex C++ code.

Absolutely—I didn’t mean to confuse internal and serialized representations. Only a C++ purist would want to deal with variant types a la Boost.

I was just responding to your question, “you can’t have a std::map of arbitrary depth in C++, right?”

Mind if I shift the discussion a little bit? Still on topic, but thinking about the output a little differently.

I’d prefer us moving away from the mindset of having an all-encompassing writer that accommodates whatever output we want to throw at it. I think that’s what is making this a difficult problem.

Instead, we can think about it this way: once we select an algorithm, we know what will be produced. Coupled with the data, we know the output down to the sizes. We should find a way to describe that. One extreme way to encode this is to have a different type of writer for each piece, like a writer for draws, a writer for the adaptation parameters, a writer for the elapsed time, etc. We can imagine there’s some fixed C++ structure that holds everything. Then it’s a matter of serializing and deserializing that. Or we can build some sort of schema that describes it. But we should be describing that, not the message-level information. (btw, this applies to optimization and ADVI too.)
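
To make the “one writer per piece” extreme concrete, here’s a hypothetical sketch (none of these are existing interfaces):

```cpp
#include <string>
#include <vector>

// Hypothetical per-piece callbacks: each kind of output gets its own narrow
// interface instead of one catch-all writer.
struct draw_writer {
  virtual ~draw_writer() {}
  virtual void operator()(const std::vector<double>& constrained_values) = 0;
};

struct adaptation_writer {
  virtual ~adaptation_writer() {}
  virtual void operator()(double stepsize, const std::vector<double>& metric) = 0;
};

struct timing_writer {
  virtual ~timing_writer() {}
  virtual void operator()(double warmup_seconds, double sampling_seconds) = 0;
};
```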

The only complication here is if we consider not serializing everything. And I think that’s fine. We just have to know what’s minimal, or what we can do with the pieces.

Anyway, that’s how I’m thinking about it. Thoughts? (I don’t care if the implementation is done through a class that can handle everything, but trying to piece together structured output with individual key-value pairs seems like we’re making the problem a lot harder on ourselves.)

I’ve been playing around with this (it’s easier for me to think about it if I just make some classes) and I agree. I think, much like you did for CmdStan arguments, we could make a pretty straightforward class hierarchy for the config info and make those objects members of a class for each algorithm. It would be nice to leave serialization/deserialization as some sort of plug-in class (maybe passed as a template argument) so that the config could be dumped to whatever format people prefer without rewriting the C++ structure.

This could just be no-op methods on the plug-in serialization object. I’m a fan of punting design decisions down the road.
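
Something like this, maybe (hypothetical names, just to illustrate the plug-in-as-template-argument idea, including the no-op case):

```cpp
#include <string>

// Hypothetical config object for one algorithm.
struct nuts_config {
  int max_depth = 10;
  double init_stepsize = 1.0;
};

// Plug-in serializers: the format (or doing nothing at all) is a property of
// the plug-in, not of the config class.
struct noop_serializer {
  std::string operator()(const nuts_config&) const { return ""; }
};

struct json_serializer {
  std::string operator()(const nuts_config& c) const {
    return "{\"max_depth\":" + std::to_string(c.max_depth)
           + ",\"init_stepsize\":" + std::to_string(c.init_stepsize) + "}";
  }
};

// The algorithm object takes its serializer as a template argument.
template <class Serializer>
struct nuts_service {
  nuts_config config;
  Serializer serialize;
  std::string dump_config() const { return serialize(config); }
};
```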

I thought that was the plan all along!

Just wanted to check in and say that adding more doc is on my to-do. Hopefully by next week I’ll have updated the wiki and we can continue pushing on this.

@sakrejda and @ariddell, I updated this wiki with some of my thoughts. (it took me much longer to dig through this stuff again)

Thoughts? I didn’t get into the details on how to serialize and deserialize, but the more I think about it, the more I believe we don’t need to pass all the metadata around if we know what algorithm we ran.

I love it. I can see wanting some shims/standard writers to make things backwards compatible, since we can’t adjust interfaces immediately. Happy to help implement some of the pieces. The only part I wasn’t clear on was this:

A callback for things that pertain to sampling, so this should just accept lp__.

Could you elaborate on that some more? Why is lp__ different? Doesn’t optimize/etc… also have an lp__? What else is like lp__ in this context? That sort of stuff.

I like the proposal. Things will be much clearer.

My one proposal is to try using key-value(s) pairs for everything.
There’s no performance hit (readers, that is providers of callback
writers, can ignore the keys if they want) and it makes things much more
explicit. Apart from “Explicit is better than implicit” [1], there
really is a need for explicitness given how similar the outputs of the
sample writer and the diagnostic writer are.

[1] https://en.wikipedia.org/wiki/Zen_of_Python

On the C++ side the sample writer receives the parameter names and then at each iteration it receives a vector of parameter values. Is this sufficient or are you saying that it should always receive key-value per parameter? The reason I ask is that it’s straightforward for the writer to cache the parameter names and send them along with values if needed but doing it the other way requires packing a std::vector of parameter values into key-value pairs and then unpacking them to write the vector.
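
For concreteness, the two conventions would look something like this (hypothetical signatures):

```cpp
#include <string>
#include <vector>

// Current convention: names are written once, then each iteration supplies
// only the values, in the same order.
struct table_style_writer {
  void operator()(const std::vector<std::string>& names);  // header, once
  void operator()(const std::vector<double>& values);      // each iteration
};

// Key-value convention: every iteration carries the names along with the
// values, so the reader never has to remember a header.
struct key_value_style_writer {
  void operator()(const std::vector<std::string>& names,
                  const std::vector<double>& values);  // each iteration
};
```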

Yes, I think we should “write” the parameter names each iteration. The
callback writer can discard keys if it wants. The “write the parameter
names first, then write the samples in the same order” is just too
esoteric for my taste. Let’s just tell the callback writer exactly what
we’re writing. If we’re using key-value(s) pairs elsewhere, we might as
well do it everywhere to be consistent.

Apart from appealing to consistency and “explicit is better than implicit”,
I’ll say that getting away from header-then-rows would be valuable
insofar as it moves us explicitly away from a CSV format.

Assuming there’s no performance hit, is there an argument against this?

A table is not esoteric, and there is a performance hit. Parameters are
stored as a vector, not as key-value pairs, so in your scheme they would need
to be packed into key-value pairs for calling the writer and then unpacked
again to write as a table.

What do we mean when we talk about a performance hit? The scale for key-value pairs is something like nanoseconds to milliseconds (maybe seconds), versus sampling, which takes minutes to hours.