Schema for callback writers: one step towards retiring CSV

We don’t have to go with key-value pairs at all!

When we set up a writer for the parameters, on construction, we know what the value is. If anyone tries to pass something different it would be a developer error.

I’ll put up more detail later and maybe it’ll be straightforward enough.

My main concern is pushing a key-value data structure into the stan C++ code when there’s no reason to do it: a given callback can always produce key-value output (I’ve written such writers for both ProtoBuf and PostgreSQL for testing). The CmdStan format has plenty of flaws but writing a table is not one of them.

I’ve provided two concrete reasons to do it (consistency/principle-of-least-surprise and moving away from (implicit) CSV). I haven’t heard any reasons not to do it. If there’s a performance hit which is more than 5ms per 1000 samples I’d be shocked. (Writing samples isn’t the bottleneck in any event.)

@sakrejda, looking at the code, I don’t think my proposed strategy requires much more “packing”. Check out https://github.com/stan-dev/stan/blob/develop/src/stan/services/util/mcmc_writer.hpp I believe there’s a vector of doubles stack allocated and then filled by the sampler. That vector is passed to the callback writer. With the scheme I’m proposing you stack allocate a vector of string, double pairs, fill it, and pass it to the callback writer. I don’t think there will be much of a performance hit.

If you think I’m completely wrong about the magnitude of the performance hit – i.e., it’s much more than 1-5ms per 1000 samples – naturally that changes the discussion.

Great questions!

I’ll start with what we currently work with:

  • in sampling, lp__ is the log joint probability distribution evaluated at the unconstrained parameter values including the Jacobian term and up to an additive constant (proportionality constant on the log scale).
  • in optimization, lp__ is the value of the objective function, which is the log joint probability distribution evaluated at the unconstrained parameter values up to an additive constant. There is no Jacobian term. On any model with a non-trivial transform, if you took the same unconstrained parameters, you would get different values under sampling and optimization.
  • in ADVI, lp__ is always 0. It’s unused and in the inference portion of the algorithm, it would probably make sense we to replace that with the elbo or something else. Post-inference algorithm, ADVI also draws Monte Carlo samples from the variational solution. I’m pretty sure the lp__ value isn’t computed here.

I’m thinking ahead. We’ve already seen it with ADVI, but there may be new algorithms in the future that don’t report lp__. The all have to report something, but they should report what they need instead of trying to conform to how sampling deals with it. And… we’ve overloaded the meaning of lp__ across sampling and optimization. (I’m not sure what fixed_param returns for the lp__ value. It could report a number of things including 0.) Also, with Michael’s improvements to the algorithms, there might be additional things that are reported. Perhaps not at the sampling algorithm level, but definite within algorithms.


Our current design is hindering us from moving forward quickly with updates to sampling algorithms. We’ve had issues with being able to move quickly here because we’re putting these things together: lp__, sampler parameters (tree_depth__, etc) and the actual values. If we had those three things separately, it would be easier to report additional fields (like energy__ or something new!) without having it affect the other two things which really shouldn’t change at all.

Maybe it’s true that there’s no substantial performance penalty to this, but you’re trying to make C++ pythonic which makes no sense. This can happen in the writer (with no substantial performance penalty) without messing with the internal data structure used by the sampler.

Forcing output with inconsistent requirements into one format is not a virtue and what you’re calling implicit CSV is a table—a widely used and understood format.

This is a minor concern.

Ok, that makes a lot of sense. You might save some trouble if you add this explanation to the wiki (maybe I missed it).

Yikes, yes, now I recall all the questions we’ve gotten about this on the mailing list. Would be great if lp__ had one meaning.

Here’s a very good reason not to do what you suggest: a future developer will try to use the same writer for multiple uses. It’s what the design begs for. I’d rather have the names of the parameters be an invariant between calls of the writer. Unless we’re testing the stacked strings at each call, I don’t think this is the right thing to do.

I agree that speed isn’t an issue. The issue is the purpose of the writer and having the design speak to that.

The multiple uses thing is a concern but a red herring, as you should
know. In the current setup the sample writer is already being misused,
that’s why we’re changing things. If the key-value(s) writer is more
susceptible to misuse than the already-susceptible callback writer, I’d
like to know why and by how much.

At least give the sample params callback writer a separate class, like
table_writer or something.

Agreed, these things should have separate classes!

1 Like

I think everyone is forgetting that the internal API doesn’t know anything about the shape of the parameters, rather they are used as a single monolithic vector. When dumping output the model instance is used to transform the unconstrained parameters to constrained parameters. If the writer wants to pass the constrained parameters as key/value pairs then the model class has to be updated to extract those shapes and filter out the appropriate values to the right keys, which will be a bit hairy. Not impossible, but hairy.

Then you’d have to have possible values defined for every type in the Stan language. If the MCMC writer has to do this then you’d be coupling the algorithms with the language, which is a no-no.

Long story short: key/value pairs would require a huge amount of processing to go from algorithm output to a key/value writer.

I’m not forgetting this.

The (flat) parameter names are available to mcmc_writer without too much
work. I think these should be included with the values rather than
provided as a “header”.

This is a weakness in the API. Mav had to write a bunch of code for rstan a long time ago just to know what the dimensions of things were.

But they’re not. The names are also flattened out by the model instance to be 1-1 with the parameter values, so without doing any manipulation we’d just have a key/scalar value pair for each component which wouldn’t give anything different then the header/vector of parameter values.

It would be different. It would be (1) more explicit and (2) consistent
with the key-value format of the configuration writer.

I think we’re talking about dumping out keys every iteration. Very rarely do the iterations take minutes or hours. I/O can dominate compute time in large optimization problems if the I/O is inefficient (as coming in through our dump format in CmdStan).

1 Like

My money’s on more than 1–5 ms per 1000 iterations, but it’s going to depend on how many parameters there are.

I think “consistency” is largely in the eye of the beholder. Could I criticize a Python program that used an array for one data structure and a dictionary for another as being inconsistent and tell them to choose a data structure and stick with it? I think that’s what we’re talking about here.

I’ve never heard of “principle of least surprise”, so I looked it up: https://en.wikipedia.org/wiki/Principle_of_least_astonishment

I don’t know that people are bringing a lot of preconceptions to how we dump out draws, so I don’t see them being surprised no matter what we do. If they are bringing preconceptions, maybe they come from from working with databases, which are a lot like our MCMC output. There are column names and types, then a bunch of rows. Let’s say you assume users have read the database spec in PEP 249. In that spec, a row of a database table is returned as a tuple of values without the column names.

I’m not sure it’s so much as a red herring, but more poor design. Good design will make using it easy and misusing it difficult.

There isn’t misuse in the sample writer. It’s just the way it was evolved without a real design.

The key-value writer is much more difficult to deal with. Part of the difficulty is social, part of it is technical.

  • social: developers will change the output without thinking of how the interfaces deal with the output. (This is true in practice and not just a hypothetical concern)
    • interfaces maintain their own writers to whatever is convenient in their platform
    • changing something as minor as renaming a key has consequences and shouldn’t be taken lightly
  • technical: reading the output back in is simplified if we’re restricting the output
    • with just a key-value writer, anything can be output.

(sorry, I typed most of this up earlier as a draft… I’m seeing a whole bunch of other conversation, so I’m just pushing this out; I’ll catch up.)

That seems closest to what I’m suggesting, except we won’t allow different types.

For writing draws, all the values are double. I could see us possibly dealing with int generated quantities being stored as int for the constrained parameters.

For the other writers, all the arguments should be typed. There’s no reason we’re jamming tree_depth__ and divergent__ as doubles.

In fairness Bob, the database spec (PEP 249) allows returning rows
(without column names). It never says you can return first a row with
column names and then a row with the values. This “table writer” API
really is something unusual.

Now, we can talk about the IO hit. It would be interesting to see how
costly all those keys are.

That code shouldn’t be there any more. It should be queried directly from the instantiated model class which has methods to provide that information. Let me know if it’s still there and we can replace it.