Schema for callback writers: one step towards retiring CSV


#1

Right now the stan::services functions push data out to the callback writers in a format which is intended to be written directly to CSV. Changing the format these writers emit to something more general is a useful first step in moving away from CSV as the serialization format.

I’m proposing the following schema be used by the callback writers going forward. The spec is written using the protocol buffers language but this shouldn’t imply that I expect the callback writers to output protocol buffer messages directly.

message WriterMessage {

  message StringList {
    repeated string value = 1;
  }
  message DoubleList {
    repeated double value = 1;
  }
  message IntList {
    repeated int64 value = 1;
  }

  message Feature {
    oneof kind {
      StringList string_list = 1;
      DoubleList double_list = 2;
      IntList int_list = 3;
    }
  };

  string topic = 1;
  map<string, Feature> feature = 2;
};

So for each draw, a sample_writer would emit a C++ version of this (trimmed for brevity):

topic: "sample"
feature {
  key: "divergent__"
  value {
    int_list {
      value: 0
    }
  }
}
feature {
  key: "lp__"
  value {
    double_list {
      value: -0.259381
    }
  }
}
feature {
  key: "y"
  value {
    double_list {
      value: 0.720251
    }
  }
}

Discarding type information, this can be converted to JSON with google.protobuf.json_format.MessageToJson (Python):

{
  "topic": "sample",
  "feature": {
    "divergent__": {
      "intList": {
        "value": [
          "0"
        ]
      }
    },
    "lp__": {
      "doubleList": {
        "value": [
          -0.259381
        ]
      }
    },
    "y": {
      "doubleList": {
        "value": [
          0.720251
        ]
      }
    }
  }
}

(A translation of this schema into a C/C++ struct with a union type or similar would be welcome.)
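As a rough illustration of what such a translation might look like, here is a minimal sketch in Python rather than C/C++, using only the standard library. The names WriterMessage, kind_of and to_json_dict are hypothetical helpers, not part of Stan or protobuf. The sketch infers which of the three list types a feature holds and reproduces the JSON layout shown above, including the rendering of int64 values as strings.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union

# Stand-in for the schema's Feature "oneof": one homogeneous list per key.
FeatureValue = Union[List[str], List[float], List[int]]


@dataclass
class WriterMessage:
    """Hypothetical in-memory form of the proposed message."""
    topic: str
    feature: Dict[str, FeatureValue] = field(default_factory=dict)


def kind_of(values):
    """Return the JSON field name of the list type that holds `values`."""
    if not values:
        raise ValueError("cannot infer the list type of an empty list")
    if all(isinstance(v, str) for v in values):
        return "stringList"
    if any(isinstance(v, bool) for v in values):
        raise TypeError("bool is not part of the schema")
    if all(isinstance(v, int) for v in values):
        return "intList"
    if all(isinstance(v, (int, float)) for v in values):
        return "doubleList"
    raise TypeError("features must be all strings or all numbers")


def to_json_dict(msg):
    """Build a dict in the JSON layout shown above; ints become strings."""
    feature = {}
    for key, values in msg.feature.items():
        kind = kind_of(values)
        if kind == "intList":
            rendered = [str(v) for v in values]
        else:
            rendered = list(values)
        feature[key] = {kind: {"value": rendered}}
    return {"topic": msg.topic, "feature": feature}
```

A C++ version would presumably replace the Union with a tagged union or std::variant; that choice is left open here.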

Problem background

To appreciate the problem the schema solves, consider the way things work now:

We call, say, stan::services::sample::hmc_nuts_diag_e with five writers

  • message_writer Writer for messages
  • error_writer Writer for error messages
  • init_writer Writer callback for unconstrained inits
  • sample_writer Writer for draws
  • diagnostic_writer Writer for diagnostic information

And they output this sort of thing, in the order below. I’ve prefixed the lines with the writer’s name.

message_writer:
message_writer:Gradient evaluation took 3e-06 seconds
message_writer:1000 transitions using 10 leapfrog steps per transition would take 0.03 seconds.
message_writer:Adjust your expectations accordingly!
message_writer:
message_writer:
init_writer:[-0.32923]
sample_writer:["lp__","accept_stat__","stepsize__","treedepth__","n_leapfrog__","divergent__","energy__","y"]
diagnostic_writer:["lp__","accept_stat__","stepsize__","treedepth__","n_leapfrog__","divergent__","energy__","y","p_y","g_y"]
message_writer:Iteration:    1 / 2000 [  0%]  (Warmup)
sample_writer:[-3.16745e-06,0.999965,1,2,3,0,0.0142087,0.00251692]
diagnostic_writer:[-3.16745e-06,0.999965,1,2,3,0,0.0142087,0.00251692,0.168556,0.00251692]
sample_writer:[-3.16745e-06,0.815142,1,1,3,0,0.818114,0.00251692]
diagnostic_writer:[-3.16745e-06,0.815142,1,1,3,0,0.818114,0.00251692,-1.27915,0.00251692]
sample_writer:[-0.00735183,0.998801,1,2,3,0,0.00904035,-0.121259]
diagnostic_writer:[-0.00735183,0.998801,1,2,3,0,0.00904035,-0.121259,0.0581124,-0.121259]
sample_writer:[-0.8056,0.829279,1,1,3,0,0.937393,-1.26933]
diagnostic_writer:[-0.8056,0.829279,1,1,3,0,0.937393,-1.26933,-0.513407,-1.26933]
sample_writer:[-0.295987,1,1,1,1,0,0.687273,-0.769399]
diagnostic_writer:[-0.295987,1,1,1,1,0,0.687273,-0.769399,0.884631,-0.769399]
sample_writer:[-0.309831,0.996545,1,1,1,0,0.380446,-0.787186]
diagnostic_writer:[-0.309831,0.996545,1,1,1,0,0.380446,-0.787186,-0.375806,-0.787186]
sample_writer:[-1.5297,0.81227,1,1,3,0,1.53352,-1.74912]
diagnostic_writer:[-1.5297,0.81227,1,1,3,0,1.53352,-1.74912,-0.0873717,-1.74912]
sample_writer:[-0.00535958,1,1,1,3,0,1.44586,-0.103533]
diagnostic_writer:[-0.00535958,1,1,1,3,0,1.44586,-0.103533,-1.69735,-0.103533]
sample_writer:[-2.77433e-05,0.999933,1,2,3,0,0.00577986,-0.00744893]
diagnostic_writer:[-2.77433e-05,0.999933,1,2,3,0,0.00577986,-0.00744893,-0.107258,-0.00744893]
message_writer:Iteration: 1010 / 2000 [ 50%]  (Sampling)

Note that to interpret the output of these writers you have to build a state machine that keeps track of the mapping between variable names and elements of the draws. If you’re just dumping the output to CSV, this isn’t a problem. It is a problem in every other case.
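To make the state-machine point concrete, here is a minimal Python sketch of the bookkeeping a consumer of one writer has to do today. It assumes the rows arrive as Python lists, with a list of strings being a header and a list of numbers being a draw; label_rows is a hypothetical helper, not an existing Stan or interface function.

```python
def label_rows(rows):
    """Pair each numeric row with the most recently seen header row."""
    header = None  # the only state: the last list of variable names
    for row in rows:
        if row and all(isinstance(x, str) for x in row):
            header = list(row)
            continue
        if header is None:
            raise ValueError("draw seen before any header")
        if len(row) != len(header):
            raise ValueError("row length does not match header length")
        yield dict(zip(header, row))


rows = [
    ["lp__", "accept_stat__", "y"],
    [-3.16745e-06, 0.999965, 0.00251692],
    [-0.8056, 0.829279, -1.26933],
]
for draw in label_rows(rows):
    print(draw)
```

Each writer needs its own copy of this state, and the sketch does not yet handle non-draw output interleaved on the same writer.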

Thoughts? I think we could adopt this schema with very modest changes to the callback writers and no changes to the interfaces.


#2

Topic should be an Enum. The revised schema is now:

message WriterMessage {

  message StringList {
    repeated string value = 1;
  }
  message DoubleList {
    repeated double value = 1;
  }
  message IntList {
    repeated int64 value = 1;
  }

  message Feature {
    oneof kind {
      StringList string_list = 1;
      DoubleList double_list = 2;
      IntList int_list = 3;
    }
  };

  enum Topic {
    UNKNOWN = 0;
    MESSAGE = 1;         // non-error messages
    ERROR = 2;           // error messages
    INITIALIZATION = 3;  // unconstrained inits
    SAMPLE = 4;          // draws
    DIAGNOSTIC = 5;      // diagnostic information
  }

  Topic topic = 1;
  map<string, Feature> feature = 2;
};
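For interfaces that want to experiment before any .proto file exists, the enum can be mirrored directly. The following Python sketch copies the numeric values from the schema above; the WRITER_TOPICS table and topic_for function are hypothetical glue that tags output from the five existing writers.

```python
from enum import IntEnum


class Topic(IntEnum):
    """Numeric values copied from the Topic enum in the schema above."""
    UNKNOWN = 0
    MESSAGE = 1
    ERROR = 2
    INITIALIZATION = 3
    SAMPLE = 4
    DIAGNOSTIC = 5


# Hypothetical mapping from today's writer arguments to topics.
WRITER_TOPICS = {
    "message_writer": Topic.MESSAGE,
    "error_writer": Topic.ERROR,
    "init_writer": Topic.INITIALIZATION,
    "sample_writer": Topic.SAMPLE,
    "diagnostic_writer": Topic.DIAGNOSTIC,
}


def topic_for(writer_name):
    """Return the topic for a writer name, or UNKNOWN if unrecognized."""
    return WRITER_TOPICS.get(writer_name, Topic.UNKNOWN)
```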

#3

What’s that a schema for? I don’t understand what you mean when you say it’ll write a C++ version.

The reason we decided on a row-major format with up-front labels is that anything else dumps out a lot of redundant information every draw. So if we literally dump out string keys with every iteration, this is going to generate way too much output.


#4

“C++ version” would be the C++ equivalent of the protobuf message. (There’s a straightforward mapping of protobuf types to C++ types.)

The schema is intended to formalize what the stan::services functions are doing with their various stan::callbacks::writers. Right now a great deal is left undefined. This schema is one way of making it explicit.

An interface such as cmdstan is free to ignore the extra context the schema provides and continue to dump, say, draws to CSV. The schema will help interfaces which don’t want to use CSV.
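As a sketch of how little a CSV-oriented interface would need to change, the Python below flattens keyed sample messages back into header-plus-rows CSV. It assumes messages are plain dicts of the form {"topic": ..., "feature": {name: [value]}} with one scalar per feature; draws_to_csv is a hypothetical helper and this is not how CmdStan is implemented.

```python
import csv
import io


def draws_to_csv(messages, fp):
    """Write sample messages as CSV: header once, then one row per draw."""
    writer = None
    names = None
    for msg in messages:
        if msg["topic"] != "sample":
            continue  # other topics are ignored by a CSV-only consumer
        feature = msg["feature"]
        if writer is None:
            names = list(feature)  # column order fixed by the first draw
            writer = csv.writer(fp, lineterminator="\n")
            writer.writerow(names)
        writer.writerow([feature[name][0] for name in names])


messages = [
    {"topic": "message", "feature": {"text": ["Iteration: 1 / 2000"]}},
    {"topic": "sample", "feature": {"lp__": [-0.259381], "y": [0.720251]}},
    {"topic": "sample", "feature": {"lp__": [-0.8056], "y": [-1.26933]}},
]
buffer = io.StringIO()
draws_to_csv(messages, buffer)
print(buffer.getvalue())
```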

My sense is that the schema also could solve the serialization problem since draws (and everything associated with the draws, including diagnostic messages, init messages) could be saved in a file as a sequence of these protobuf messages (possibly converted to json).
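A minimal sketch of that idea, using JSON text in place of binary protobuf messages: write one message per line as it is produced and read the whole sequence back later. The function names and the one-message-per-line layout are assumptions made for illustration, not an agreed format.

```python
import io
import json


def dump_messages(messages, fp):
    """Append each message to `fp` as one line of JSON."""
    for message in messages:
        fp.write(json.dumps(message, sort_keys=True))
        fp.write("\n")


def load_messages(fp):
    """Read back every non-empty line as one message, in order."""
    return [json.loads(line) for line in fp if line.strip()]


messages = [
    {"topic": "initialization", "feature": {"y": [-0.32923]}},
    {"topic": "sample", "feature": {"lp__": [-0.259381], "y": [0.720251]}},
]
stream = io.StringIO()
dump_messages(messages, stream)
stream.seek(0)
print(load_messages(stream) == messages)
```

Because each line is complete on its own, a reader can consume the file while the sampler is still writing to it.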


#5

Does the schema get encoded in every iteration dump or is it standoff? I’m worried about efficiency if it’s not standoff. That was where we got stuck the last time we had this discussion.

Yes, it would solve the incremental write problem. But then CSV can solve that problem, too, just not the incremental restart one.

Don’t protobufs write in binary? That would be much more efficient for machine-to-machine communication. JSON’s of course easier for humans to browse (and perhaps other programs).


#6

It’s standoff. And it’s binary (but has a defined/canonical translation into JSON).

I’ll dogfood this with httpstan and translate the CSV-like output of stan::services::sample::* into this format. If all goes well I’ll let everyone know.


#7

I think I’m missing something. Just trying to step back and see the big picture, it looks like you want to use either a DSL or some other thing to describe the output layer. I’d prefer not to do that, if that’s what you’re suggesting.

First, our current design isn’t how it will continue. Having all the writer classes be the same was a compromise to get the refactor in. Going forward, writers will have different classes and have different purposes. It’ll make the design cleaner and easier to keep straight, at the expense of having a few different classes. I think it’ll be well worth it.

The goal would be to have a common way to serialize and deserialize these things consistently, which is hard right now.


#8

The goal here is to make the output of the current callback writers modestly more self-describing in the short term for the interfaces which have to post-process the output. Right now the output “format” emitted by, say, sample_writer or diagnostic_writer (e.g., headers followed by samples) is not documented anywhere, except implicitly by the cmdstan CSV format.

If there’s a more robust solution in the works, that’s great. Is there an issue on github that I can follow?


#9

One thing that I found challenging when writing a parser for the output of the writers is handling the adapt messages. If there’s adaptation, then the sample_writer actually doesn’t write samples initially; it first writes some adaptation info:

sample_writer:"Adaptation terminated"
sample_writer:"Step size = 0.809818"
sample_writer:"Diagonal elements of inverse mass matrix:"
sample_writer:0.961989
... then headers, then samples

Rather than add an adapt_writer, which seems like what one should do in the current setup, one could just add a new topic type to the schema, topic = ADAPT. Of course, if stan::services used the schema, we wouldn’t need multiple writers at all, we’d just need one.
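To illustrate, here is a hedged Python sketch that folds the adaptation lines shown above into a single message with topic ADAPT. The key name stepsize is made up for the example, and the sketch assumes the mass matrix elements arrive either as a comma-separated string, a single number, or a list of numbers; parse_adapt is a hypothetical helper.

```python
def parse_adapt(items):
    """Collect adaptation output from sample_writer into one message."""
    feature = {}
    expect_diag = False  # True right after the mass matrix heading
    for item in items:
        if isinstance(item, str) and item.startswith("Step size = "):
            feature["stepsize"] = [float(item.split("=", 1)[1])]
        elif isinstance(item, str) and item.startswith(
            "Diagonal elements of inverse mass matrix"
        ):
            expect_diag = True
        elif expect_diag:
            if isinstance(item, str):
                values = [float(x) for x in item.split(",")]
            elif isinstance(item, (int, float)):
                values = [float(item)]
            else:
                values = [float(x) for x in item]
            feature["diag_inverse_mass_matrix"] = values
            expect_diag = False
    return {"topic": "ADAPT", "feature": feature}


lines = [
    "Adaptation terminated",
    "Step size = 0.809818",
    "Diagonal elements of inverse mass matrix:",
    "0.961989",
]
print(parse_adapt(lines))
```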

I’m a devoted fan of the services refactor and I’ll be happy with whatever solution ends up getting implemented.


#10

That’d be great then if this is all standoff schemas. The one thing that tanked the proposal before was the thought of writing verbose self-describing JSON at every iteration. I/O isn’t a bottleneck in CmdStan (well, the input is for very fast processes like linear model optimization), but a binary format would be faster still.

Are there readers in R, Python, etc. for the schemas you’re talking about?

We absolutely need to move our I/O into some kind of more standard format. It’s one of the checkboxes on everyone’s “will your project survive” checklist.


#11

@Bob_Carpenter: what are “standoff schemas”?

@ariddell, unfortunately, no. I’ll start putting together something more formal in the next couple weeks. Right now, I’m working through the logging, but the others will need a lot more thought. Here are bits and pieces of thoughts (all on the wiki):

  1. https://github.com/stan-dev/stan/wiki/Design:-Consolidated-Output-for-Sample,-Optimize,-ADVI,-and-Diagnose
  2. https://github.com/stan-dev/stan/wiki/Protocol-Buffers-for-serialization-of-input-data,-output-samples,-initial-values,-input-parameters,-and-output-messages,
  3. https://github.com/stan-dev/stan/wiki/Output-format

I completely agree.

Maybe you can explain how you’re envisioning schemas and topics. I don’t see how this makes it less complicated, but I could be missing something.

(Once upon a time I used to work for a company that built middleware and installed it in places. It had these schemas and other things and it was a real mess to have a generic middleware with a specific layer written in something else.)

There are two issues with the serialization that we haven’t quite gotten a handle on:

  1. We don’t have a representation that can be serialized and deserialized. Part of the problem is that we haven’t enumerated the output (like what @ariddell has done… I’ve done the exercise, but didn’t write it down). If we knew what sort of objects needed to be constructed from deserialization, then it wouldn’t much matter what sort of serialization happens as long as it’s fast enough, small enough, and, as a secondary concern, something standard.
  2. We haven’t considered the chronological order of the output. What parts are required to be in chronological order and what parts aren’t? Right now, we assume everything is in chronological order and that’s fine.

#12

Yes, everything can read protocol buffers with the schema. We do have to worry about distributing the schema definition itself (a .proto file). Distributing it with Stan would make sense, I think.

One thing that I especially like about protocol buffers is that there is the canonical transformation into JSON. If people want to save their draws in a version that’s text-based and mostly self-describing (i.e., not standoff, with the parameter names associated with every draw), they’re free to do so!


#13

At the C++ level, I’m indifferent between (1) one callback writer (class?) for every type of message and (2) a schema with a writer type (“topic”) enum. Thinking more broadly, however, it seems like protocol buffers or something similar does offer a language-neutral way of specifying what kind of data the stan::services functions are going to spit out. It also goes some way towards solving our serialization problem, since you can serialize the cumulative output of all the callback writers (if they conform to the schema) to disk and call that the “fit”.


#14

@ariddell, should we try to see what a protocol buffer description of a sampler looks like? (By we, I really mean at least a two-person effort.) I’d be happy to see if it’s feasible to describe the output out of even the default sampler in a protocol buffer.

Knowing what it does now, my fear is that it’s not going to be easily jammed into that sort of format, especially across different types of samplers. But, maybe if we actually wrote a prototype, we might be able to tame the output and force it to adhere to something reasonable.

The more I think about it, the more I think we shouldn’t let the current state dictate how we move about things, so it’d be good to see what we can do here. I’d really not like to make the diagnostic output disappear, but maybe we’ll need to.

Let me know what you think.


#15

I’ve tried to do various bits of this before so I would also like to be involved.

I have a branch somewhere with protobuf writer callbacks and tests, and I got some rough timings for it. I did roughly what @ariddell is suggesting here with the schema, and it was definitely fast enough as a first pass, but we hadn’t agreed on what the C++ needed to look like so I never pursued it. I do think getting the writers on the C++ side figured out is key.

Are you concerned about saving the sampler state itself? We should probably define message types for each of the bits of data in the sampler that normally get output in .csv. It’s pretty straightforward to do that, but I’m not sure I understand what your actual concern is.


#16

@syclik By standoff, we just mean the usual—the schema isn’t packed up with the data itself as it’s shipped. Specifically, I was worried about shipping a full schema with each iteration. That’s why we decided an iteration-based JSON approach wouldn’t work.
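The size concern can be estimated with a small Python sketch that compares the two text layouts: names written once followed by bare rows, versus names repeated with every draw. It measures JSON text length only, so it says nothing about binary protobuf encoding; standoff_size and keyed_size are hypothetical helpers.

```python
import json


def standoff_size(names, draws):
    """Characters needed when names are written once, then bare rows."""
    header = len(json.dumps(names))
    return header + sum(len(json.dumps(draw)) for draw in draws)


def keyed_size(names, draws):
    """Characters needed when every draw repeats all of its names."""
    return sum(len(json.dumps(dict(zip(names, draw)))) for draw in draws)


names = ["lp__", "accept_stat__", "stepsize__", "treedepth__", "y"]
draws = [[-0.8056, 0.829279, 1, 1, -1.26933]] * 1000
print(standoff_size(names, draws), keyed_size(names, draws))
```

The gap grows with the number of draws and with the length of the parameter names, which is the overhead being discussed.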

@ariddell Great news that we can auto-convert to JSON. If the libraries are suitably open sourced (i.e., not GPL-ed), we should be able to use them in CmdStan.


#17

Concretely, I think there is a way to move slowly away from the current setup, where a callback writer only outputs the following:

  • vector<string>
  • vector<double>
  • string

to a writer which outputs

  • map<string, vector<string>>
  • map<string, vector<double>>

which, in theory, is easy to translate into the protobuf schema in the original post. Getting the protobuf schema into Stan doesn’t strike me as that important (it can live in a third party library if need be). Moving away from a system that is wedded to CSV is the most important thing.

If the sample writer were to emit samples with keys, e.g., {'lp__': [-1243.0], 'y': [1.2], ...} instead of an opaque list of doubles, it would be a lot easier to translate things into some other (non-csv) format. Having an informative key is especially important for output such as the adapt output. We need some context to know that [-1243.0, 1.2] is a draw rather than, say, the diagonal elements of the inverse mass matrix. (The latter would become, in this new setup, {'diag_inverse_mass_matrix': [-1243.0, 1.2]}.) (I’m using JSON notation to describe a C++ std::map<std::string, std::vector<T>> here.)

This change could be made without too much work and would, I think, improve things a lot. Maybe an adapt_writer could be introduced as well (in the stan::services::sample functions)? All this put together would make the “output format” of the stan::services::sample functions much more self-describing.

edit: clarifying edits


#18

We don’t need to define the writer outputs so narrowly; that’s the whole point of having writers. For the input, sure: we want what you’re describing to be possible, so it would be nice to type the messages a little more and get away from text parsing. For the output, an obvious use case is a writer that outputs the header once and then outputs vectors of samples. This makes sense whether you’re talking about .csv or protobuf. So you need:

  • vector<string> for headers
  • vector<double> for samples
  • map<string, vector<double>> for mass matrix
  • map<string, string> for config and timing info, we could type these further so writers don’t have to cast I guess

#19

So basically the sections below here: https://github.com/stan-dev/stan/wiki/Design:-Consolidated-Output-for-Sample,-Optimize,-ADVI,-and-Diagnose#proposed-hmc-output


#20

I hope nobody thought I was suggesting that we write our own protocol buffer parser.

Mitzi and I wrote an event/callback-based JSON parser way back because there wasn’t a good open-source one we could integrate.

I do not want to put these informative keys into our output. If the output can be transformed into something like a dictionary string that can be read back into Python, that’s fine by me, but the overhead of dumping all those characters out will take up a lot of space!

I didn’t understand what you meant by writers outputting maps. Outputs to where or to which client?