Right now the stan::services functions push data out to the callback writers in a format which is intended to be written directly to CSV. Changing the format these writers emit to something more general is a useful first step in moving away from CSV as the serialization format.
I’m proposing the following schema be used by the callback writers going forward. The spec is written using the protocol buffers language but this shouldn’t imply that I expect the callback writers to output protocol buffer messages directly.
message WriterMessage {
  message StringList {
    repeated string value = 1;
  }
  message DoubleList {
    repeated double value = 1;
  }
  message IntList {
    repeated int64 value = 1;
  }
  message Feature {
    oneof kind {
      StringList string_list = 1;
      DoubleList double_list = 2;
      IntList int_list = 3;
    }
  }
  string topic = 1;
  map<string, Feature> feature = 2;
}
So a draw written by a sample_writer would write a C++ version of this (trimmed for brevity):
topic: "sample"
feature {
  key: "divergent__"
  value {
    int_list {
      value: 0
    }
  }
}
feature {
  key: "lp__"
  value {
    double_list {
      value: -0.259381
    }
  }
}
feature {
  key: "y"
  value {
    double_list {
      value: 0.720251
    }
  }
}
Discarding type information, this can be converted to JSON with google.protobuf.json_format.MessageToJson (Python):
{
  "topic": "sample",
  "feature": {
    "divergent__": {
      "intList": {
        "value": [
          "0"
        ]
      }
    },
    "lp__": {
      "doubleList": {
        "value": [
          -0.259381
        ]
      }
    },
    "y": {
      "doubleList": {
        "value": [
          0.720251
        ]
      }
    }
  }
}
(A translation of this schema into a C/C++ struct with a union type or similar would be welcome.)
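As a starting point for that translation, here is a rough C++17 sketch that uses std::variant in place of the oneof. Everything in it (the writer_sketch namespace, the Feature alias, the WriterMessage struct, and the example_draw and double_list helpers) is hypothetical and does not exist in Stan; it only mirrors the schema above under the assumption that C++17 is available.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <variant>
#include <vector>

// Hypothetical C++17 mirror of the WriterMessage schema.
// None of these names exist in Stan today.
namespace writer_sketch {

// Mirrors `oneof kind` in Feature: exactly one of the three list types.
using Feature = std::variant<std::vector<std::string>,     // string_list
                             std::vector<double>,          // double_list
                             std::vector<std::int64_t>>;   // int_list

struct WriterMessage {
  std::string topic;                       // e.g. "sample"
  std::map<std::string, Feature> feature;  // keyed by variable name
};

// Builds the trimmed example draw from the text-format message above.
inline WriterMessage example_draw() {
  WriterMessage msg;
  msg.topic = "sample";
  msg.feature["divergent__"] = std::vector<std::int64_t>{0};
  msg.feature["lp__"] = std::vector<double>{-0.259381};
  msg.feature["y"] = std::vector<double>{0.720251};
  return msg;
}

// Returns the double_list stored under `name`. Asserts that the feature
// exists and that it actually holds doubles.
inline const std::vector<double>& double_list(const WriterMessage& msg,
                                              const std::string& name) {
  auto it = msg.feature.find(name);
  assert(it != msg.feature.end());
  assert(std::holds_alternative<std::vector<double>>(it->second));
  return std::get<std::vector<double>>(it->second);
}

}  // namespace writer_sketch
```

A consumer could then read a value with writer_sketch::double_list(msg, "lp__") without knowing anything about column order. A tagged union in plain C would work as well; std::variant is used here only because it keeps the type tag and the payload together.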
Problem background
To appreciate the problem the schema solves, consider the way things work now:
We call, say, stan::services::sample::hmc_nuts_diag_e with five writers:
- message_writer Writer for messages
- error_writer Writer for error messages
- init_writer Writer callback for unconstrained inits
- sample_writer Writer for draws
- diagnostic_writer Writer for diagnostic information
And they output this sort of thing, in the order below. I’ve prefixed the lines with the writer’s name.
message_writer:
message_writer:Gradient evaluation took 3e-06 seconds
message_writer:1000 transitions using 10 leapfrog steps per transition would take 0.03 seconds.
message_writer:Adjust your expectations accordingly!
message_writer:
message_writer:
init_writer:[-0.32923]
sample_writer:["lp__","accept_stat__","stepsize__","treedepth__","n_leapfrog__","divergent__","energy__","y"]
diagnostic_writer:["lp__","accept_stat__","stepsize__","treedepth__","n_leapfrog__","divergent__","energy__","y","p_y","g_y"]
message_writer:Iteration: 1 / 2000 [ 0%] (Warmup)
sample_writer:[-3.16745e-06,0.999965,1,2,3,0,0.0142087,0.00251692]
diagnostic_writer:[-3.16745e-06,0.999965,1,2,3,0,0.0142087,0.00251692,0.168556,0.00251692]
sample_writer:[-3.16745e-06,0.815142,1,1,3,0,0.818114,0.00251692]
diagnostic_writer:[-3.16745e-06,0.815142,1,1,3,0,0.818114,0.00251692,-1.27915,0.00251692]
sample_writer:[-0.00735183,0.998801,1,2,3,0,0.00904035,-0.121259]
diagnostic_writer:[-0.00735183,0.998801,1,2,3,0,0.00904035,-0.121259,0.0581124,-0.121259]
sample_writer:[-0.8056,0.829279,1,1,3,0,0.937393,-1.26933]
diagnostic_writer:[-0.8056,0.829279,1,1,3,0,0.937393,-1.26933,-0.513407,-1.26933]
sample_writer:[-0.295987,1,1,1,1,0,0.687273,-0.769399]
diagnostic_writer:[-0.295987,1,1,1,1,0,0.687273,-0.769399,0.884631,-0.769399]
sample_writer:[-0.309831,0.996545,1,1,1,0,0.380446,-0.787186]
diagnostic_writer:[-0.309831,0.996545,1,1,1,0,0.380446,-0.787186,-0.375806,-0.787186]
sample_writer:[-1.5297,0.81227,1,1,3,0,1.53352,-1.74912]
diagnostic_writer:[-1.5297,0.81227,1,1,3,0,1.53352,-1.74912,-0.0873717,-1.74912]
sample_writer:[-0.00535958,1,1,1,3,0,1.44586,-0.103533]
diagnostic_writer:[-0.00535958,1,1,1,3,0,1.44586,-0.103533,-1.69735,-0.103533]
sample_writer:[-2.77433e-05,0.999933,1,2,3,0,0.00577986,-0.00744893]
diagnostic_writer:[-2.77433e-05,0.999933,1,2,3,0,0.00577986,-0.00744893,-0.107258,-0.00744893]
message_writer:Iteration: 1010 / 2000 [ 50%] (Sampling)
Note that to interpret the output of these writers you have to build a state machine which keeps track of the mapping between variable names and elements of the draws. If you’re just dumping the output to CSV, this isn’t a problem. It is a problem in all other cases.
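To make that state-keeping concrete, here is a minimal sketch of the bookkeeping a consumer has to do today. The DrawLabeler class and its on_names and on_values methods are hypothetical names invented for this illustration; they are not part of Stan and only assume that the names arrive once, before any row of values.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical consumer-side adapter; not part of Stan.
// It remembers the header row so that later rows can be labeled.
class DrawLabeler {
 public:
  // Called once with the column names (the first sample_writer line).
  void on_names(const std::vector<std::string>& names) { names_ = names; }

  // Called for every subsequent row of values. Only meaningful after
  // on_names(), and only if the row has one value per name.
  std::map<std::string, double> on_values(
      const std::vector<double>& values) const {
    assert(!names_.empty());
    assert(values.size() == names_.size());
    std::map<std::string, double> draw;
    for (std::size_t i = 0; i < names_.size(); ++i) {
      draw[names_[i]] = values[i];
    }
    return draw;
  }

 private:
  std::vector<std::string> names_;  // state carried between callbacks
};
```

Every consumer other than a CSV dump needs something like names_ above, and it breaks if a row is ever seen before the header. With the proposed schema each message carries its own names, so this class would not be needed.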
Thoughts? I think we could adopt this schema with very modest changes to the callback writers and no changes to the interfaces.