Stan CSV file format

Just updated the CmdStan chapter on the Stan CSV file format: https://mc-stan.org/docs/cmdstan-guide/stan-csv.html

The difficulty of parsing the Stan CSV files using standard CSV parser packages came up here:

This inspired me to document the gory details of the sampler outputs. I hope this is useful to others and for the ongoing discussion of better and faster outputs from the sampler and other Stan services. As always, feedback is welcome, as is help fleshing out the descriptions of the outputs for the other methods.


If I ignore performance for a moment:

Parsing the header by grouping on whitespace is not optimal.

Maybe verbose names could be better?

hi @ahartikainen, not sure I follow your comments - what’s the context here?

The header has multiple levels. Sure, a human can read it easily, but parsing the lines with code is a bit harder: one needs to keep track of which block is currently open. And after a couple of copy-paste errors (made by a human), the whitespace grouping can disappear. Then one would need to know all the arguments for all the samplers to recover the correct structure.

Good:

# stan_version_major = 2
# stan_version_minor = 24
# stan_version_patch = 0
# model = bernoulli_model
# method = sample (Default)

Not bad, but

#   sample
#     num_samples = 100
#     num_warmup = 200
#     save_warmup = 1
#     thin = 1 (Default)
#     adapt
#       engaged = 1 (Default)
...

is basically the same as

#   sample.num_samples = 100
#   sample.num_warmup = 200
#   sample.save_warmup = 1
#   sample.thin = 1 (Default)
#   sample.adapt.engaged = 1 (Default)
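To make the equivalence concrete, here is a minimal Python sketch that turns the indented header into the dotted form by tracking the indentation of each group. The function name `flatten_header` is made up for illustration, and the sketch assumes that every argument line looks like `key = value`, that group lines have no ` = `, and that only the configuration header is passed in (not the adaptation or timing comments further down the file).

```python
def flatten_header(lines):
    """Map indented '# key = value' comment lines to dotted keys.

    Hypothetical helper, not part of CmdStan or any Stan interface.
    """
    flat = {}
    stack = []  # (indent, group_name) for every group that is still open
    for raw in lines:
        if not raw.startswith("#"):
            continue  # not a comment line
        body = raw[1:].rstrip()
        text = body.lstrip(" ")
        if not text:
            continue  # bare '#'
        indent = len(body) - len(text)
        # A line at the same or a shallower indent closes deeper groups.
        while stack and stack[-1][0] >= indent:
            stack.pop()
        key, sep, value = text.partition(" = ")
        if sep:
            suffix = " (Default)"
            if value.endswith(suffix):
                value = value[: -len(suffix)]
            path = [name for _, name in stack] + [key]
            flat[".".join(path)] = value.strip()
        else:
            stack.append((indent, text))  # a group line such as 'sample'
    return flat


example = [
    "# stan_version_major = 2",
    "# method = sample (Default)",
    "#   sample",
    "#     num_samples = 100",
    "#     num_warmup = 200",
    "#     save_warmup = 1",
    "#     thin = 1 (Default)",
    "#     adapt",
    "#       engaged = 1 (Default)",
]
flat = flatten_header(example)
```

With the example above, `flat` contains keys such as `sample.num_samples` and `sample.adapt.engaged`. The point is that the parser has to carry the stack around only because the file does not spell out the full path; if the indentation is lost, the stack cannot be rebuilt.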

took a look at the CmdStan code - implementing this would require a refactor of the argument handling code and it would be a lot of work - I’ve messed with that code before - it burned up about a week of dev time between me and Daniel Lee - not worth it.

almost all of the argument names are unique - the exception is the keyword “file”, which is used for data block inputs, parameter init inputs, and algorithm outputs. flattening the argument names along the lines of your suggestion would lead to “data_file”, “init_file”, and “output_file”. note that the sample method already has the keyword “diagnostic_file”, which is a step in the right direction. at that point, whitespace wouldn’t matter.
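a rough Python sketch of that renaming, in case it helps the discussion: only a leaf named “file” is prefixed with its parent group, every other leaf keeps its own name. the helper `qualify_file_keys` and the dotted input keys are assumptions for illustration, not existing CmdStan argument names.

```python
def qualify_file_keys(dotted_keys):
    """Return {dotted_key: flat_name}, prefixing ambiguous 'file' leaves.

    Hypothetical helper; the dotted keys are illustrative only.
    """
    renamed = {}
    for dotted in dotted_keys:
        parts = dotted.split(".")
        if parts[-1] == "file" and len(parts) >= 2:
            # 'data.file' -> 'data_file', 'output.file' -> 'output_file'
            renamed[dotted] = parts[-2] + "_file"
        else:
            # every other leaf name is assumed to be unique already
            renamed[dotted] = parts[-1]
    return renamed


keys = [
    "data.file",
    "output.file",
    "output.diagnostic_file",
    "sample.num_samples",
]
names = qualify_file_keys(keys)
```

once every name in `names` is distinct, the group a line belongs to no longer has to be inferred from indentation.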

also note that “init_file” for specific parameter inits would allow the user to also specify the init range for all other parameters - the services layer interface allows this, it’s a limit of the current CmdStan set of argument names.

actually, you can specify “output diagnostic_file=foo.csv” for any method - not sure if any methods besides ‘sample’ do anything with it - maybe ‘vb’ does?
