Parquet support would likely be added to the posterior package, which is currently used by cmdstanr and brms, supports handy presentation of draws as arrays, matrices, data frames, or rvars, and has better ESS/MCSE estimation than coda (see Comparison of MCMC effective sample size estimators). Since you mention coda, I'm curious what the key features of that package are for you (off-topic for this thread, so you can also send me a private message)?
I actually mentioned coda because I’ve seen it mentioned in several statistics papers proposing new MCMC methodology. Since I’m interested in trying to create an interoperable file format, I was thinking that if coda is widely used, then substantial adoption of the format would mean that popular diagnostic packages would need to read the format.
However, based on what Brian and Jason said, I could simply provide a function to read the file format and create the type of R objects that it works on.
(Thanks for the ESS comparison link)
Is it currently the case that diagnostic tools exist (besides the provided stansummary) that directly analyze the Stan CSV files?
Hmm. Is that the right question, though? Even if no tools exist that can load a posterior sample from a CSV, such tools probably SHOULD exist.
My understanding is that most of these rely on you getting the data into memory yourself and then providing it for analysis (this is certainly the case in Python, where a package like ArviZ can accept a cmdstanpy object, but it relies on cmdstanpy to actually populate the draws in that object from disk).
I was thinking that it should be pretty easy to convert an R data frame to whatever object one's favorite R diagnostics tool uses. Since you can read CSV using read.table, CSV would then effectively be supported.
Is this not the case?
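The same idea works without any extra packages at all: here is a minimal Python sketch (using only the standard library; the file contents are a fabricated toy example of a Stan-style CSV) showing that skipping the `#` comment lines is enough to get the draws into plain data structures.

```python
import csv

# Fabricated toy example of a Stan-style CSV; real CmdStan output has
# many more '#' configuration and timing comments.
stan_csv = """\
# model = example
lp__,theta
-7.3,0.25
-7.1,0.31
# elapsed time comment
"""

def read_stan_csv(text):
    """Read a Stan CSV, skipping '#' comment lines; returns (header, rows)."""
    lines = (l for l in text.splitlines() if not l.startswith("#"))
    reader = csv.reader(lines)
    header = next(reader)
    rows = [[float(x) for x in row] for row in reader if row]
    return header, rows

header, rows = read_stan_csv(stan_csv)
print(header)   # ['lp__', 'theta']
print(rows[0])  # [-7.3, 0.25]
```

From there, converting to a data frame (or whatever your tool expects) is a one-liner in pandas or R.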
In evolutionary biology, everybody uses TSV instead, so there are a number of different tools that can read and analyze TSV. Converting CSV to TSV so that it can be analyzed using those tools can be done with a simple sed command.
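For concreteness, the sed conversion can look like the sketch below (the file name is made up). This naive substitution is safe for draw files because they contain only numbers and bare column names, but it would break on general CSV with quoted fields containing commas.

```shell
# Make a tiny draws file, then convert commas to tabs with sed.
# (GNU sed interprets \t; on BSD sed use a literal tab or tr ',' '\t'.)
printf 'lp__,theta\n-7.3,0.25\n' > draws.csv
sed 's/,/\t/g' draws.csv > draws.tsv
cat draws.tsv
```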
The answer is “it depends”.
Does your diagnostic tool need to distinguish between “special” variables like the log density and parameters? Then it needs to know about the specific Stan convention of naming it lp__.
Does it need to know whether warmup draws are included in or excluded from the sample? Then it needs to read the incredibly Stan-specific comments in that CSV file to see whether save_warmup was true, and if so, how many warmup iterations were used.
If your tool doesn't care about those things, throwing a data.table in is probably correct and fine (and I think you would do the same if you got a data.table from a parquet file). But if it does, you need something to do specific processing or extraction.
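As a sketch of what that "specific processing" involves, here is a minimal stdlib-only parser for the `key = value` comment lines CmdStan writes at the top of its CSV output. The header excerpt is fabricated for illustration; real files carry many more configuration comments.

```python
import re

# Fabricated excerpt of a CmdStan CSV comment header.
header_comments = """\
# method = sample (Default)
#   sample
#     num_samples = 1000 (Default)
#     num_warmup = 1000 (Default)
#     save_warmup = 1
"""

def parse_stan_config(text):
    """Extract 'key = value' pairs from '#' comment lines."""
    config = {}
    for line in text.splitlines():
        m = re.match(r"#\s*(\w+)\s*=\s*(\S+)", line)
        if m:
            config[m.group(1)] = m.group(2)
    return config

cfg = parse_stan_config(header_comments)
save_warmup = cfg.get("save_warmup") == "1"
num_warmup = int(cfg.get("num_warmup", 0))
print(save_warmup, num_warmup)  # True 1000
```

A tool that wants to drop warmup draws would then skip the first `num_warmup` data rows when `save_warmup` is set.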
Sure. Anything can export CSV, but a generic CSV reader isn’t going to support that kind of functionality.
For context, in evolutionary biology
- everybody uses TSV instead, so there are a number of different tools that can read and analyze TSV.
- mostly people visualize and explore traces with the Java tool Tracer (GitHub - beast-dev/tracer: Posterior summarisation in Bayesian phylogenetics)
I think it's a bit unfortunate that the culture is so siloed. Most people in that community are unaware of Stan and the R and Python packages that you are all familiar with.
It's also worth stating for the record that Parquet would be a new option for output, not a strict replacement for CSV. If you want a human-readable format and are OK with the downsides, it will be available for the foreseeable future.
I would also note that even Parquet is not the optimal format, but it has many good properties: binary data and the possibility of metadata.
But there are also downsides: it only supports a table format → the user/package needs to know how to unpack variables, there are no groups, etc.
Also, given that Parquet is a column-major format, I wonder how fast it is to save 100k variables in one-draw chunks, but this is probably still better than CSV/TSV. The same problem exists with JSON unless one uses the JSON-data format.
Currently only HDF5/NetCDF/Zarr support n-dimensional objects with annotations + metadata, but I bet they would have a hard time saving 100k variables draw by draw.
The Avro format would be great, but it is quite slow to read (at least with the Python packages).
Arrow/Parquet support structured columns, which contain more than one primitive value per row.
The link to the Dremel paper here was helpful for me in understanding Parquet's support for nested values.
I'm looking forward to being able to do convergence diagnostics without needing to load all of these 100k variables into memory at once (or ever).