Parquet support would likely be added to the posterior package, which is currently used by cmdstanr and brms, supports handy presentation of draws as arrays, matrices, data frames, or rvars, and has better ESS/MCSE estimation than coda (see Comparison of MCMC effective sample size estimators). Since you mention coda, I'm curious what the key features of that package are for you (off-topic for this thread, so you can also send me a private message)?
I actually mentioned coda because I’ve seen it mentioned in several statistics papers proposing new MCMC methodology. Since I’m interested in trying to create an interoperable file format, I was thinking that if coda is widely used, then substantial adoption of the format would mean that popular diagnostic packages would need to read the format.
However, based on what Brian and Jason said, I could simply provide a function to read the file format and create the type of R objects that it works on.
(Thanks for the ESS comparison link)
Is it currently the case that diagnostic tools exist (besides the provided stansummary) that directly analyze the Stan CSV files?
Hmm. Is that the right question, though? Even if no tools exist that can load a posterior sample from a CSV, such tools probably SHOULD exist.
My understanding is that most of these rely on you getting the data into memory yourself and then providing it for analysis (this is certainly the case in Python, where a package like ArviZ can accept a cmdstanpy object, but it relies on cmdstanpy to actually populate the draws in that object from disk).
I was thinking that it should be pretty easy to convert an R data frame to whatever object one's favorite R diagnostics tool uses. Since you can read CSV using read.table, CSV would then effectively be supported.
Is this not the case?
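The same idea works without any extra packages at all: here is a minimal Python sketch (using only the standard library; the file contents are a fabricated toy example of a Stan-style CSV) showing that skipping the `#` comment lines is enough to get the draws into plain data structures.

```python
import csv

# Fabricated toy example of a Stan-style CSV; real CmdStan output has
# many more '#' configuration and timing comments.
stan_csv = """\
# model = example
lp__,theta
-7.3,0.25
-7.1,0.31
# elapsed time comment
"""

def read_stan_csv(text):
    """Read a Stan CSV, skipping '#' comment lines; returns (header, rows)."""
    lines = (l for l in text.splitlines() if not l.startswith("#"))
    reader = csv.reader(lines)
    header = next(reader)
    rows = [[float(x) for x in row] for row in reader if row]
    return header, rows

header, rows = read_stan_csv(stan_csv)
print(header)   # ['lp__', 'theta']
print(rows[0])  # [-7.3, 0.25]
```

From there, converting to a data frame (or whatever your tool expects) is a one-liner in pandas or R.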
In evolutionary biology, everybody uses TSV instead, so there are a number of different tools that can read and analyze TSV. Converting CSV to TSV so that it can be analyzed using those tools can be done with a simple sed command.
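For concreteness, the sed conversion can look like the sketch below (the file name is made up). This naive substitution is safe for draw files because they contain only numbers and bare column names, but it would break on general CSV with quoted fields containing commas.

```shell
# Make a tiny draws file, then convert commas to tabs with sed.
# (GNU sed interprets \t; on BSD sed use a literal tab or tr ',' '\t'.)
printf 'lp__,theta\n-7.3,0.25\n' > draws.csv
sed 's/,/\t/g' draws.csv > draws.tsv
cat draws.tsv
```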
The answer is “it depends”.
Does your diagnostic tool need to distinguish between “special” variables like the log density and parameters? Then it needs to know about the specific Stan convention of naming it lp__.
Does it need to know whether warmup draws are included in or excluded from the sample? Then it needs to read the incredibly Stan-specific comments in that CSV file to see whether save_warmup was true, and if so, how many warmup iterations were used.
If your tool doesn't care about those things, throwing a data.table in is probably correct and fine (and I think you would do the same if you got a data.table from a parquet file). But if it does, you need something to do specific processing or extraction.
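As a sketch of what that "specific processing" involves, here is a minimal stdlib-only parser for the `key = value` comment lines CmdStan writes at the top of its CSV output. The header excerpt is fabricated for illustration; real files carry many more configuration comments.

```python
import re

# Fabricated excerpt of a CmdStan CSV comment header.
header_comments = """\
# method = sample (Default)
#   sample
#     num_samples = 1000 (Default)
#     num_warmup = 1000 (Default)
#     save_warmup = 1
"""

def parse_stan_config(text):
    """Extract 'key = value' pairs from '#' comment lines."""
    config = {}
    for line in text.splitlines():
        m = re.match(r"#\s*(\w+)\s*=\s*(\S+)", line)
        if m:
            config[m.group(1)] = m.group(2)
    return config

cfg = parse_stan_config(header_comments)
save_warmup = cfg.get("save_warmup") == "1"
num_warmup = int(cfg.get("num_warmup", 0))
print(save_warmup, num_warmup)  # True 1000
```

A tool that wants to drop warmup draws would then skip the first `num_warmup` data rows when `save_warmup` is set.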
Sure. Anything can export CSV, but a generic CSV reader isn’t going to support that kind of functionality.
For context, in evolutionary biology
- everybody uses TSV instead, so there are a number of different tools that can read and analyze TSV.
- mostly people visualize and explore traces with the Java tool Tracer (GitHub - beast-dev/tracer: Posterior summarisation in Bayesian phylogenetics)
I think it's a bit unfortunate that the culture is so siloed. Most people in that community are unaware of Stan and the R and Python packages that you are all familiar with.
It's also worth stating for the record that Parquet would be a new option for output, not a strict replacement for CSV. If you want a human-readable format and are OK with the downsides, it will be available for the foreseeable future.
I would also note that even Parquet is not the optimal format, but it has many good properties: binary data and the possibility of metadata.
But there are also downsides: it only supports a table format → the user/package needs to know how to unpack variables, there are no groups, etc.
Also, given that Parquet is a column-major format, I wonder how fast it is to save 100k variables in one-draw chunks, but this is probably still better than CSV/TSV. The same problem exists with JSON unless one uses the JSON-data format.
Currently only HDF5/NetCDF/Zarr support n-dimensional objects with annotations + metadata, but I bet they would have a hard time saving 100k variables draw by draw.
The Avro format would be great, but it is quite slow to read (at least with the Python packages).
Arrow/Parquet support structured columns, which contain more than one primitive value per row.
The link to the Dremel paper here was helpful for me in understanding Parquet's support for nested values.
I'm looking forward to being able to do convergence diagnostics without needing to load all of these 100k variables into memory at once (or ever).