JSON Output for STAN

bredelings · September 12, 2023, 6:08pm

Hi,

I’m working on a JSON output format for MCMC and other Monte Carlo samplers. It has a basic Python tool to convert to and from TSV format. I’ve opened issues to highlight future work.

GitHub - bredelings/MCON

I saw that there was previously a proposal for JSON output format for STAN, here:

https://github.com/stan-dev/design-docs/blob/ebd39a0a43d1abc5ab1cda700f2f290e31efb27c/designs/0007-json_writer.md

Can I ask where that proposal stands at the moment?

-BenRI

WardBrian · September 12, 2023, 6:21pm

The JSON portions of that proposal were for the outputs that are not draws from the posterior, but things like the metric from adaptation or stepsize and other diagnostic information.

Those things are still being implemented. The other portions of that design doc (around binary output for the tabular data) are still ongoing.

laifuthegreat · September 12, 2023, 8:09pm

The other portions of that design doc (around binary output for the tabular data) are still ongoing.

Hijacking this thread: where can I give an opinion on this? I want to strongly advocate for parquet

WardBrian · September 12, 2023, 8:15pm

The design doc had already been merged, but Apache Arrow/Parquet was selected as the primary candidate for binary tabular data :)

bredelings · September 12, 2023, 8:42pm

The JSON portions of that proposal were for the outputs that are not draws from the posterior, but things like the metric from adaptation or stepsize and other diagnostic information.

In the document I linked to, the JSON example has keys for “param_names” and “samples”. A binary format is listed as an “alternative”.

In any case:

Would anyone be interested in JSON output, in addition to CSV or binary output?

That document mentions using Boost.JSON for JSON output. I’ve used nlohmann::json previously, but Boost.JSON seems possibly superior to that, and also faster than rapidjson. Are there any plans to use a particular library for JSON output for the stuff that is not posterior samples?

rok_cesnovar · September 13, 2023, 7:13am

See this closed design doc for JSON as output: JSON Sampling output by rok-cesnovar · Pull Request #14 · stan-dev/design-docs · GitHub

Among other things, a big issue that brought things to a halt was the lack of consensus on how to represent special values (NaN, Inf, …) in JSON so that it would work with the most popular packages for reading JSON in Python and R.

WardBrian · September 13, 2023, 3:38pm

Apologies, I thought you had linked to https://github.com/stan-dev/design-docs/blob/master/designs/0032-stan-output-formats.md, which is a design-doc which was accepted and does include JSON (but not for samples)

bredelings · September 13, 2023, 4:33pm

The issues with Inf/-Inf/NaN came up in the Boost.JSON project as well. The discussion here is pretty informative:

 https://github.com/boostorg/json/issues/397

One additional approach that is mentioned there is to write 1e99999 for Inf and -1e99999 for -Inf. Those are both valid JSON, and most parsers read those as Inf and -Inf respectively. That still leaves a somewhat unsatisfying null for NaN.

It looks like the Boost.JSON project now defaults to writing (1e99999, -1e99999, null) but allows the caller to choose to write (Infinity, -Infinity, NaN).

The parser has been changed to read Infinity, -Infinity, and NaN, following Python. It seems like this is a slight superset of JSON and shouldn’t change the meaning of any JSON documents.

So, I’ll propose (1e99999, -1e99999, null) for writing out samples, although I like (Infinity, -Infinity, NaN) slightly better. What do you think?
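
For anyone who wants to check how one reader behaves, here is a minimal sketch using only Python's standard `json` module. It shows what Python does with the overflowing literals and with the extended tokens, and how a writer could emit the (1e99999, -1e99999, null) convention. The helper `encode_special` is a hypothetical name made up for this illustration; it is not part of Boost.JSON, Stan, or MCON, and other parsers (R packages in particular) may behave differently.

```python
import json
import math

# Overflowing literals are syntactically valid JSON numbers; Python's
# float conversion turns them into +/- infinity.
pos_inf = json.loads("1e99999")
neg_inf = json.loads("-1e99999")

# Python's parser also accepts the non-standard tokens by default.
extended = json.loads("[Infinity, -Infinity, NaN]")

# By default json.dumps writes the non-standard tokens ...
default_text = json.dumps([math.inf, -math.inf, math.nan])

# ... and with allow_nan=False it raises ValueError instead.
try:
    json.dumps(math.nan, allow_nan=False)
    strict_rejects = False
except ValueError:
    strict_rejects = True


def encode_special(x):
    """Hypothetical helper: encode one float as (1e99999, -1e99999, null)."""
    if math.isnan(x):
        return "null"
    if math.isinf(x):
        return "1e99999" if x > 0 else "-1e99999"
    return repr(x)


values = [1.5, math.inf, -math.inf, math.nan]
portable_text = "[" + ", ".join(encode_special(v) for v in values) + "]"

# Reading it back: infinities survive, NaN comes back as None.
round_trip = json.loads(portable_text)
```

Note that under this convention a reader has to decide for itself whether a `null` in a numeric column means NaN, which is the unsatisfying part mentioned above.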

WardBrian · September 13, 2023, 9:29pm

I believe there are often two complaints about CSV as an output format:

1. Being text based, you have to choose between accuracy and file size
2. It is annoying to have to flatten objects with rich structure like matrices in order to output them as one row

The pros are that it is human readable (ish), and reasonably simple and fast to read into memory in most languages.
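
The accuracy-versus-size trade-off in the first complaint can be seen with a rough Python sketch. The value and the choice of 6 versus 17 significant digits are arbitrary illustrations, not a statement about what any particular Stan interface writes.

```python
import math

x = math.pi * 1e-3  # arbitrary example value

short = format(x, ".6g")   # compact text, 6 significant digits
full = format(x, ".17g")   # 17 significant digits round-trip any double

# The short form is smaller on disk but does not recover the same double.
short_ok = float(short) == x
full_ok = float(full) == x

print(short, len(short), short_ok)
print(full, len(full), full_ok)
```

The longer text is the price of an exact round trip, and the same trade-off applies to any text format, JSON included.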

I think JSONL addresses #2 above and is probably about as human readable, but I think it is less convenient to read in and has issues like the lack of full floating point support. It arguably makes #1 worse, since the field names are repeated for every single draw.
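
To make the comparison concrete, here is a small Python sketch that writes the same two draws as flattened CSV and as JSON Lines. The parameter names `mu` and `Sigma`, the `Sigma.i.j` column naming, and the row-major flattening order are assumptions made for this illustration and are not meant to match Stan's CSV layout exactly.

```python
import csv
import io
import json

# Two hypothetical draws: a scalar "mu" and a 2x2 matrix "Sigma".
draws = [
    {"mu": 0.1, "Sigma": [[1.0, 0.2], [0.2, 1.0]]},
    {"mu": 0.3, "Sigma": [[0.9, 0.1], [0.1, 1.1]]},
]


def flatten(draw):
    """Flatten one draw into scalar columns named like Sigma.1.2 (1-based)."""
    row = {"mu": draw["mu"]}
    for i, matrix_row in enumerate(draw["Sigma"], start=1):
        for j, value in enumerate(matrix_row, start=1):
            row[f"Sigma.{i}.{j}"] = value
    return row


# CSV: the header appears once, then one flat row per draw.
rows = [flatten(d) for d in draws]
csv_buffer = io.StringIO()
writer = csv.DictWriter(csv_buffer, fieldnames=list(rows[0]), lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = csv_buffer.getvalue()

# JSON Lines: one object per draw, nested structure preserved,
# but the field names are written again on every line.
jsonl_text = "".join(json.dumps(d) + "\n" for d in draws)

# Reading JSONL back recovers the matrix without any reshaping step.
restored = [json.loads(line) for line in jsonl_text.splitlines()]
```

With CSV the reader has to rebuild the matrix from the column names; with JSONL the structure comes back for free, at the cost of repeating the keys once per draw.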

Something like parquet addresses both complaints, but is no longer human readable. It seems to have really quite good language support, though since it is not text based it is a bit less portable I think.

The refactors necessary to support something like parquet should make it reasonably easy to plug in your own JSONL writer, but I still don’t think it is likely to end up as an officially supported format in Stan

laifuthegreat · September 14, 2023, 2:04pm

Hell yeah, always good taste around here

bredelings · September 14, 2023, 2:41pm

I’m looking at Parquet – very interesting. It looks like it would be possible to represent vectors of variable size using the “REPEATED” repetition type.

If Parquet output is added to Stan, would the Stan team work on getting parquet input added to packages like coda?

mike-lawrence · September 14, 2023, 2:51pm

Btw, see here for a branch of cmdstan that uses Arrow

sakrejda · September 14, 2023, 3:13pm

One that I’m going to have to update soon to keep using it 🤦

bredelings · September 14, 2023, 4:01pm

Pardon my ignorance, but would I be correct to surmise that both Arrow and Parquet can handle variable-shape records (e.g. records that contain a variable-length vector)?

bredelings · September 14, 2023, 4:13pm

I’m a bit concerned about a situation where every probabilistic programming software package defines their own output format, and that output can only be analyzed by a separate tool released along with the MCMC sampler.

I see for example that PyMC can “save” samples in an HDF5 format.

6. Saving and managing sampling results — PyMC 2.3.6 documentation

It’s not clear that anything besides PyMC can read these files, whereas CSV – and hopefully the JSON format I’m proposing – would be readable by lots of different MCMC diagnostics programs.

I guess if Stan wants to write out Parquet format, then it might be helpful to have a formal specification of what Stan is writing (that goes beyond merely the Parquet spec) so that it would be possible to write an interoperable MCMC diagnostics package.

I guess this adds a 3rd point to Brian Ward’s list:
3. A clear specification allowing interoperability, and some effort to make sure that other MCMC diagnostics packages can load the output.

Is this on the radar?

WardBrian · September 14, 2023, 4:46pm

Is it currently the case that diagnostics tools exist (besides the provided stansummary) that directly analyze the Stan CSV files?

My understanding is that most of these rely on you getting the data in memory yourself and then providing it for analysis (this is certainly the case in Python, where a package like ArviZ can accept a cmdstanpy object, but it relies on cmdstanpy to actually populate the draws in that object from disk).

jonah · September 14, 2023, 5:57pm

The same is currently true in R in the vast majority of cases.

jonah · September 14, 2023, 6:00pm

That said, I’m not opposed to a standardized format across MCMC packages. Maybe that would open the door to newer tools that make use of that. But currently I agree with Brian about the state of things in at least R and Python.

mike-lawrence · September 14, 2023, 8:12pm

Well, I don’t think this is a particularly big problem, as so far it seems projects are sticking with popular formats that have libraries for read/write in most languages.

For example, hdf5/netcdf is one of the most widely used formats in lots of scientific disciplines. (It’s a bit dated now, and I’ve explicitly encountered it bottlenecked by high initialisation overhead for models with thousands of parameters.)

It’s hard enough to get consensus within a project on format, so I think it’d be folly to attempt standardization across formats beyond the status quo of individual projects opting for things that are mature enough to have solid support across languages.

ahartikainen · September 15, 2023, 7:54am

PyMC 2 is quite old.

Current PyMC mostly uses the InferenceData structure for its sample output.

https://python.arviz.org/en/latest/getting_started/XarrayforArviZ.html#xarray-for-arviz

https://python.arviz.org/en/stable/schema/schema.html#schema
