JSON Output for STAN

Hi,

I’m working on a JSON output format for MCMC and other Monte Carlo samplers. It has a basic Python tool to convert back and forth to TSV format. I’ve opened issues to highlight future work.

GitHub - bredelings/MCON

I saw that there was previously a proposal for JSON output format for STAN, here:

https://github.com/stan-dev/design-docs/blob/ebd39a0a43d1abc5ab1cda700f2f290e31efb27c/designs/0007-json_writer.md

Can I ask where that proposal stands at the moment?

-BenRI


The JSON portions of that proposal were for the outputs that are not draws from the posterior, but things like the metric from adaptation or stepsize and other diagnostic information.

Those things are still being implemented. The other portions of that design doc (around binary output for the tabular data) are still ongoing.


The other portions of that design doc (around binary output for the tabular data) are still ongoing.

Hijacking this thread: where can I give an opinion on this? I want to strongly advocate for Parquet.


The design doc had already been merged, but Apache Arrow/Parquet was selected as the primary candidate for binary tabular data :)


The JSON portions of that proposal were for the outputs that are not draws from the posterior, but things like the metric from adaptation or stepsize and other diagnostic information.

In the document I linked to, the JSON example has keys for “param_names” and “samples”. A binary format is listed as an “alternative”.
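To make the shape of such a document concrete, here is a minimal sketch in Python. Only the key names “param_names” and “samples” come from the linked document; the one-row-per-draw layout, the parameter names, and the helper `columns_from_doc` are assumptions made for illustration.

```python
import json

# Hypothetical document: only the two key names are taken from the thread.
# The layout (one inner list per draw) is an assumption.
doc_text = (
    '{"param_names": ["mu", "sigma"], '
    '"samples": [[0.1, 1.2], [0.3, 0.9], [-0.2, 1.1]]}'
)


def columns_from_doc(text):
    """Return {parameter name: list of draws} from the assumed layout."""
    doc = json.loads(text)
    names = doc["param_names"]
    rows = doc["samples"]
    # Transpose the row-per-draw table into one column per parameter.
    return {name: [row[i] for row in rows] for i, name in enumerate(names)}


columns = columns_from_doc(doc_text)
print(columns["mu"])
```

With this layout the parameter names are stored once, rather than once per draw.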

In any case:

  • would anyone be interested in JSON output, in addition to CSV or binary output?
  • that document mentions using Boost.JSON for JSON output. I’ve used nlohmann::json previously, but Boost.JSON seems possibly superior to that, and also faster than rapidjson. Are there any plans to use a particular library for JSON output for the stuff that is not posterior samples?

See this closed design doc for JSON as output: JSON Sampling output by rok-cesnovar · Pull Request #14 · stan-dev/design-docs · GitHub

Among other things, a big issue that brought things to a halt was that there was no consensus on how to represent special values (NaN, Inf, …) in JSON so that it would work with the most popular packages to read JSON in Python and R.


Apologies, I thought you had linked to https://github.com/stan-dev/design-docs/blob/master/designs/0032-stan-output-formats.md, which is a design-doc which was accepted and does include JSON (but not for samples)


The issues with Inf/-Inf/NaN came up in the Boost.JSON project as well. The discussion here is pretty informative:

 https://github.com/boostorg/json/issues/397

One additional approach that is mentioned there is to write 1e9999 for Inf and -1e99999 for -Inf. Those are both valid JSON, and most parsers read those as Inf and -Inf respectively. That still leaves a somewhat unsatisfying null for NaN.

It looks like the Boost.JSON project now defaults to writing (1e999999,-1e99999,null) but allows the caller to choose to write (Infinity, -Infinity, NaN).

The parser has been changed to read Infinity, -Infinity, and NaN, following Python. It seems like this is a slight superset of JSON and shouldn’t change the meaning of any JSON documents.

So, I’ll propose (1e999999,-1e99999,null) for writing out samples. Although I like (Infinity, -Infinity, NaN) slightly better. What do you think?
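As a rough check of how the two conventions behave in one common reader, here is a sketch using Python’s standard `json` module. The helper names `encode_value` and `encode_row` are made up for illustration, and this says nothing about how R packages or other parsers behave.

```python
import json
import math

# Overflowing literals are valid JSON numbers; Python parses them as infinities.
pos, neg, missing = json.loads("[1e99999, -1e99999, null]")

# Python's reader also accepts the non-standard tokens by default.
ext = json.loads("[Infinity, -Infinity, NaN]")

# Python's writer emits the non-standard tokens by default ...
default_text = json.dumps([math.inf, -math.inf, math.nan])

# ... and refuses to write special values at all in strict mode.
try:
    json.dumps([math.inf], allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False


def encode_value(x):
    """Encode one scalar using the (1e99999, -1e99999, null) convention."""
    if x is None or (isinstance(x, float) and math.isnan(x)):
        return "null"
    if x == math.inf:
        return "1e99999"
    if x == -math.inf:
        return "-1e99999"
    return json.dumps(x)


def encode_row(row):
    """Encode a flat list of scalars as a strict-JSON array."""
    return "[" + ", ".join(encode_value(x) for x in row) + "]"


line = encode_row([1.5, math.inf, -math.inf, math.nan])
back = json.loads(line)
print(line, back)
```

Note that under the first convention NaN does not survive a round trip in this sketch: it comes back as `None`, so a reader has to decide whether `null` means NaN or a missing value.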

I believe there are often two complaints about CSV as an output format:

  1. Being text based, you have to choose between accuracy and file size
  2. It is annoying to have to flatten objects with rich structure like matrices in order to output them as one row

The pros are that it is human readable (ish), and reasonably simple and fast to read into memory in most languages.
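The first complaint can be made concrete with a small Python sketch. The choice of six significant digits is only an illustrative assumption, not a statement about any particular sampler’s settings.

```python
import math

x = math.pi

# Short text: fewer bytes, but the value read back is not the value written.
short = f"{x:.6g}"
# Full text: repr() gives the shortest string that round-trips exactly.
full = repr(x)

short_round_trips = float(short) == x
full_round_trips = float(full) == x

print(short, len(short), short_round_trips)
print(full, len(full), full_round_trips)
```

Here the exact representation takes more than twice as many characters as the truncated one, which is the accuracy versus file size trade-off for any text format.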

I think JSONL addresses #2 above and is probably about as human readable, but I think it is less convenient to read in and has issues like the lack of full floating point support. It arguably makes #1 worse, since the field names are repeated for every single draw.
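A small Python sketch of the difference, using one made-up draw. The dotted column names and the row-major ordering are assumptions for illustration only and are not meant to reproduce any tool’s exact CSV header.

```python
import csv
import io
import json

# One hypothetical draw with a scalar and a 2x2 matrix.
draw = {"mu": 0.5, "Sigma": [[1.0, 0.2], [0.2, 2.0]]}


def flatten(name, value):
    """Yield (column name, scalar) pairs for arbitrarily nested lists."""
    if isinstance(value, list):
        for i, item in enumerate(value, start=1):
            yield from flatten(f"{name}.{i}", item)
    else:
        yield name, value


# CSV: the structure has to be flattened into one row of scalars.
pairs = [pair for key, value in draw.items() for pair in flatten(key, value)]
header = [name for name, _ in pairs]
row = [value for _, value in pairs]

buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerow(header)
writer.writerow(row)
csv_text = buf.getvalue()

# JSONL: one JSON object per line keeps the nesting, but repeats the keys.
jsonl_line = json.dumps(draw)

print(csv_text)
print(jsonl_line)
```

In the CSV form the names appear once in the header, while every JSONL line carries `"mu"` and `"Sigma"` again.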

Something like parquet addresses both complaints, but is no longer human readable. It seems to have really quite good language support, though since it is not text based it is a bit less portable I think.

The refactors necessary to support something like parquet should make it reasonably easy to plug in your own JSONL writer, but I still don’t think it is likely to end up as an officially supported format in Stan.


Hell yeah, always good taste around here

I’m looking at Parquet – very interesting. It looks like it would be possible to represent vectors of variable size using the “REPEATED” repetition type.

If Parquet output is added to Stan, would the Stan team work on getting parquet input added to packages like coda?

Btw, see here for a branch of cmdstan that uses Arrow


One that I’m going to have to update soon to keep using it 🤦


Pardon my ignorance, but would I be correct to surmise that both Arrow and Parquet can handle variable-shape records (e.g. records that contain a variable-length vector)?

I’m a bit concerned about a situation where every probabilistic programming software package defines its own output format, and that output can only be analyzed by a separate tool released along with the MCMC sampler.

I see for example that PyMC can “save” samples in an HDF5 format.

6. Saving and managing sampling results — PyMC 2.3.6 documentation

It’s not clear that anything besides PyMC can read these files, whereas CSV – and hopefully the JSON format I’m proposing – would be readable by lots of different MCMC diagnostics programs.

I guess if Stan wants to write out Parquet format, then it might be helpful to have a formal specification of what Stan is writing (that goes beyond merely the Parquet spec) so that it would be possible to write an interoperable MCMC diagnostics package.

I guess this adds a 3rd point to Brian Ward’s list:
3. A clear specification allowing interoperability, and some effort to make sure that other MCMC diagnostics packages can load the output.

Is this on the radar?

Is it currently the case that diagnostics tools exist (besides the provided stansummary) that directly analyze the Stan CSV files?

My understanding is that most of these rely on you getting the data in memory yourself and then providing it for analysis (this is certainly the case in Python, where a package like ArviZ can accept a cmdstanpy object, but it relies on cmdstanpy to actually populate the draws in that object from disk).

The same is currently true in R in the vast majority of cases.

That said, I’m not opposed to a standardized format across MCMC packages. Maybe that would open the door to newer tools that make use of that. But currently I agree with Brian about the state of things in at least R and Python.


Well, I don’t think this is a particularly big problem, as so far it seems projects are sticking with popular formats that have libraries for read/write in most languages.

For example, hdf5/netcdf is one of the most widely used formats in lots of scientific disciplines. (It’s a bit dated now, and I’ve explicitly encountered it bottlenecked by high initialisation overhead for models with thousands of parameters.)

It’s hard enough to get consensus within a project on format, so I think it’d be folly to attempt standardization across formats beyond the status quo of individual projects opting for things that are mature enough to have solid support across languages.


PyMC 2 is quite old.

Current PyMC mostly uses the InferenceData structure for its sample output.

https://python.arviz.org/en/latest/getting_started/XarrayforArviZ.html#xarray-for-arviz

https://python.arviz.org/en/stable/schema/schema.html#schema