CmdStan utility "collate"?

mitzimorris · February 7, 2019, 5:58pm

have we discussed a “collate” utility before which collects the output from the output.csv files for all chains for one run of the sampler and collates the draws into a single data file?

a simple version of this utility would go a long way towards making the standalone generated quantities workflow smoother.

a sophisticated version of this utility would produce an output file organized properly for downstream analysis - possibly no longer human readable, column-major format, etc.

ahartikainen · February 7, 2019, 6:32pm

That sounds good.

I assumed people were already using the cat trick, so I also implemented that in arviz.

cat output*.csv > combined_output.csv

https://arviz-devs.github.io/arviz/generated/arviz.from_cmdstan.html#arviz.from_cmdstan

Edit: the cat trick is a bit rough, but collate could also add columns, which is a more advanced method.
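A rough sketch of what "adding columns" might look like, assuming each Stan CSV file is made of `#` comment lines, one header row, and draw rows. The function name `collate_with_chain` and the `chain` column are hypothetical, not something CmdStan or arviz provides.

```python
import csv


def collate_with_chain(paths, out_path):
    """Combine per-chain CSV files into one file with a leading 'chain' column.

    Assumes lines starting with '#' are comments and the first remaining
    line of each file is the header row. Hypothetical helper, not a CmdStan API.
    """
    header = None
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        for chain_id, path in enumerate(paths, start=1):
            with open(path, newline="") as f:
                # Drop comment and blank lines before CSV parsing.
                rows = csv.reader(
                    line for line in f
                    if line.strip() and not line.startswith("#")
                )
                file_header = next(rows, None)
                if file_header is None:
                    raise ValueError(f"{path}: no header row found")
                if header is None:
                    # Write the header once, from the first file.
                    header = file_header
                    writer.writerow(["chain"] + header)
                elif file_header != header:
                    raise ValueError(f"{path}: column names differ from first file")
                for row in rows:
                    writer.writerow([chain_id] + row)
```

Unlike plain `cat`, this keeps a single header row and records which file each draw came from.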

mitzimorris · February 7, 2019, 6:51pm

right. the expansion of *.csv may well pick up stray csv files. that's also a problem for the collate utility - it needs to check the headers of all files to make sure that the configs match.
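One way the config check could be sketched, assuming the configuration is written as `# key = value` comment lines before the header row. The helper names are hypothetical, and the default list of keys allowed to differ between chains (`id`, `file`) is a guess rather than a documented rule.

```python
def read_config(path, ignore=("id", "file")):
    """Collect 'key = value' pairs from the leading '#' comment lines.

    Assumes the config block precedes the header row. Keys listed in
    `ignore` are expected to differ between chains (an assumption).
    Nested sections are flattened, so repeated keys overwrite each other.
    """
    config = {}
    with open(path) as f:
        for line in f:
            if not line.startswith("#"):
                break  # reached the header row; stop reading
            key, sep, value = line.lstrip("#").partition("=")
            if sep and key.strip() not in ignore:
                config[key.strip()] = value.strip()
    return config


def configs_match(paths):
    """Return True if every file has the same config as the first file."""
    first = read_config(paths[0])
    return all(read_config(p) == first for p in paths[1:])
```

A stray csv file from a different run would most likely fail this comparison, or have no config comments at all.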

roualdes · February 7, 2019, 8:58pm

Good idea. Are you imagining a utility function like bin/stansummary?

Like Ari, I’ve been working on a thin (few dependencies) post-processing workflow for cmdstan. As from_cmdstan is to arviz, read_stan is to stanflow. Instead of collate, I chose the word combine, e.g. https://github.com/roualdes/stanflow/blob/master/examples/dirichlet/dirichlet.ipynb

To help solve the stray csv files problem, the workflow I imagine for stanflow has a helper bash script, stan, that writes the csvs to a model dependent output directory. read_stan reads only csvs from this output directory.
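The directory-per-model idea can be illustrated with a few lines of Python. This is only a sketch of the concept: the layout `output_root/model_name/*.csv` and the function `chain_files` are assumptions for illustration, not how stanflow is actually implemented.

```python
from pathlib import Path


def chain_files(output_root, model_name):
    """Return the sorted CSV files found directly under output_root/model_name.

    Hypothetical layout: each model writes its chains to its own directory,
    so CSV files elsewhere are never picked up.
    """
    model_dir = Path(output_root) / model_name
    return sorted(model_dir.glob("*.csv"))
```

Because the glob is scoped to one directory, stray csv files elsewhere on disk are ignored.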

ahartikainen · February 7, 2019, 9:10pm

Stanflow looks good.

mitzimorris · February 7, 2019, 10:08pm

yes, exactly.

what Ari said - stanflow looks good!

mitzimorris · February 7, 2019, 10:45pm

the spec for cmdstanpy - https://github.com/stan-dev/design-docs/blob/master/designs/0002-cmdstanpy_func_spec.md
defines a sample command which returns a RunSet object that contains the names of the per-chain stan-csv files, so it’s easy to get just the csv files for a given run.

this API requires the user to specify a name for the output csv files - no defaults. is this too unpythonic to contemplate?

there’s a corresponding branch in the cmdstanpy repo that has the wrappers to compile a model and run the sampler implemented. wrapping the cmdstan utilities stansummary and diagnose should be fairly trivial. it’s the last step - creating a PosteriorSample object in a way that’s efficient for downstream processing - that’s the concern.

Bob_Carpenter · February 8, 2019, 12:53am

So you’d cat them all together

cat a.csv b.csv c.csv d.csv > abcd.csv
gqs ... abcd.csv

instead of

gqs ... a.csv b.csv c.csv d.csv

Is the idea that rather than specifying a bunch of .csv files, you’d just specify one for CmdStan?

Does anything need to be done other than concatenation assuming we can ignore all the rest of the comments?

What really needs to happen is that the whole CSV parser needs to be refactored into a comment parser and CSV parser. But then we’re going to take the structured stuff and write it out with real structure, so probably no point in doing this [rewriting csv parser].

mitzimorris · February 8, 2019, 2:29am

yes, that’s exactly right. assumption is that, continuing with above example, files a.csv ... d.csv correspond to running 4 chains.

the utility would check that all files have the same number of rows and columns and that the column names match.
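Those checks might be sketched as follows, again assuming `#` lines are comments and the first remaining line is the header row. `check_shapes` is a hypothetical name, not an existing CmdStan utility.

```python
import csv


def check_shapes(paths):
    """Verify that all files share column names and number of draws.

    Returns (header, n_draws) on success and raises ValueError otherwise.
    Assumes '#' lines are comments. Hypothetical helper for illustration.
    """
    expected = None
    for path in paths:
        with open(path, newline="") as f:
            rows = list(csv.reader(
                line for line in f
                if line.strip() and not line.startswith("#")
            ))
        if not rows:
            raise ValueError(f"{path}: no header row found")
        header, draws = rows[0], rows[1:]
        # Every draw must have exactly one value per column.
        if any(len(row) != len(header) for row in draws):
            raise ValueError(f"{path}: row width does not match header")
        summary = (header, len(draws))
        if expected is None:
            expected = summary
        elif summary != expected:
            raise ValueError(f"{path}: columns or row count differ from first file")
    return expected
```

For the example above, `check_shapes(["a.csv", "b.csv", "c.csv", "d.csv"])` would either return the shared column names and draw count or name the first file that disagrees.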
