CmdStan utility "collate"?

have we discussed a “collate” utility before - one that collects the output.csv files from all chains for one run of the sampler and collates the draws into a single data file?

a simple version of this utility would go a long way towards making the standalone generated quantities workflow smoother.

a sophisticated version of this utility would produce an output file organized properly for downstream analysis - possibly no longer human readable, column-major format, etc.

That sounds good.

I assumed people were already doing the cat trick, so I also implemented that in arviz.

cat output*.csv > combined_output.csv

https://arviz-devs.github.io/arviz/generated/arviz.from_cmdstan.html#arviz.from_cmdstan

Edit: the cat trick is a bit rough, but collate could also add columns, which is a more advanced method.

right. expansion of *.csv may well contain stray csv files. also a problem for the collate utility - it needs to check the header of all files to make sure that configs match.
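
A rough sketch of what that header check could look like, in Python. It assumes the config shows up as `# key = value` comment lines at the top of each file; the function names and the default set of ignored keys (`id`, `file`, which would differ per chain) are made up for illustration and are not part of CmdStan.

```python
import re

# Matches comment lines of the form "#   key = value" (assumed layout).
_CONFIG_LINE = re.compile(r"^#\s*([A-Za-z_][\w.]*)\s*=\s*(.*?)\s*$")


def read_config(path):
    """Collect key/value pairs from the '#' comment lines of one file."""
    config = {}
    with open(path) as f:
        for line in f:
            m = _CONFIG_LINE.match(line)
            if m:
                # Keep the first occurrence of each key.
                config.setdefault(m.group(1), m.group(2))
    return config


def configs_match(paths, ignore=("id", "file")):
    """True if all files share the same config, ignoring per-chain keys.

    The `ignore` default is a hypothetical choice, not a CmdStan rule.
    """
    def relevant(path):
        return {k: v for k, v in read_config(path).items() if k not in ignore}

    first = relevant(paths[0])
    return all(relevant(p) == first for p in paths[1:])
```

A stray csv with no comment header comes back with an empty config, so it would fail the comparison against real sampler output.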

Good idea. Are you imagining a utility function like bin/stansummary?

Like Ari, I’ve been working on a thin (few dependencies) post-processing workflow to cmdstan. As from_cmdstan is to arviz, read_stan is to stanflow. Instead of collate, I chose the word combine, e.g. https://github.com/roualdes/stanflow/blob/master/examples/dirichlet/dirichlet.ipynb

To help solve the stray csv files problem, the workflow I imagine for stanflow has a helper bash script, stan, that writes the csvs to a model dependent output directory. read_stan reads only csvs from this output directory.

Stanflow looks good.

yes, exactly.

what Ari said - stanflow looks good!

the spec for cmdstanpy - https://github.com/stan-dev/design-docs/blob/master/designs/0002-cmdstanpy_func_spec.md
defines a sample command which returns a RunSet object that contains the names of the per-chain stan-csv files, so it’s easy to get just the csv files for a given run.

this API requires the user to specify a name for the output csv files - no defaults. is this too unpythonic to contemplate?

there’s a corresponding branch in the cmdstanpy repo that has the wrappers to compile a model and run the sampler implemented. wrapping the cmdstan utilities stansummary and diagnose should be fairly trivial. it’s the last step - creating a PosteriorSample object in a way that’s efficient for downstream processing - that’s the concern.

So you’d cat them all together

cat a.csv b.csv c.csv d.csv > abcd.csv
gqs ... abcd.csv

instead of

gqs ... a.csv b.csv c.csv d.csv

Is the idea that rather than specifying a bunch of .csv files, you’d just specify one for CmdStan?

Does anything need to be done other than concatenation assuming we can ignore all the rest of the comments?

What really needs to happen is that the whole CSV parser needs to be refactored into a comment parser and CSV parser. But then we’re going to take the structured stuff and write it out with real structure, so probably no point in doing this [rewriting csv parser].

yes, that’s exactly right. assumption is that, continuing with above example, files a.csv ... d.csv correspond to running 4 chains.

the only thing needed besides concatenation is checking that all files have the same number of rows and columns and that the column names match.
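
For what it's worth, a minimal sketch of the simple version in Python, assuming comment lines start with `#` and the first non-comment line of each file holds the column names. The `collate` function and the leading `chain` column are hypothetical, just one way of "adding columns" as suggested earlier in the thread, not an existing CmdStan feature.

```python
import csv


def read_draws(path):
    """Return (column_names, rows) from one file, skipping '#' comments."""
    with open(path, newline="") as f:
        data_lines = [ln for ln in f if ln.strip() and not ln.startswith("#")]
    rows = list(csv.reader(data_lines))
    if not rows:
        raise ValueError(f"{path}: no header row found")
    return rows[0], rows[1:]


def collate(paths, out_path):
    """Write all draws to out_path with a leading `chain` column.

    Returns the total number of draws written.
    """
    if not paths:
        raise ValueError("no input files given")
    columns, n_draws, chains = None, None, []
    for path in paths:
        header, draws = read_draws(path)
        if columns is None:
            columns, n_draws = header, len(draws)
        elif header != columns:
            raise ValueError(f"{path}: column names differ from {paths[0]}")
        elif len(draws) != n_draws:
            raise ValueError(f"{path}: expected {n_draws} draws, found {len(draws)}")
        if any(len(row) != len(columns) for row in draws):
            raise ValueError(f"{path}: a row has the wrong number of columns")
        chains.append(draws)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["chain"] + columns)
        # Chains are numbered in the order the files were given.
        for chain_id, draws in enumerate(chains, start=1):
            for row in draws:
                writer.writerow([chain_id] + row)
    return len(chains) * n_draws
```

Something like `collate(["a.csv", "b.csv", "c.csv", "d.csv"], "abcd.csv")` would then stand in for the cat step. Unlike plain cat, the header row is written only once and the comment lines are dropped. The extra `chain` column means the result is no longer a plain concatenation of the inputs.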