To help solve the stray CSV files problem, the workflow I imagine for stanflow has a helper bash script, stan, that writes the CSVs to a model-dependent output directory. read_stan then reads only the CSVs from that output directory.
This API requires the user to specify a name for the output CSV files, with no defaults. Is this too unpythonic to contemplate?
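To make the required-name idea concrete, here is a minimal sketch of what that file layout could look like. The function name, directory layout, and file-naming scheme are all illustrative assumptions, not the actual stanflow API:

```python
import os

def output_path(model_name, run_name, chain, base_dir="output"):
    # Hypothetical layout: one subdirectory per model, a user-supplied
    # run name, and one CSV per chain. There is deliberately no default
    # for run_name: the caller must name every run explicitly.
    return os.path.join(base_dir, model_name, "{}_{}.csv".format(run_name, chain))

print(output_path("bernoulli", "run1", chain=0))
# output/bernoulli/run1_0.csv
```

With this scheme, read_stan would only ever scan output/bernoulli/ for a given model, so stray CSVs elsewhere in the working directory are never picked up.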
There’s a corresponding branch in the cmdstanpy repo that has the wrappers to compile a model and run the sampler implemented. Wrapping the CmdStan utilities stansummary and diagnose should be fairly trivial. The concern is the last step: creating a PosteriorSample object in a way that’s efficient for downstream processing.
Is the idea that rather than specifying a bunch of .csv files, you’d just specify one for CmdStan?
Does anything need to be done other than concatenation, assuming we can ignore all the rest of the comments?
What really needs to happen is that the whole CSV parser needs to be refactored into a comment parser and a CSV parser. But since we’re going to take the structured stuff and write it out with real structure anyway, there’s probably no point in rewriting the CSV parser.