Hi! Any chance an option could be added to the executable produced by stanc such that running the executable with “generate_quantities” produces a csv which includes all the data from the sample draws as well as those from the GQ block? That’s what stansummary expects and also what would work with shinystan, etc.
As it is now, I get a csv with only the quantities in the GQ block. I can see that being useful sometimes as well. But given how frequently workflows likely involve using tools that will want the sampler output in the same file, it seems worth putting into CmdStan rather than replicating the code to merge the csv files in every interface. Or am I missing something?
Yes, we could make this change in the Stan services layer. It wasn’t implemented this way on the principle that services should be as simple and as modular as possible. The output CSV file doesn’t contain the input data, and the sample is a kind of input data, so why output it?
CmdStanPy lets you get the sample input data as well as the new gqs via the option “inc_sample” on the methods draws, draws_pd, and draws_xr (see the API Reference in the CmdStanPy 1.0.0 documentation).
I think I’d seen that. I’d rather keep Python out of the workflow. And every cmdstanX will want this. So it seems easier to do from cmdstan. But I’ll keep it in mind.
Okay. Will that address all the R tools? I get what you are saying about modular bits but maybe just add a combiner to the cmdstan suite of tools? There must already be code to load those files to a data structure and to write them out. That could be repurposed to merge sampler and GQ output, right?
For my thing, I’ll do it from haskell. But it’s a lot of reinventing the wheel.
I took a look at CmdStan’s stansummary program and it would be possible to add an input flag like “is_standalone_gq” in which case you could get the summary stats for the generated quantities variables - is that what you need?
If so, I will file an issue for you on CmdStan.
But what I really need, what would leave the rest of the workflow intact, is the csv with the gq cols added (or replaced if the input also had those variables). That way I can summarize or load the samples into R for using shinystan and loo tools. But maybe my workflow here is unique. I’m fine just merging the csv in Haskell. It will incidentally get me one step closer to doing the summary myself instead of using stan summary and that would be a performance gain since I need only summarize the variables of interest if I do it myself.
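To illustrate the "summarize only the variables of interest" idea, here is a minimal sketch in Python (not the Haskell code mentioned above, and not what stansummary does). It assumes a Stan CSV held as text, with `#` comment lines and plain comma-separated numeric fields. The function name `summarize_columns` is hypothetical, and it reports only mean and standard deviation, not the effective sample size or R-hat that stansummary computes.

```python
import statistics
from typing import Dict, List


def summarize_columns(stan_csv_text: str, wanted: List[str]) -> Dict[str, Dict[str, float]]:
    """Mean and sd for selected columns only (hypothetical helper, illustration only)."""
    # Drop blank lines and '#' comment lines; the first remaining line is the header.
    rows = [
        line.split(",")
        for line in stan_csv_text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    header, draws = rows[0], rows[1:]

    summary = {}
    for name in wanted:
        col = header.index(name)  # raises ValueError if the column is absent
        values = [float(row[col]) for row in draws]
        summary[name] = {
            "mean": statistics.fmean(values),
            "sd": statistics.stdev(values),  # sample sd; needs at least two draws
        }
    return summary
```

Only the requested columns are converted to floats, which is where the saving would come from when the file has many columns.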
If it will help with more typical work-flows, then raising an issue in summary seems wise. But don’t do it on my behalf. That’s not where I think it needs fixing.
CmdStan uses the core Stan stan::io::stan_csv_reader to parse the Stan CSV file. This utility assembles the draws into an Eigen::Matrix named samples, the columns of which correspond to the CSV output columns.
CmdStan sends just the columns of this samples matrix that correspond to the model parameter variables to the standalone gq service, which then calls the model’s write_output_array function repeatedly, row by row of the (sliced) samples matrix.
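The slice-then-iterate flow described above can be sketched roughly as follows. This is a Python illustration only, not the C++ in CmdStan or Stan: `run_standalone_gq` and the `write_array_fn` callback are hypothetical stand-ins for the gq service and the model's write function.

```python
from typing import Callable, List, Sequence


def run_standalone_gq(
    header: Sequence[str],
    samples: Sequence[Sequence[float]],
    param_names: Sequence[str],
    write_array_fn: Callable[[List[float]], List[float]],
) -> List[List[float]]:
    """Call write_array_fn once per draw, passing only the parameter columns."""
    # Find the columns that hold model parameters; sampler columns such as
    # lp__ are not in param_names, so they are never passed along.
    cols = [header.index(name) for name in param_names]

    outputs = []
    for row in samples:
        params = [row[c] for c in cols]  # one row of the sliced matrix
        outputs.append(write_array_fn(params))
    return outputs
```

Because the callback only ever sees the sliced parameter values, the sampler columns are not available at the point where the output is written.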
While it would be possible to output the parameters and transformed parameters, it isn’t possible to output the sampler variables as well - columns lp__ et al. I suspect both loo and shinystan will complain if these columns are missing.
Trying to copy the sampler output columns into the standalone gq output is just too ugly to contemplate. Furthermore, it doesn’t really make sense, as there is no sampling done.
Bottom line: adding this feature would make the code that much more complicated and difficult to maintain. This is a consequence of building on an overloaded / brittle output format (Stan CSV) - now there’s a bunch of downstream dependencies. Extremely sorry - we made some bad choices a long time ago.
@mitzimorris
FWIW, I have this all working now. From the Haskell side I combine model and GQ data before the stan executable gets them (but keep them separate enough so the code can decide whether to re-run the model or just the GQ), and then merge the GQ columns and the sampler columns to produce output that can be used by all the downstream tools. All of it is carefully named so that file timestamps can be used to figure out what might need re-running when other things change.
None of it was that horrible. You were right that merging the sampler files was the most annoying part, but Haskell is good at parsers and I didn’t need to parse the “comment” parts, just keep them around. The rest was a little messy: tracking whether the GQ columns are replacements or additions, for example. But none of it was so bad.
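For anyone wanting to try something similar, here is a minimal sketch of that kind of merge, written in Python rather than Haskell and not taken from the actual implementation. It assumes both files are held as text, that fields are plain comma-separated values with no quoting, and that the GQ file has one row per sampler draw in the same order. The name `merge_stan_csv` is hypothetical.

```python
def merge_stan_csv(sample_text: str, gq_text: str) -> str:
    """Merge GQ columns into sampler output, keeping '#' comment lines in place."""
    # GQ file: ignore blank and comment lines; first remaining line is the header.
    gq_lines = [
        line for line in gq_text.splitlines()
        if line.strip() and not line.startswith("#")
    ]
    gq_header = gq_lines[0].split(",")
    gq_draws = [line.split(",") for line in gq_lines[1:]]

    out = []
    header = None
    position = {}
    new_names = []
    draw_idx = 0
    for line in sample_text.splitlines():
        if not line.strip() or line.startswith("#"):
            out.append(line)  # comments are not parsed, just kept where they were
            continue
        fields = line.split(",")
        if header is None:
            header = fields
            position = {name: i for i, name in enumerate(header)}
            # GQ columns already in the sampler file are replaced; others are appended.
            new_names = [n for n in gq_header if n not in position]
            out.append(",".join(header + new_names))
            continue
        if draw_idx >= len(gq_draws):
            raise ValueError("sampler file has more draws than GQ file")
        gq_row = dict(zip(gq_header, gq_draws[draw_idx]))
        draw_idx += 1
        for name, value in gq_row.items():
            if name in position:
                fields[position[name]] = value  # replacement column
        fields.extend(gq_row[n] for n in new_names)  # added columns
        out.append(",".join(fields))
    if draw_idx != len(gq_draws):
        raise ValueError("GQ file has more draws than sampler file")
    return "\n".join(out) + "\n"
```

The replace-or-append decision is made once, from the two header rows, so each draw only needs a lookup per GQ column.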
And now I have the delightful result that I can run the model–e.g., estimating voter turnout and preference from national CCES data–once and post-stratify over various maps without re-running the much more time-intensive model.
I’m saying all this just by way of saying that it’s not that horrible to build and maybe the CmdStanXXX could do something similar.
Having CmdStan itself somehow manage this, even just via a wrapper of some sort, might be worthwhile if anyone else has an expensive model followed by a variety of relatively inexpensive post-stratifications (or other GQ work).
Also, though I agree that “no sampling is done” for the GQ run, it remains true that there is a sample that corresponds to each GQ result, and many uses/tools need or expect both to do anything useful. I’m not sure how else to address that, except to be able to somehow produce csv with the samples and GQ. But YMMV, obviously.
I just want to support you by stating my personal opinion that the current gq method should be replaced by a more general method which, among other things, would allow you to get exactly what you want. Though I don’t know when, if at all, this becomes a reality.
The current implementation was developed for a production-oriented use case, i.e., what was wanted was to generate a sample for some set of quantities of interest given a model, a set of fitted parameters and valid input data. For this use case, that the resulting CSV file contained just the generated quantities variables (plus CmdStan comment header) was a good thing - among other things, it’s a vanilla CSV format (unlike the Stan CSV formats).
That said, I just took a quick spelunking expedition through the code, and it would be possible to pass around information from the input sample, as well as output the parameter and transformed parameter values. I’d be happy to help anyone who wants to add more logic to the CmdStan and Stan layers to do this.
I can’t speak to what “more general” might be, but I do think the most return for the least effort would be an option (to the Stan executable) which would allow producing samples and GQ data in one csv file (with the sampler comments in all the weird places) or just the GQ data as it is now.
That immediately rescues all the downstream workflows using extant tools.
I’m sure there are other possibilities but I only know my workflow which relies on stansummary, and the R tools around shinystan and loo.