Multiple data file input

I’ve been building a Haskell wrapper around some of Stan, using CmdStan as the interface. One thing I’ve been slowly trying to get right is caching data and results so that I don’t re-run things when not required. This comes up for me particularly when I want to post-stratify different data using the same model result. What would make this simpler (and I will handle it from the Haskell side for now) would be if the Stan executable produced by stanc would accept multiple data files as input, combine them by merging the dictionaries, error on duplicate entries, and otherwise proceed as it would with a single merged dictionary.

This would allow putting the model data and the PS data (really any data only required by the generated quantities block) in separate json files.
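For concreteness, here is a minimal sketch of the merge semantics described above, written in Python with only the standard library. The function names `merge_stan_data` and `write_merged` are hypothetical and are not part of CmdStan or any wrapper; the sketch only assumes that each input file holds a JSON object at the top level.

```python
import json


def merge_stan_data(*paths):
    """Merge several JSON data files into one dict, failing on duplicate keys."""
    merged = {}
    for path in paths:
        with open(path) as f:
            data = json.load(f)
        # Stan data files are JSON objects mapping variable names to values.
        if not isinstance(data, dict):
            raise ValueError(f"{path}: top-level JSON value must be an object")
        # Any key already seen in an earlier file counts as a duplicate entry.
        duplicates = sorted(merged.keys() & data.keys())
        if duplicates:
            raise ValueError(f"{path}: duplicate entries {duplicates}")
        merged.update(data)
    return merged


def write_merged(out_path, *paths):
    """Write the merged dictionary to out_path, for use as the single data file."""
    with open(out_path, "w") as f:
        json.dump(merge_stan_data(*paths), f)
```

A wrapper could call something like `write_merged("merged.json", "model_data.json", "ps_data.json")` and pass the merged file as the single `data file=` argument, so the executable itself needs no changes.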

What’s nice about that, aside from not needing a lot of redundant data in files for multiple post-stratifications, is one can then use file modification time comparisons to decide between re-running the entire model (when the model data has changed) and just running the generated quantities block against the previous results (when only the PS data is new).
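The modification-time comparison could look roughly like the following Python sketch. The function `plan_run`, its return values, and the idea of tracking one fit output file and one generated-quantities output file are assumptions made for illustration; this is not how haskell-stan or any existing wrapper is known to do it.

```python
import os


def mtime(path):
    """Modification time in seconds, or None if the file does not exist."""
    try:
        return os.path.getmtime(path)
    except OSError:
        return None


def plan_run(model_data, ps_data, fit_output, gq_output):
    """Return "full", "gq", or "none" based on file modification times."""
    fit_time = mtime(fit_output)
    # No previous fit, or the model data is newer than the fit: refit everything.
    if fit_time is None or os.path.getmtime(model_data) > fit_time:
        return "full"
    gq_time = mtime(gq_output)
    # Fit is current; rerun only generated quantities if they are missing or stale.
    if gq_time is None or gq_time < max(fit_time, os.path.getmtime(ps_data)):
        return "gq"
    return "none"
```

With separate files, touching only the PS data makes `plan_run` return "gq", while touching the model data returns "full". With a single combined file the two cases cannot be told apart by timestamp alone.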

I could imagine a fancier interface with a new “GQ_file=” sort of thing, but I’m not sure that’s necessary or what use the executable could make of it. But the convenience feature of allowing multiple files and trivially merging the json is likely straightforward and would be completely backwards compatible.

Does this seem like a reasonable thing to add?

Adam

this is awesome! I’m not sure how differently things would be architected for Haskell, but you might consider looking at CmdStanPy and/or CmdStanR, which share a basic organization and which have been developed with the goal of being as similar as possible in terms of functionality and names of things.

maybe this is pedantic, but the stan executable is the result of compiling the model.hpp file produced by stanc together with the CmdStan command.hpp wrapper which handles the input file. unfortunately, the current CmdStan argument parser only allows a single data input file, and changing this is impossible (or rather, far too much work to be worth it).

the CmdStan wrapper interfaces provide methods which translate in-memory data dictionaries to JSON input files - e.g. API Reference — CmdStanPy 1.2.0 documentation. basically, the wrapper interfaces do a lot of the file mgmt and timestamp checking.

this is a great use case, but it would require tracking the original input data, which is not done in CmdStanPy, because there’s a tension between providing a usable, convenient general-purpose interface and a full-fledged workflow tracker.

Thanks for the reply!
Why is it impossible in the argument parser? Given that it’s entirely backward compatible–except for not erroring when a previous version would have–I would think it wouldn’t cause much trouble. But I’d love to understand why that’s not the case!

Anyway, I guess I’ll just handle it on the Haskell side. I build the json from whatever data sources are in use, each coming from some collection of row-like data. I’ll build it as two files and then merge them in something temporary before calling the model.

Which, I guess, is what you’re saying the other wrappers do (or would do) as well.

If I get further along, and cover more of Stan–the Haskell stuff has pieces of a “transpiler”, producing the stan code from a Haskell DSL, and it is woefully incomplete (as well as less type-safe than I would like)–I’ll certainly consider re-architecting or adding a wrapper to more closely match the other CmdStanXXX, though I’ve not looked at them yet!

The core part of haskell-stan doesn’t track input data either, but I have a caching/dependency layer which does, and it uses the timestamp on the json to decide if the input data has changed. Right now, that means I re-run the entire model when just the PS data has changed. Anyway, I can work around it; I just figured other people might have this exact issue, which would make it worth changing in the CmdStan interface. But obviously not if that’s difficult!

Thanks again.

Why would that be impossible? Do you mean the parser cannot allow multiple arguments with the same name? Like, the following is always an error:

./bernoulli sample data file=data1.json file=data2.json

(actually, looks like the parser accepts that, it just ignores data1.json…)

The data file is an arbitrary string though, you could just make it a comma-separated list of filenames

./bernoulli sample data file=data1.json,data2.json

Although that’s kind of backwards-incompatible if you had a filename with a comma in it.

CmdStan needs to keep its arg parser for backwards compatibility.

there is a much better C++ arg parser available - cli11 - which is now plugged into CmdStan’s stansummary and there’s this branch of CmdStan which is a proof-of-concept but needs work: GitHub - stan-dev/cmdstan at feature/929-stansummary-cli11

I have, in the past, made changes to CmdStan arg parser code, and I have fixed bugs in it. the data structures are over-factored and brittle, the accompanying unit tests are beyond baroque (rococo). given this, and based on past attempts to improve the CmdStan interface, I’m pretty sure we’re in for a lot of work and a torrent of objections. much better to accept what we’ve got and move on.