CmdStanPy, Prediction Workflow, and Generated Quantities

So this might not be just for cmdstanpy, but I’ve been using it a lot lately, so that’s the interface I’ll use as an example. The question is about workflow: should predictions be done using generated quantities, or purely in Python?

  • Workflow 1: User trains a model and subsequently generates predictions from the model. In this case it’s straightforward to use generated quantities. Keeping track of the csv files is easy since the output generated by CmdStanMCMC can easily be referenced by CmdStanGQ.
  • Workflow 2: User trains the model on one machine (e.g. local or development environment), and then moves to a different machine (e.g. production environment) to generate predictions. In this situation I was thinking that it makes sense for the user to do predictions in Python instead of in Stan’s generated quantities block: it’s easy to store the draws as a Python object, but it’s a bit awkward for the user to move csv files around. Thoughts?
  • Workflow 3: Similar to the above, the user trains the model on one machine (e.g. local or development environment), and then moves to a different machine (e.g. production environment) to generate predictions. But predictions in this situation involve “big data”. Is it better to work directly in Python or to use generated quantities since there’s a cost involved with moving data between Python and C++? (I’m not sure how severe this cost is though.) In this situation the data might be so big that the user might want to distribute the prediction computation on a cluster (e.g. using pyspark), which would be more straightforward to do in Python (again, because the user won’t have to keep track of csv files).
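For workflows 2 and 3, the pure-Python route amounts to shipping the posterior draws as arrays and doing the predictive computation with NumPy. A minimal sketch, assuming a simple linear model — the draws here are simulated for illustration, but in practice they would come from the fit (e.g. via CmdStanPy’s stan_variable accessor) and be serialized alongside the rest of the application state:

```python
import numpy as np

# Hypothetical posterior draws for a linear model y ~ alpha + beta * x.
# In a real workflow these arrays would be extracted from the fitted model
# on the training machine and stored as Python objects.
rng = np.random.default_rng(0)
n_draws = 4000
alpha = rng.normal(1.0, 0.1, size=n_draws)   # shape (n_draws,)
beta = rng.normal(2.0, 0.1, size=n_draws)    # shape (n_draws,)

# New data arriving on the production machine
x_new = np.array([0.0, 1.0, 2.0])            # shape (n,)

# Posterior predictive mean for every draw: shape (n_draws, n)
mu = alpha[:, None] + beta[:, None] * x_new[None, :]

# Point predictions and uncertainty intervals, entirely in Python --
# no csv files to keep track of
y_hat = mu.mean(axis=0)
y_lo, y_hi = np.quantile(mu, [0.05, 0.95], axis=0)
```

For workflow 3 the same per-draw computation parallelizes naturally: each worker in a cluster only needs a copy of the draw arrays and its shard of the new data.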

Just to be clear, this isn’t an ask to change anything in cmdstanpy. As a developer, I think it’s useful that cmdstanpy is a true wrapper around cmdstan. But as a developer it also makes it tricky to streamline the user experience when building an interface on top of cmdstanpy. Also, I could totally be missing some functionality that cmdstan/cmdstanpy has that would improve these workflows.


@mitzimorris can surely provide some thoughts about this!

thanks for the nudge - been meaning to respond.

in current CmdStanPy:

workflow 1

easy peasy - that’s why it exists

workflow 2

question: is dealing with pickled objects easier than sets of CSV files?

would a standalone-gq specialized pickle function help?

workflow 3 - predictions

predictions in this situation involve “big data”

same question - assuming these predictions are still just using the model parameters and the big data can be distributed, would standalone-gq specialized pickle function help?
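To make the pickle idea concrete: one possible shape for such a bundle would be just the parameter draws as arrays plus a bit of metadata, instead of the full set of Stan CSV files. This is purely a sketch of a hypothetical format — the field names and metadata here are made up, not an existing CmdStanPy feature:

```python
import pickle
import numpy as np

# Hypothetical bundle a standalone-gq pickle function might produce:
# parameter draws (as arrays, shaped chains x draws) plus minimal metadata.
rng = np.random.default_rng(1)
draws = {
    "alpha": rng.normal(size=(4, 1000)),
    "beta": rng.normal(size=(4, 1000)),
}
bundle = {
    "draws": draws,
    "model_name": "bernoulli",   # illustrative metadata, not a real schema
}

# Serialize on the training machine...
blob = pickle.dumps(bundle)

# ...and restore on the production machine, ready for standalone gq
restored = pickle.loads(blob)
```

The appeal over CSV files is that a single Python object travels as one blob and round-trips without the user tracking file paths; the open question is whether that is actually easier for users than copying a directory of CSVs.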
