CmdStanPy, Prediction Workflows, and Generated Quantities

So this might not be specific to CmdStanPy, but I’ve been using it a lot lately, so that’s the interface I’ll use as an example. The question is about workflow, and whether predictions should be done using generated quantities or purely in Python.

  • Workflow 1: The user trains a model and subsequently generates predictions from it. In this case it’s straightforward to use generated quantities: keeping track of the csv files is easy, since the output generated by CmdStanMCMC can be referenced directly by CmdStanGQ. (See the sketch after this list.)
  • Workflow 2: The user trains the model on one machine (e.g. a local or development environment) and then moves to a different machine (e.g. a production environment) to generate predictions. In this situation I was thinking it makes sense for the user to do predictions in Python instead of in Stan’s generated quantities block: it’s easy to store the draws as a Python object, whereas moving csv files between machines is a bit awkward for users. Thoughts?
  • Workflow 3: As in workflow 2, the user trains the model on one machine and moves to a different machine to generate predictions, but here the predictions involve “big data”. Is it better to work directly in Python, or to use generated quantities, given that there’s a cost to moving data between Python and C++? (I’m not sure how severe this cost is, though.) The data might also be so big that the user wants to distribute the prediction computation on a cluster (e.g. using PySpark), which would be more straightforward in Python, again because the user won’t have to keep track of csv files.
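
To make workflows 1 and 2 concrete, here’s a rough sketch, assuming a simple linear-regression model in regression.stan with parameters beta and sigma (note that the keyword for passing the fit to generate_quantities has varied across CmdStanPy versions: mcmc_sample in older releases, previous_fit in newer ones):

```python
import numpy as np
from cmdstanpy import CmdStanModel

# Stand-in training data for the sketch.
x_train = np.random.randn(100, 3)
y_train = x_train @ np.array([1.0, -0.5, 0.25]) + np.random.randn(100)
data = {"N": 100, "K": 3, "x": x_train, "y": y_train}

model = CmdStanModel(stan_file="regression.stan")
fit = model.sample(data=data)

# Workflow 1: same machine. Standalone generated quantities can reference
# the fit's CSV output directly.
gq = model.generate_quantities(data=data, mcmc_sample=fit)

# Workflow 2: different machines. Pull the draws into numpy and persist
# them, so the production side never has to track Stan's CSV files.
beta = fit.stan_variable("beta")    # shape (num_draws, K)
sigma = fit.stan_variable("sigma")  # shape (num_draws,)
np.savez("draws.npz", beta=beta, sigma=sigma)
```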

Just to be clear, this isn’t an ask to change anything in CmdStanPy. As a developer, I think it’s useful that CmdStanPy is a true wrapper around CmdStan. But also as a developer, that makes it tricky to streamline the user experience when building an interface on top of CmdStanPy. Also, I could totally be missing some functionality in CmdStan/CmdStanPy that would improve these workflows.


@mitzimorris can surely provide some thoughts about this!

thanks for the nudge - been meaning to respond.

in current CmdStanPy:

workflow 1

easy peasy - that’s why it exists

workflow 2

question: is dealing with pickled objects easier than sets of CSV files?

would a standalone-gq specialized pickle function help?

workflow 3 - predictions

predictions in this situation involve “big data”

same question - assuming these predictions still just use the model parameters and the big data can be distributed, would a standalone-gq specialized pickle function help?


So I thought about this some more and realized that my issue is less about how to move the files around (pickle, HDF5, etc.) and more about whether I should use generated quantities (using the csv files) or just write out the prediction math in Python in the prod environment (using the saved draws). Either way, something has to be moved between the dev and prod environments, whether it’s the csv files or the draws.

Writing out the prediction math in Python (see the sketch after this list) means that,

  • I don’t have to convince infrastructure engineers to install/maintain CmdStanPy (or another interface), and can use numpy/pandas/pyspark for the prediction computation (since those are typically already available in a prod environment).
  • I don’t have to deal with large amounts of data being moved between the Stan interface language and C++.
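
For a concrete picture, here’s a minimal sketch of the pure-Python route, assuming the linear-regression draws saved in the first sketch above (beta with shape (num_draws, K), sigma with shape (num_draws,)); only numpy is needed in the prod environment:

```python
import numpy as np

draws = np.load("draws.npz")
beta, sigma = draws["beta"], draws["sigma"]

x_new = np.random.randn(500, beta.shape[1])  # stand-in for production inputs

# Posterior predictive: for each draw d, mu = x_new @ beta[d] and
# y_rep ~ normal(mu, sigma[d]); vectorized over draws below.
mu = x_new @ beta.T                                 # (N_new, num_draws)
y_rep = np.random.normal(mu, sigma[np.newaxis, :])  # (N_new, num_draws)
y_pred_mean = y_rep.mean(axis=1)                    # predictive mean per row
```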

Using generated quantities (see the sketch after this list) means that,

  • I can use the Stan language for the predictions, making the code more readable and similar to what was used to develop the model.
  • I get to take advantage of C++ to compute the predictions.
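
And here is the same prediction via standalone generated quantities, again just a sketch: the GQ program has to declare the same parameters block as the fitted model, and accessor/argument names vary a bit across CmdStanPy versions. This reuses fit and x_new from the sketches above:

```python
from pathlib import Path
from cmdstanpy import CmdStanModel

gq_program = """
data {
  int<lower=0> K;
  int<lower=0> N_new;
  matrix[N_new, K] x_new;
}
parameters {
  vector[K] beta;
  real<lower=0> sigma;
}
generated quantities {
  vector[N_new] y_pred;
  for (n in 1:N_new)
    y_pred[n] = normal_rng(x_new[n] * beta, sigma);
}
"""
Path("predict.stan").write_text(gq_program)

gq_model = CmdStanModel(stan_file="predict.stan")
new_data = {"K": 3, "N_new": 500, "x_new": x_new}
preds = gq_model.generate_quantities(data=new_data, mcmc_sample=fit)
y_pred = preds.stan_variable("y_pred")  # accessor names vary by version
```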

Not sure if anyone has looked into which of these is more efficient from an implementation and/or computation-speed point of view. (It could be that my use case is too specific.)

What we would need is a Python library that brings Stan functions to Python (e.g. with pybind11), which could then be imported.

If it could also compile user-defined functions (and cache them), that would be great.

Then it would be “safer” to move predictions from generated quantities to Python (e.g. using ArviZ).
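
Purely as an illustration of what that could look like, a hypothetical sketch (the module stan_functions and everything in it is invented here; nothing like this exists today):

```python
import stan_functions as sf  # hypothetical pybind11 bindings to Stan math

# Hypothetical: built-in Stan math functions exposed to Python.
lp = sf.normal_lpdf(1.2, 0.0, 1.0)

# Hypothetical: user-defined Stan functions compiled on the fly and cached.
fns = sf.compile_functions("my_functions.stan")
y_pred = fns.predict(x_new, beta, sigma)  # a user-defined Stan function
```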

just filed an issue on CmdStanPy: https://github.com/stan-dev/cmdstanpy/issues/318

CmdStanPy’s generate_quantities method should be able to handle this workflow - and we need better documentation and examples of how to do this.


the problem is that CmdStan is fundamentally a file-driven interface. if you’re using different machines for fitting a model and running predictions, and you want to use Stan for both (for the reasons you mention above - readable code, efficient computation), then you need installations on both machines, and you need to ship the posterior draws from the fitted model to the production machine.
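
a sketch of that shipping step with CmdStanPy’s own file-based tools, assuming a recent version (save_csvfiles and from_csv in particular), and reusing predict.stan and new_data from the earlier sketch:

```python
# on the fitting machine: write the fit's CSV files to a known directory,
# then copy that directory to the production machine.
fit.save_csvfiles(dir="fit_csvs")

# on the production machine: reassemble the fit from the CSV files and run
# the standalone generated quantities against it.
import cmdstanpy

fit = cmdstanpy.from_csv("fit_csvs")
model = cmdstanpy.CmdStanModel(stan_file="predict.stan")
preds = model.generate_quantities(data=new_data, mcmc_sample=fit)
```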

we should be able to make this something that an ML-pipeline engineer can handle - I’ll take another look at what CmdStan and the underlying Stan service require w/r/t the input csv.

update: took another look, writeup here: https://github.com/stan-dev/cmdstanpy/issues/318#issuecomment-718984675