How do folks use python (+ pandas) and stan?

Bottom-Line Up Front: I’m struggling to use stan and python together, especially when I want to test out different models and do posterior-predictive checks. I cannot use R and its excellent stan-ecosystem, so I was wondering how other folks manage python (+ pandas) and stan.

I’m having trouble integrating stan into my python workflow, even with great tools like cmdstanpy and PyStan. I frequently struggle when doing posterior-predictive checks (PPCs), where I have to:

  • transform a pandas dataframe to a stan-compatible dictionary (usually tedious)
  • run the sampling and get the model fit (easy!)
  • use summaries of my posterior distributions, and generated quantities to compare model estimates against the original data (very tedious)

This workflow is further complicated when I want to try different models, check my priors, run more generated quantities, etc. For example, testing a partially-pooled GLM vs. a non-pooled GLM adds substantial overhead to my code if I want to do any PPCs.

I think the crux of this issue is going between my data and stan-data (input data, sampled values, generated quantities). I think tidybayes explains it well:

Extracting tidy draws from the model. This often means extracting indices from parameters with names like "b[1,1]" , "b[1,2]" into separate columns of a data frame, like i = c(1,1,..) and j = c(1,2,...) . More tediously, sometimes these indices actually correspond to levels of a factor in the original data; e.g. "x[1]" might correspond to a value of x for the first level of some factor.

Unfortunately, R is not an option at my workplace, which means I can’t use all of the excellent tools that have been developed for R. (It’s not an option because nobody uses R, and there isn’t a strong-enough data analysis contingent to justify adding it to our technical stack).

So my question is: how do folks use python (+ pandas) with stan, especially when they have to explore different models and make frequent use of the generated quantities block?

ArviZ might help you with this.

As for doing generated quantities stuff, I think you are probably going to have same problems with R (unless you use brms or similar interface).

Is the problem that you have multidimensional inputs in a table format, and then transforming them back to ndarray creates some “overhead”?

I haven’t heard of ArviZ, but it looks promising!

I see what you’re saying about the generated quantities. I’m not familiar with brms.

And, yes, most of the overhead comes from turning my data (a pandas dataframe) into a dictionary that I can pass to a stan model (e.g. via cmdstanpy.CmdStanModel) and transforming the stan-output so it is compatible with my original data (as a dataframe). A common example is extracting indices from paramters with names like b[1,2] into separate columns of a dataframe, and mapping them back to their original data.

I can think of a few different ways to handle this with python, but I find myself having to write much different code each time I want to try out a new model on the same data. I was wondering if others have had the same problems.

agreed this is tedious.
regarding the particularly tedious job of extracting parameter indices from names like b[1,2], the latest version of CmdStanPy has methods which allow you to retrieve the fitted estimates as a named variable, i.e., the structure of the estimate corresponds to the structure of the Stan program variable. the relevant methods are:

however, you need Arviz or similar to do downstream analysis

1 Like