Bottom-Line Up Front: I'm struggling to use Stan and Python together, especially when I want to test out different models and do posterior-predictive checks. I cannot use R and its excellent Stan ecosystem, so I was wondering how other folks manage Python (+ pandas) and Stan. My workflow usually looks like this:
- transform a pandas DataFrame into a Stan-compatible dictionary (usually tedious)
- run the sampling and get the model fit (easy!)
- use summaries of my posterior distributions and generated quantities to compare model estimates against the original data (very tedious)
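For the first step, a minimal sketch of what I mean by "tedious": Stan wants a flat dict of arrays with explicit sizes and 1-based integer indices for factors. The column names and data here are hypothetical stand-ins:

```python
import pandas as pd

# Toy data standing in for the real dataset (columns are hypothetical)
df = pd.DataFrame({
    "y": [1.2, 0.7, 3.1, 2.2],
    "group": ["a", "a", "b", "b"],
})

# Stan wants integer factor indices starting at 1, plus explicit sizes
group = pd.Categorical(df["group"])
stan_data = {
    "N": len(df),
    "J": len(group.categories),           # number of groups
    "group": (group.codes + 1).tolist(),  # 1-based indices for Stan
    "y": df["y"].tolist(),
}
print(stan_data)
```

Every model variant needs its own version of this mapping, and I have to remember which category code maps to which factor level when I come back to interpret the fit.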
This workflow is further complicated when I want to try different models, check my priors, run more generated quantities, etc. For example, testing a partially-pooled GLM vs. a non-pooled GLM adds substantial overhead to my code if I want to do any PPCs.
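To make the PPC overhead concrete, here is the kind of check I end up hand-rolling per model with plain numpy. The replicated draws here are simulated rather than pulled from a real fit, and the names (`y_rep`, `T_obs`) are my own conventions, not anything Stan produces for you:

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these are posterior-predictive draws of shape (n_draws, N),
# as you'd extract from a fit's generated quantities block
y_obs = np.array([1.2, 0.7, 3.1, 2.2])
y_rep = rng.normal(loc=y_obs.mean(), scale=1.0, size=(4000, y_obs.size))

# Compare a test statistic on replicated data against the observed value
T_obs = y_obs.mean()
T_rep = y_rep.mean(axis=1)
ppc_pvalue = (T_rep >= T_obs).mean()  # posterior-predictive p-value
print(ppc_pvalue)
```

Multiply this by every test statistic and every candidate model, and the bookkeeping grows quickly.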
I think the crux of this issue is translating between my data and Stan's data structures (input data, sampled values, generated quantities). The tidybayes docs explain it well:
> Extracting tidy draws from the model. This often means extracting indices from parameters with names like `"b[1,2]"` into separate columns of a data frame, like `i = c(1,1,..)` and `j = c(1,2,...)`. More tediously, sometimes these indices actually correspond to levels of a factor in the original data; e.g. `"x"` might correspond to a value of `x` for the first level of some factor.
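This is exactly the step I keep rewriting by hand in Python. A minimal sketch of unpacking a `b[i,j]`-style parameter into a tidy long DataFrame, mapping the first index back to its factor levels; the draws are simulated and `group_levels` is a hypothetical factor, but the shape matches what e.g. CmdStanPy's `fit.stan_variable("b")` would return:

```python
import numpy as np
import pandas as pd

# Hypothetical posterior draws for a parameter b with shape
# (n_draws, n_groups, n_coefs)
rng = np.random.default_rng(1)
b = rng.normal(size=(100, 2, 3))
group_levels = ["a", "b"]  # factor levels behind the first index

# Expand every (draw, i, j) index combination into its own row
draw, i, j = np.indices(b.shape).reshape(3, -1)
tidy = pd.DataFrame({
    "draw": draw,
    "group": np.array(group_levels)[i],  # map index back to factor level
    "coef": j + 1,                       # keep Stan's 1-based j
    "b": b.ravel(),
})
print(tidy.head())
```

This works, but it is exactly the kind of index bookkeeping that tidybayes automates in R and that I have to redo for every parameter and every model.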
Unfortunately, R is not an option at my workplace, which means I can't use all of the excellent tools that have been developed for R. (It's not an option because nobody uses R, and there isn't a strong enough data analysis contingent to justify adding it to our technical stack.)
So my question is: how do folks use Python (+ pandas) with Stan, especially when they have to explore different models and make frequent use of the generated quantities block?