How do folks use Python (+ pandas) and Stan?

Bottom Line Up Front: I’m struggling to use Stan and Python together, especially when I want to test out different models and do posterior-predictive checks. I can’t use R and its excellent Stan ecosystem, so I was wondering how other folks manage Python (+ pandas) and Stan.

I’m having trouble integrating Stan into my Python workflow, even with great tools like CmdStanPy and PyStan. I frequently struggle when doing posterior-predictive checks (PPCs), where I have to:

  • transform a pandas DataFrame into a Stan-compatible dictionary (usually tedious; see the sketch after this list)
  • run the sampling and get the model fit (easy!)
  • use summaries of my posterior distributions and generated quantities to compare model estimates against the original data (very tedious)
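
For concreteness, here’s a minimal sketch of the first two steps. The model file and its data names (model.stan, N, J, group, y) are hypothetical stand-ins; the factor encoding is the part that tends to be tedious:

```python
import pandas as pd
from cmdstanpy import CmdStanModel

# toy data: a factor column and a response
df = pd.DataFrame({
    "group": ["a", "b", "a", "c", "b"],
    "y": [1.2, 0.7, 1.5, 0.3, 0.9],
})

# encode the factor as 1-based integer indices for Stan
group = pd.Categorical(df["group"])

stan_data = {
    "N": len(df),
    "J": len(group.categories),
    "group": group.codes + 1,  # Stan indexes from 1
    "y": df["y"].to_list(),
}

model = CmdStanModel(stan_file="model.stan")  # hypothetical model file
fit = model.sample(data=stan_data)
```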

This workflow is further complicated when I want to try different models, check my priors, run more generated quantities, etc. For example, testing a partially-pooled GLM vs. a non-pooled GLM adds substantial overhead to my code if I want to do any PPCs.

I think the crux of this issue is moving between my data and Stan’s data structures (input data, sampled values, generated quantities). I think the tidybayes documentation explains it well:

Extracting tidy draws from the model. This often means extracting indices from parameters with names like "b[1,1]", "b[1,2]" into separate columns of a data frame, like i = c(1,1,...) and j = c(1,2,...). More tediously, sometimes these indices actually correspond to levels of a factor in the original data; e.g. "x[1]" might correspond to a value of x for the first level of some factor.
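
In pandas terms, that extraction looks something like this (a sketch; the draws table below is a toy stand-in for something like CmdStanPy’s fit.draws_pd() output):

```python
import pandas as pd

# toy stand-in for a table of posterior draws with indexed parameter names
draws = pd.DataFrame({
    "b[1,1]": [0.1, 0.2],
    "b[1,2]": [0.3, 0.4],
    "b[2,1]": [0.5, 0.6],
})

# melt to long format, then split the "b[i,j]" names into index columns
long = draws.melt(var_name="name", value_name="value", ignore_index=False)
long = long.rename_axis("draw").reset_index()
long[["i", "j"]] = long["name"].str.extract(r"b\[(\d+),(\d+)\]").astype(int)
```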

Unfortunately, R is not an option at my workplace, which means I can’t use all of the excellent tools that have been developed for R. (It’s not an option because nobody uses R, and there isn’t a strong-enough data analysis contingent to justify adding it to our technical stack).

So my question is: how do folks use Python (+ pandas) with Stan, especially when they have to explore different models and make frequent use of the generated quantities block?

ArviZ might help you with this.

As for doing generated quantities stuff, I think you are probably going to have the same problems with R (unless you use brms or a similar interface).

Is the problem that you have multidimensional inputs in a table format, and transforming them back to an ndarray creates some “overhead”?

https://arviz-devs.github.io/arviz/index.html
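
For example, something like this should pull a CmdStanPy fit into ArviZ for summaries and PPC plots (a sketch, reusing the fit and stan_data from the sketch upthread; y_rep is a hypothetical generated-quantities variable holding replicated data):

```python
import arviz as az

# fit: a CmdStanMCMC object from model.sample(...)
idata = az.from_cmdstanpy(
    posterior=fit,
    posterior_predictive="y_rep",
    observed_data={"y": stan_data["y"]},
)

az.summary(idata)                              # posterior summaries by variable
az.plot_ppc(idata, data_pairs={"y": "y_rep"})  # posterior-predictive check
```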

I hadn’t heard of ArviZ, but it looks promising!

I see what you’re saying about the generated quantities. I’m not familiar with brms.

And, yes, most of the overhead comes from turning my data (a pandas DataFrame) into a dictionary that I can pass to a Stan model (e.g. via cmdstanpy.CmdStanModel) and transforming the Stan output so it is compatible with my original data (as a DataFrame). A common example is extracting the indices from parameters with names like b[1,2] into separate columns of a DataFrame and mapping them back to the original data.
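
For the mapping-back step, I end up writing something like this (a sketch; the level names are toy values that would have been recorded when the data dictionary was built):

```python
import pandas as pd

# factor levels recorded when the Stan data dictionary was built
levels = pd.Index(["a", "b", "c"])

# long: tidy draws with a 1-based Stan index column "i"
long = pd.DataFrame({"i": [1, 2, 3, 1], "value": [0.1, 0.2, 0.3, 0.4]})

# map each Stan index back to its original factor level
long["group"] = levels[long["i"] - 1]
```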

I can think of a few different ways to handle this with Python, but I find myself having to write very different code each time I want to try out a new model on the same data. I was wondering if others have had the same problems.

Agreed, this is tedious.
Regarding the particularly tedious job of extracting parameter indices from names like b[1,2], the latest version of CmdStanPy has methods which let you retrieve the fitted estimates as a named variable, i.e., the structure of the estimate corresponds to the structure of the Stan program variable. The relevant methods are CmdStanMCMC.stan_variable() (one variable, as an ndarray shaped like the Stan variable) and CmdStanMCMC.stan_variables() (a dict of all of them).
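
For instance (a sketch, where fit is the CmdStanMCMC object returned by model.sample()):

```python
# fit: a CmdStanMCMC object returned by model.sample(...)
b = fit.stan_variable("b")        # ndarray of draws, shaped like Stan's b
all_vars = fit.stan_variables()   # dict of variable name -> ndarray
```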

However, you’ll need ArviZ or a similar package to do the downstream analysis.
